
Variance


When you find yourself analyzing a new sample, there are different aspects you might be interested in. The most popular one is the center of the data, which is normally described through the mean and the median. However, there is more to care about, and in this topic you'll get to know one of the most important aspects: the dispersion of your data.

Measuring the variability in the data isn't straightforward, and there are several approaches. Here, you'll explore the variance as a kind of average deviation in the data. You'll build the concept from scratch, meet related concepts like the standard deviation, and even use a programming language to simplify the computations.

Data dispersion

When we talk about data, the mean often takes the spotlight as the go-to summary statistic. But a single number that represents the center of the data isn't enough to understand your sample. It doesn't tell you anything about the relationship between the data points, their dispersion, or their behavior away from the center.

You can notice this easily by examining some real data. The following three samples have the same mean, 30. If you only take this into account, you might imagine them as being quite similar, right? Take a look at the samples and note their differences:

Different samples

Although the samples share the mean, they behave differently around it. The main difference is their dispersion. The green sample is tightly concentrated around 30, the purple one has a somewhat wider spread, and the orange sample is really scattered. As a result, the mean doesn't tell you the true story of your data!
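You can reproduce this effect with Python's built-in statistics module. The three samples below are hypothetical (not the ones in the figure), but they all share the mean 30 while spreading out very differently:

```python
import statistics

# Hypothetical samples, all with mean 30 but increasingly dispersed
green = [29, 30, 31, 30, 30]    # tightly concentrated
purple = [24, 28, 30, 32, 36]   # somewhat wider spread
orange = [5, 15, 30, 45, 55]    # really scattered

for sample in (green, purple, orange):
    # Same center, very different variability
    print(statistics.mean(sample), round(statistics.stdev(sample), 2))
```

The means are identical, yet the standard deviations (a dispersion measure you'll meet below) grow from sample to sample.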

How can you measure data dispersion? At first sight it might look like a difficult task. The variance is the statistic that quantifies this variability, and the key idea is to measure how far each data point is from the mean. Let's build this new idea!

Building the definition

Take a random sample x_1, x_2, \dots, x_n of size n with mean \bar{x}. The first step is to determine the deviation of each data point from the mean. The deviations from the mean are obtained by subtracting the mean from each observation:

x_1 - \bar{x}, \ x_2 - \bar{x}, \dots, x_n - \bar{x}

You can visualize the deviations from the mean by tracing a line from each observation to the mean, as in the following diagram:

Deviations from the mean

Here, there are 10 observations and their values are the points. The dashed line represents the mean, and each line segment represents the deviation of the corresponding observation. As you can see, some points are closer to the mean than others. In particular, the sixth observation has the largest deviation, whereas the fifth one coincides with the mean.

Now you can combine the deviations into a single quantity by averaging them. But don't rush! There's a tiny problem when you compute the sum:

\sum_{i=1}^n (x_i - \bar{x}) = \sum_{i=1}^n x_i - \sum_{i=1}^n \bar{x} = \sum_{i=1}^n x_i - n\,\bar{x} = \sum_{i=1}^n x_i - n\,\frac{\sum_{i=1}^n x_i}{n} = \sum_{i=1}^n x_i - \sum_{i=1}^n x_i = 0

It looks like the positive and negative deviations cancel one another out! In consequence, the average is always zero. The solution that statisticians found to this problem is to use squared deviations instead. By averaging these quantities we obtain the population variance:

\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n}

It turns out that this number works best when you have the whole population. But in real life you usually only have access to a sample. For this reason it's more common to use a slight modification known as the sample variance:

s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}

The small modification in the denominator is due to subtle aspects of statistical inference that you don't need to worry about right now. Just remember that most of the time you will be using the sample variance. Because of the squared deviations, you need to take the square root to recover the same units as the sample:

s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}

The new quantity is the famous standard deviation and you'll see it everywhere.

The sample standard deviation represents the size of a typical deviation from the sample mean in the sample.

For instance, consider the case when s = 4. In this situation, the typical distance from the mean for each observation is 4. Some are closer and some others are farther, but the general tendency is 4.
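All of the definitions above fit in a few lines of plain Python. The sample values below are made up purely for illustration; note how the raw deviations cancel out, while the squared ones don't:

```python
import math

# A made-up sample, just to exercise the formulas
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)
mean = sum(sample) / n

deviations = [x - mean for x in sample]
print(sum(deviations))                # the raw deviations cancel out: 0.0

squared = [d ** 2 for d in deviations]
variance = sum(squared) / (n - 1)     # sample variance: n - 1 in the denominator
std = math.sqrt(variance)             # standard deviation, back in the sample's units
print(round(variance, 2), round(std, 2))
```

Swapping the denominator to n instead of n - 1 would give the population variance from the previous formula.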

Measuring variability

By now you'll have realized that calculating the variance and standard deviation isn't as simple as calculating the mean. In practice you won't have to do all the computations by hand; you'll most likely use a programming language or statistical software, so don't worry. For now, let's look at an example where you practice the formulas a little and then use Python to simplify the process.

Do you like McDonald's French fries? Like some other foods, they contain acrylamide, which is a potential carcinogen. The FDA visited 7 branches and bought some French fries. After analyzing the food, the levels of acrylamide (in micrograms per kilogram of food) were:

437 \quad 232 \quad 374 \quad 142 \quad 305 \quad 287 \quad 242

The sample size is 7 and the mean is \bar{x} = 288.43. Now you can compute the deviations and the squared deviations; try to verify the following computations:

| x_i | x_i - \bar{x} | (x_i - \bar{x})^2 |
| --- | --- | --- |
| 437 | 148.57 | 22,073.47 |
| 232 | -56.43 | 3,184.18 |
| 374 | 85.57 | 7,322.47 |
| 142 | -146.43 | 21,441.33 |
| 305 | 16.57 | 274.61 |
| 287 | -1.43 | 2.04 |
| 242 | -46.43 | 2,155.61 |

Feel free to visualize the deviations as you did before:

Observations' variability

Hence, the total squared deviation is \sum_{i=1}^n (x_i - \bar{x})^2 = 56,453.71. What a huge number! This is perfectly reasonable since you're working with squared units. Now, the sample variance is:

s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1} = \frac{56,453.71}{6} = 9,408.95

By taking the square root we get that the standard deviation is 97. Since the standard deviation has the same units as the sample, it's more useful than the sample variance.

In summary, you can interpret that the levels of acrylamide are on average 288.43 and that their dispersion is 97. In other words, the sample is concentrated at 288.43 and the typical distance between the observations and the center is 97.
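Before reaching for a library, you can check the whole hand computation from scratch. The sketch below uses the unrounded mean, which is where the table values come from:

```python
import math

# Acrylamide levels from the example (micrograms per kilogram)
sample = [437, 232, 374, 142, 305, 287, 242]
n = len(sample)
mean = sum(sample) / n                       # about 288.43

squared_deviations = [(x - mean) ** 2 for x in sample]
total = sum(squared_deviations)              # total squared deviation
s2 = total / (n - 1)                         # sample variance
s = math.sqrt(s2)                            # sample standard deviation

print(round(total, 2), round(s2, 2), round(s))  # 56453.71 9408.95 97
```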

When doing statistics with Python, the numpy package is by far the preferred option. A sample is normally stored in an array this way:

import numpy as np
sample = np.array([437, 232, 374, 142, 305, 287, 242])

In order to compute the variance you can use the function np.var. By default it computes the population variance. You can get the sample variance by setting the argument ddof to 1 (by default it's set to 0):

np.var(sample)  # population variance
np.var(sample, ddof=1)  # sample variance

Computing the standard deviation is as easy as calling the function np.std with the same argument:

np.std(sample, ddof=1) 
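If you'd rather avoid a third-party dependency, the standard library's statistics module gives the same results. Note the different defaults: statistics.variance and statistics.stdev compute the sample versions, while statistics.pvariance is the population variance:

```python
import statistics

sample = [437, 232, 374, 142, 305, 287, 242]

print(round(statistics.pvariance(sample), 2))  # population variance
print(round(statistics.variance(sample), 2))   # sample variance
print(round(statistics.stdev(sample), 2))      # sample standard deviation
```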

Conclusion

Consider a sample x_1, x_2, \dots, x_n of size n with mean \bar{x}.

  • The mean \bar{x} isn't enough to understand the sample; it's also important to quantify its dispersion.

  • The deviations from the mean are defined as:

    x_1 - \bar{x}, \ x_2 - \bar{x}, \dots, x_n - \bar{x}
  • The population variance is the average of the squared deviations from the mean:

    \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n}
  • In practice, the sample variance is preferred:

    s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}
  • Since the sample variance has squared units, it's common to take its square root, known as the standard deviation:

    s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}
  • The sample standard deviation represents the size of a typical deviation from the sample mean in the sample.
