When you analyze a new sample, there are different aspects you might be interested in. The most popular one is the center of the data, which is normally described through the mean and the median. However, there are more things to care about, and in this topic you'll get to know one of the most important ones: the dispersion of your data.
Measuring the variability in the data isn't straightforward and there are several approaches. Here, you'll explore the variance as a kind of average deviation in the data. You'll build the concept from scratch, meet several related concepts like the standard deviation and even use a programming language to simplify all the computations.
Data dispersion
When we talk about data, the mean often takes the spotlight as the go-to summary statistic. But a single number that represents the center of the data isn't enough to understand your sample. It tells you nothing about how the data points relate to one another, how dispersed they are, or how they behave away from the center.
You can notice this easily by examining some real data. The following three samples have the same mean. If you only take this into account, you might imagine them as being quite similar, right? Take a look at the samples and note their differences:
Although the samples share the same mean, they behave differently around it. The main difference is their dispersion. The green sample is tightly concentrated around the mean, the purple one has a slightly wider range, and the orange sample is really scattered. As a result, the mean alone doesn't tell you the true story of your data!
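To see the same effect with numbers, here is a minimal sketch. The three arrays are made-up values chosen only for illustration (they are not the plotted samples), and the names `tight`, `medium` and `scattered` are hypothetical.

```python
import numpy as np

# Hypothetical samples, invented for illustration; all three have mean 10.
tight = np.array([9, 10, 10, 10, 11])
medium = np.array([6, 8, 10, 12, 14])
scattered = np.array([0, 2, 10, 18, 20])

for name, values in [("tight", tight), ("medium", medium), ("scattered", scattered)]:
    # Same center, very different spread (range = max - min).
    print(name, "mean:", values.mean(), "range:", values.max() - values.min())
```

The means coincide, yet the ranges differ a lot, which is exactly the information the mean hides.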
How can you measure data dispersion? At first sight it might look like a difficult task. The variance is the statistic that quantifies this variability, and the key point is to measure how far each data point is from the mean. Let's build this new idea!
Building the definition
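Take a random sample $x_1, x_2, \dots, x_n$ of size $n$ with mean $\bar{x}$. The first step is to determine the deviation of each data point from the mean. The deviations from the mean are obtained by subtracting the mean from each observation:
$$x_1 - \bar{x},\quad x_2 - \bar{x},\quad \dots,\quad x_n - \bar{x}$$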
You can visualize the deviations from the mean by tracing a line from each observation to the mean, as in the following diagram:
Here, the points are the values of the observations. The dashed line represents the mean, and each line segment is the deviation of the corresponding observation. As you can see, some points are closer to the mean than others. In particular, the sixth observation has the largest deviation, whereas the fifth one coincides with the mean.
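Computing the deviations takes one line with NumPy. The sketch below assumes a small made-up sample (the values and the variable names are hypothetical, not taken from the diagram).

```python
import numpy as np

# Hypothetical sample, invented for illustration.
sample = np.array([4.0, 7.0, 1.0, 8.0, 5.0])

mean = sample.mean()        # sample mean
deviations = sample - mean  # broadcasting subtracts the mean from every observation

print(mean)
print(deviations)
```

Positive deviations belong to observations above the mean, negative ones to observations below it, and a deviation of zero means the observation coincides with the mean.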
Now you can combine the deviations into a single quantity by averaging them. But don't rush! There's a tiny problem when you compute the sum of the deviations:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$$
The positive and negative deviations cancel one another! As a result, their average is always zero. The solution that statisticians found to this fiasco is to use squared deviations instead. By averaging these quantities over the $n$ observations we obtain the population variance:
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
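The cancellation and its fix can be checked numerically. This is only a sketch on a made-up sample (hypothetical values and names); with other data the sum of deviations may differ from zero by a tiny floating-point error.

```python
import numpy as np

# Hypothetical sample, invented for illustration.
sample = np.array([4.0, 7.0, 1.0, 8.0, 5.0])
deviations = sample - sample.mean()

# Positive and negative deviations cancel (up to floating-point rounding).
print(deviations.sum())

# Squaring removes the signs, so the average no longer collapses to zero.
population_variance = np.mean(deviations ** 2)
print(population_variance)
```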
It turns out that this number works best when you have the whole population. But in real life you only have access to a sample. For this reason it's more common to use a slight modification known as the sample variance:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
The small modification in the denominator is due to subtle aspects of statistical inference that you don't need to worry about right now. Just remember that most of the time you will be using the sample variance. Because the deviations are squared, the variance is expressed in squared units, so you need to take the square root to recover the units of the sample:
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
This new quantity is the famous standard deviation, and you'll see it everywhere.
The sample standard deviation represents the size of a typical deviation from the sample mean in the sample.
For instance, suppose a sample has standard deviation $s$. Then the typical distance between an observation and the mean is about $s$. Some observations are closer and some others are farther, but the general tendency is $s$.
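The sample variance and the standard deviation can be built step by step from their definitions. The sketch below assumes a made-up sample; the values and variable names are hypothetical.

```python
import numpy as np

# Hypothetical sample, invented for illustration.
sample = np.array([4.0, 7.0, 1.0, 8.0, 5.0])
n = sample.size

squared_deviations = (sample - sample.mean()) ** 2

sample_variance = squared_deviations.sum() / (n - 1)  # divide by n - 1, not n
sample_std = np.sqrt(sample_variance)                 # back to the original units

print(sample_variance)
print(sample_std)
```

Here the squared deviations add up to 30, so the sample variance is 30 / 4 = 7.5 and the standard deviation is its square root, roughly 2.74.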
Measuring variability
By now you'll have realized that calculating the variance and standard deviation isn't as simple as calculating the mean. Don't worry: in practice you won't have to do all the computations by hand, since you'll most likely use a programming language or statistical software. For now, let's look at an example where you practice the formulas a little and then use Python to simplify the process.
Do you like McDonald's French fries? Like some other foods, they contain acrylamide, which is a potential carcinogen. The FDA visited branches and bought some French fries. After analyzing the food, the levels of acrylamide (in micrograms per kilogram of food) are:
$$437,\quad 232,\quad 374,\quad 142,\quad 305,\quad 287,\quad 242$$
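The sample size is $n = 7$ and the mean is $\bar{x} = 2019/7 \approx 288.43$. Now you can compute the deviations and the squared deviations. Try to verify the following computations (rounded to two decimals):
$$
\begin{array}{rrr}
x_i & x_i - \bar{x} & (x_i - \bar{x})^2 \\
\hline
437 & 148.57 & 22073.47 \\
232 & -56.43 & 3184.18 \\
374 & 85.57 & 7322.47 \\
142 & -146.43 & 21441.33 \\
305 & 16.57 & 274.61 \\
287 & -1.43 & 2.04 \\
242 & -46.43 & 2155.61
\end{array}
$$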
Feel free to visualize the deviations as you did before:
Hence, the total squared deviation is approximately $56453.71$. What a huge number! This is perfectly reasonable since you're working with squared units. Now, the sample variance is:
$$s^2 = \frac{56453.71}{7 - 1} \approx 9408.95$$
By taking the square root we get that the standard deviation is $s = \sqrt{9408.95} \approx 97.00$. Since the standard deviation has the same units as the sample, it's easier to interpret than the sample variance.
In summary, you can say that the acrylamide level is about $288.43$ micrograms per kilogram on average, with a dispersion of about $97$ micrograms per kilogram. In other words, the sample is centered at $288.43$ and the typical distance between the observations and the center is about $97$.
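If you'd like to check the hand computations, the steps can be reproduced one by one. This is a sketch that uses the seven acrylamide values of this example; the intermediate variable names are just illustrative choices.

```python
import numpy as np

# Acrylamide levels from this example (micrograms per kilogram of food).
sample = np.array([437, 232, 374, 142, 305, 287, 242])

n = sample.size
mean = sample.mean()
deviations = sample - mean
squared_deviations = deviations ** 2

total_squared_deviation = squared_deviations.sum()
sample_variance = total_squared_deviation / (n - 1)
sample_std = np.sqrt(sample_variance)

print(f"n = {n}, mean = {mean:.2f}")
print(f"total squared deviation = {total_squared_deviation:.2f}")
print(f"sample variance = {sample_variance:.2f}")
print(f"sample standard deviation = {sample_std:.2f}")
```

Each printed value corresponds to one step of the manual procedure: the mean, the sum of squared deviations, the division by $n - 1$ and the final square root.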
When doing statistics with Python, the NumPy package is one of the most popular options. A sample is normally stored in an array this way:
import numpy as np
sample = np.array([437, 232, 374, 142, 305, 287, 242])
In order to compute the variance you can use the function np.var. By default it computes the population variance. You can get the sample variance by setting the argument ddof to 1 (by default it's set to 0):
np.var(sample)  # population variance
np.var(sample, ddof=1)  # sample variance
Computing the standard deviation is as easy as using the function np.std with the same argument:
np.std(sample, ddof=1)
Conclusion
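Consider a sample $x_1, x_2, \dots, x_n$ of size $n$ with mean $\bar{x}$.
The mean isn't enough to understand the sample. It is also important to quantify its dispersion.
The deviations from the mean are defined as:
$$x_i - \bar{x}, \quad i = 1, \dots, n$$
The population variance is the average of the squared deviations from the mean:
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
In practice, the sample variance is preferred:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Since the sample variance has squared units, it's common to take its square root, known as the standard deviation:
$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$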
The sample standard deviation represents the size of a typical deviation from the sample mean in the sample.
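As a quick cross-check of the summary, Python's standard library offers the `statistics` module, whose functions can be compared with NumPy. This is only a sketch on the acrylamide values; using `statistics` here is an addition for illustration, not part of the workflow described above.

```python
import statistics

import numpy as np

# Acrylamide levels from the example (micrograms per kilogram of food).
data = [437, 232, 374, 142, 305, 287, 242]

# statistics.pvariance divides by n, like np.var with the default ddof=0.
population_variance = statistics.pvariance(data)

# statistics.variance and statistics.stdev divide by n - 1, like ddof=1.
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)

print(population_variance, np.var(data))
print(sample_variance, np.var(data, ddof=1))
print(sample_std, np.std(data, ddof=1))
```

Both libraries should agree up to floating-point rounding, which is a handy way to confirm which denominator each function uses.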