In our introduction to mathematical statistics, you already worked with samples and their basic characteristics (statistics): the mean, the median, and the mode. Now you can easily tell these three apart and use them for sample analysis. However, these values are often not enough for a correct analysis. Today let's see when that happens and get acquainted with two new concepts: variance and standard deviation.
When is it not enough?
The mean, mode, and median are used not only to analyze a sample itself but also to compare samples (comparing a few numbers is much easier than comparing samples element by element). To understand why the concepts you've learned may not be informative enough, let's look at the example below.
For your convenience, the samples are sorted. Bold numbers are the middle elements (the median). Now let's inspect their mean, median, and mode. For the first sample:
For the second sample:
The results are a bit confusing, aren't they? You have two different samples with identical mean, mode, and median. This shows that considering only these three values is sometimes insufficient for comparing samples. This is where measures of spread come in.
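To see the same effect in code, here is a minimal sketch with two made-up samples. The values are hypothetical and chosen only for illustration; they are not the samples from the example above. Both share the same mean, median, and mode even though one of them is clearly more spread out.

```python
import statistics

# Hypothetical samples, chosen only to illustrate the point.
sample_a = [5, 5, 5, 5, 5]
sample_b = [1, 5, 5, 5, 9]

for name, sample in (("A", sample_a), ("B", sample_b)):
    # Print the three basic characteristics of each sample.
    print(
        name,
        statistics.mean(sample),
        statistics.median(sample),
        statistics.mode(sample),
    )
```

Both lines report 5 for all three characteristics, so these numbers alone cannot tell the samples apart.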
Standard deviation
Statistical dispersion shows how spread out or compressed the sample values are. It is a non-negative value that characterizes the diversity of the sample. The higher the value, the greater the spread between the sample elements. A value of zero means that all elements are equal.
One common measure of dispersion is the standard deviation. For a sample, it is computed by the formula:
$$\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
where $\sigma$ (read as "sigma") is the standard deviation, $n$ is the size of the sample, $x_i$ is the $i$-th sample element, and $\bar{x}$ is the sample arithmetic mean.
If you're a bit scared or confused, don't worry! Let's unpack this formula step by step.
1. First, evaluate the sample mean $\bar{x}$. You will use it as a reference value for analyzing the sample.
2. Then subtract the mean from each sample element: $x_i - \bar{x}$. For the standard deviation, what matters is not the element's value itself but how far the element is from the mean. For example, in the picture below, an observation that lies further from the mean contributes a larger "spread" to the standard deviation than one that lies closer.
3. After that, square each resulting difference: $(x_i - \bar{x})^2$. Why? In the picture above, you can see that a difference can be either positive or negative. Squaring removes the influence of the sign: after that, no value is negative.
4. Sum these squares and divide the sum by $n - 1$, where $n$ is the sample size:
$$\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
5. Take the square root. This is necessary because of the units: to remove the sign effect, we squared the summands, so if, for example, $x_i$ was measured in dollars, the result of step 4 is measured in squared dollars. A value with such units is hard to interpret relative to the original sample.
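The five steps above can be sketched in Python as follows. This is a minimal illustration, assuming the divisor in step 4 is the sample size minus one (as the later comparison with the population formula indicates). The function name `sample_std` and the input values are made up for the example.

```python
import math


def sample_std(values):
    """Sample standard deviation, computed step by step."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n                    # step 1: the mean
    deviations = [x - mean for x in values]   # step 2: distance from the mean
    squares = [d ** 2 for d in deviations]    # step 3: remove the sign
    variance = sum(squares) / (n - 1)         # step 4: sum and divide by n - 1
    return math.sqrt(variance)                # step 5: back to original units


# Hypothetical data: squared deviations sum to 32, and 32 / 4 = 8.
print(sample_std([1, 5, 5, 5, 9]))
```

For this input the result is the square root of 8, about 2.83. A sample of identical elements, such as `[5, 5, 5]`, gives 0.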
One more dispersion measure is the variance. It is simply the squared standard deviation, $\sigma^2$, that is, the value obtained at step 4 before taking the square root. Why do we need it if it is just sigma squared and has hard-to-interpret units? Many theorems in mathematical statistics are stated in terms of the variance. That is why $\sigma^2$ gained its own name.
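The relationship between the two measures is easy to check with Python's standard `statistics` module. In this small sketch the data is hypothetical; `statistics.variance` and `statistics.stdev` both use the sample convention (dividing by the sample size minus one).

```python
import math
import statistics

data = [1, 5, 5, 5, 9]  # hypothetical sample

variance = statistics.variance(data)  # sample variance
std = statistics.stdev(data)          # sample standard deviation

# The variance is the standard deviation squared (up to rounding error).
print(math.isclose(std ** 2, variance))
```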
Time to calculate
Let's evaluate the standard deviation for the first and second samples. Recall that
As said above, the standard deviation of the first sample equals $0$, since the sample consists of identical elements.
To avoid a huge formula with many squared parentheses for the second sample, first evaluate this summand:
Then the standard deviation of the second sample can be written as:
Comparing the two standard deviations, you can now see a noticeable difference between the two samples. In the first one, all elements are equal to the mean, while in the second, the sample values are diverse.
Interpret a bit.
If we were considering a larger sample, we could interpret the standard deviation visually. For roughly bell-shaped data, the range $(\bar{x} - 3\sigma, \bar{x} + 3\sigma)$ around the mean would include about 99% of the sample elements. That is why the standard deviation is used to characterize the sample: the larger it is, the greater the variation in the values of the sample elements.
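The following sketch illustrates this interpretation. It rests on two assumptions that are not stated in the text: the range is three standard deviations on each side of the mean, and the data is bell-shaped. Here the data is simulated with `random.gauss`, and the parameters 170 and 10 are arbitrary.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical bell-shaped data: 10,000 draws from a normal distribution.
data = [random.gauss(170, 10) for _ in range(10_000)]

mean = statistics.mean(data)
std = statistics.stdev(data)

# Count the elements that fall within three standard deviations of the mean.
inside = sum(1 for x in data if mean - 3 * std <= x <= mean + 3 * std)
share = inside / len(data)
print(f"share within 3 standard deviations: {share:.3f}")
```

For data that is far from bell-shaped, the share can be noticeably lower, so treat this as a rule of thumb rather than a guarantee.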
Population and sample
In statistics, there are two related terms that you will deal with all the time: population and sample.
Population refers to the entire group of individuals or objects that share a common characteristic. A sample is a subset of the population that is selected for the study.
For example, in research that measures the average yearly income of individuals in a city, the population is the income of every single individual in the city. In practice, however, this data is hard to collect, since you cannot go to every individual and ask about their income. Instead, you take a sample of individuals, and studying the sample helps you make inferences about the population.
Previously, our calculations revolved around the standard deviation of a sample, as this is the most common case. However, if you have access to the entire dataset, you are working with the population, and the formula for the standard deviation changes.
Standard deviation for population
To compute the population standard deviation, statisticians divide the sum of squared differences from the mean by the population size directly ($N$), rather than by the sample size minus one ($n - 1$).
So, the formula for the standard deviation of a population is:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
where $N$ is the population size and $\mu$ is the population mean.
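The only difference between the two formulas is the divisor, which the sketch below makes explicit. The function name `std_dev`, its `population` flag, and the data are hypothetical, introduced just for this illustration.

```python
import math


def std_dev(values, population=False):
    """Standard deviation: divide by N for a population, n - 1 for a sample."""
    n = len(values)
    divisor = n if population else n - 1
    if divisor <= 0:
        raise ValueError("not enough values")
    mean = sum(values) / n
    squared_sum = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_sum / divisor)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean is 5

print(std_dev(data, population=True))  # squared deviations sum to 32; 32 / 8 = 4
print(std_dev(data))                   # 32 / 7, so slightly larger
```

Because the sample version divides by a smaller number, it always gives a result at least as large as the population version for the same data.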
Example: is the startup far from home?
The manager of a small startup with only ten employees wanted to study how much time the employees spend on the road to the office.
The manager sent the employees a questionnaire asking for this information.
The reported travel times, in minutes, were 16, 21, 25, 26, 29, 30, 30, 32, 40, and 45.
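The average time spent on the road ($\mu$) is
$$\mu = \frac{16 + 21 + 25 + 26 + 29 + 30 + 30 + 32 + 40 + 45}{10} = \frac{294}{10} = 29.4 \text{ minutes}$$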
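The standard deviation for this data, using the population formula because all ten employees were surveyed:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
Plugging in the data, you get
$$\sigma = \sqrt{\frac{(16 - 29.4)^2 + (21 - 29.4)^2 + \dots + (45 - 29.4)^2}{10}} = \sqrt{\frac{644.4}{10}}$$
Therefore, the standard deviation is:
$$\sigma = \sqrt{64.44} \approx 8.03 \text{ minutes}$$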
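The numbers in this example can be checked with Python's standard `statistics` module. In the sketch below, `pstdev` applies the population formula and `stdev` applies the sample formula. Since all ten employees were surveyed, the population figure is the relevant one here; the sample figure is printed only for comparison.

```python
import statistics

# Travel times in minutes, taken from the example.
minutes = [16, 21, 25, 26, 29, 30, 30, 32, 40, 45]

mean = statistics.mean(minutes)
population_std = statistics.pstdev(minutes)  # divides by N = 10
sample_std = statistics.stdev(minutes)       # divides by n - 1 = 9

print(mean)                       # 29.4
print(round(population_std, 2))   # 8.03
print(round(sample_std, 2))       # 8.46
```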
Conclusion
Today you learned about a new sample characteristic, dispersion, and considered its most widespread measure, the standard deviation. The key points are:
This value is sometimes more informative than the mean, median, and mode alone. It takes time to calculate, but it gives you more information about the sample (and the population). The larger it is, the greater the variation in the values of the sample elements.
Graphically, the range $(\bar{x} - 3\sigma, \bar{x} + 3\sigma)$ around the mean would include about 99% of the sample elements.
The formula for the standard deviation differs depending on whether you work with sample data or population data.
One more dispersion measure is the variance (the squared standard deviation). It is often used in theorem proofs, but when you work with it, remember that it is measured in squared units.