
Statistics plays an important role in everyday life. It helps you draw reliable conclusions from an uncertain and complex real world, but if misused, it can produce harmful or misleading results. It is also applied in many professional fields, such as marketing and linguistics. Doing statistical calculations by hand is tedious, but Python offers a solution: the built-in statistics module, which provides functions for calculating mathematical statistics of numeric data. In this topic, we will cover the main functions it offers.

Main functions

To start working with the module, just import it.

import statistics

After importing, you can start your statistical calculations. The module offers many functions; the table below describes some of the most commonly used ones.

Function                  Description
statistics.mean()         The average value of data.
statistics.median()       The median (middle value) of data.
statistics.mode()         The most common value of data.
statistics.quantiles()    Division of data into intervals with equal probability.
statistics.pstdev()       The standard deviation of the entire population.
statistics.stdev()        The standard deviation of a sample.
statistics.pvariance()    The variance of the entire population.
statistics.variance()     The variance of a sample.
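As a quick orientation, the sketch below calls each function from the table on one small list. The `data` values are made up for illustration and are not part of the examples that follow.

```python
import statistics

# Hypothetical dataset, chosen only to give every function something to work on.
data = [1, 2, 2, 3, 4, 7, 9]

function_names = (
    "mean", "median", "mode", "quantiles",
    "pstdev", "stdev", "pvariance", "variance",
)

# Look up each function by name in the module and apply it to the same data.
for name in function_names:
    function = getattr(statistics, name)
    print(f"statistics.{name}(data) = {function(data)}")
```

Every function in the table accepts the dataset as its first argument, so they can all be called the same way.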

Now let's have a closer look at these functions.

Mean, median, mode, and quantiles

  1. Imagine the following situation. You bought some toys in a shopping center and decided to write all the prices in a list.

    prices = [300, 450, 230, 120, 150, 130, 400, 300]

    Then you decided to calculate the mean, median, and mode of the prices. First, you use mean() to find the average value, which equals the sum of the data divided by the number of elements in the list.

    print(statistics.mean(prices))  # 260

    So, you pass the list to the function, and then you obtain the average price of a toy.

    If you try to find the mean of an empty list, a StatisticsError will be raised.

  2. Let's return to our prices list. As for the median, it's also easy to obtain.

    print(statistics.median(prices))  # 265.0

    How was it computed? As there is an even number of elements in the list, the median is calculated by taking the average of the two middle elements (based on their values, not their positions in the list). If we sort the prices in ascending order, we find that the two middle values are 230 and 300, and their average is indeed 265. If we had an odd number of elements, the function would return the middle one.

  3. Finally, you can find the mode, which is the most frequent value in the list.

    print(statistics.mode(prices))  # 300

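    Keep in mind that if several elements share the highest number of occurrences, the function returns the first one encountered in the list (this is the behavior since Python 3.8; earlier versions raised a StatisticsError instead).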

    print(statistics.mode([11, 4, 7, 0]))  # 11

  4. If you want to divide the list into n continuous intervals with equal probability, use quantiles(data, n=...), where n is passed as a keyword argument. It returns a list of n - 1 cut-points separating these intervals. If you don't specify n (the number of intervals), your list will be divided into four intervals, so three cut-points will be returned.

    print(statistics.quantiles(prices))  # [135.0, 265.0, 375.0]

    You can also request any number of intervals you need.

    print(statistics.quantiles(prices, n=6))  # [125.0, 150.0, 265.0, 300.0, 425.0]

    In the example above, we decided to get six intervals in the list of prices, so we obtained a list of five cut-points.
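To make the steps above concrete, here is a minimal sketch that recomputes the mean and the median by hand and compares them with the module's results. It also shows one way to handle the error for an empty list. The variable names are illustrative, and `statistics.multimode()` (available since Python 3.8) is not covered in the text above; it is included only to show how ties between modes can be inspected.

```python
import statistics

prices = [300, 450, 230, 120, 150, 130, 400, 300]

# Mean by hand: the sum of the values divided by their count.
manual_mean = sum(prices) / len(prices)

# Median by hand for an even-length list:
# sort the values and average the two middle ones.
ordered = sorted(prices)
middle = len(ordered) // 2
manual_median = (ordered[middle - 1] + ordered[middle]) / 2

print(manual_mean, statistics.mean(prices))      # 260.0 260
print(manual_median, statistics.median(prices))  # 265.0 265.0

# An empty dataset has no mean, so the module raises StatisticsError.
try:
    statistics.mean([])
except statistics.StatisticsError as error:
    print("Cannot compute the mean:", error)

# multimode() returns every most frequent value in order of first appearance,
# while mode() returns only the first of them.
print(statistics.multimode([11, 4, 7, 0]))  # [11, 4, 7, 0]
```

Catching `statistics.StatisticsError` explicitly keeps a program running when the input list may be empty.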

Standard deviation and variance

You may have noticed that the table lists two functions each for standard deviation and variance: one for the entire population and one for a sample. They differ in how the result is calculated.

In statistics, variance is the average of the squared deviations of the values in a list from their mean. To calculate it, you can use one of two formulae that differ only in the denominator: $var = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$ or $var = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n - 1}$. Here, $x_i$ is an element of the list, $\mu$ is the mean, and $n$ is the number of elements in the list.

The first formula is used when you deal with the entire population (the whole dataset), and the second one is used if you work with a sample (a part of your dataset). So, pvariance() is for the entire population, and variance() is for a sample.

As the standard deviation is the square root of the variance, pstdev() and stdev() are used in the same way.
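The two formulae can be sketched directly in Python and checked against the module. The `data` list here is a hypothetical example chosen so that the arithmetic stays simple; the variable names are illustrative as well.

```python
import math
import statistics

# Hypothetical dataset: its mean is 5 and its squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)
squared_deviations = sum((x - mean) ** 2 for x in data)

# Population formula: divide by n.
population_variance = squared_deviations / len(data)
# Sample formula: divide by n - 1.
sample_variance = squared_deviations / (len(data) - 1)

print(population_variance)             # 4.0
print(math.sqrt(population_variance))  # 2.0

# The hand-made results agree with the module's functions.
print(math.isclose(population_variance, statistics.pvariance(data)))        # True
print(math.isclose(sample_variance, statistics.variance(data)))             # True
print(math.isclose(math.sqrt(population_variance), statistics.pstdev(data)))  # True
print(math.isclose(math.sqrt(sample_variance), statistics.stdev(data)))       # True
```

Because the sample formula divides by a smaller number, the sample variance is always at least as large as the population variance computed from the same values.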

In the example from the previous section, we deal with the whole dataset, so we will use pvariance() and pstdev().

print(statistics.pvariance(prices))  # 13550
print(statistics.pstdev(prices))  # 116.40446726822816

Variance measures how far a set of data is spread out. A small variance indicates that the data values tend to be very close to the mean, while a high variance indicates that the data points are widely spread out from the mean. In this example, the variance is rather high, which means the prices differ greatly. The standard deviation, in turn, represents a typical deviation from the mean. Our prices have a standard deviation of approximately 116.4, which shows that they lie rather far from their mean of 260.

Now let's imagine that we randomly select three prices from our list. So, our sample will be the following.

random_prices = [230, 150, 130]

We want to calculate the variance and the standard deviation of the prices. In this case, we will use variance() and stdev().

print(statistics.variance(random_prices))  # 2800
print(statistics.stdev(random_prices))  # 52.91502622129181
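The random selection itself can also be done in code. The following is a minimal sketch that uses `random.sample()` from the standard library; the seed value is an arbitrary assumption made only so that the run is repeatable, and the sample it draws will generally differ from the one shown above.

```python
import random
import statistics

prices = [300, 450, 230, 120, 150, 130, 400, 300]

# A separate generator with a fixed, arbitrary seed keeps the result repeatable.
generator = random.Random(42)

# Draw three prices from the list without replacement.
random_prices = generator.sample(prices, 3)
print(random_prices)

# The drawn values are a sample, so the sample functions apply.
print(statistics.variance(random_prices))
print(statistics.stdev(random_prices))
```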

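Keep in mind that variance() and stdev() require at least two values: if your dataset has fewer, a StatisticsError will be raised. The population functions pvariance() and pstdev() require at least one value.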
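The sketch below illustrates this behavior with a single-element list. The helper `sample_variance_or_none()` is a hypothetical name introduced here for illustration; it is not part of the statistics module.

```python
import statistics


def sample_variance_or_none(values):
    """Return the sample variance, or None when there are too few values."""
    try:
        return statistics.variance(values)
    except statistics.StatisticsError:
        # Raised when fewer than two values are supplied.
        return None


single_price = [300]

# One value is enough for the population variance: nothing deviates from the mean.
print(statistics.pvariance(single_price))  # 0

# The sample variance is undefined here, so the helper returns None.
print(sample_variance_or_none(single_price))  # None

# With two or more values, the helper returns the usual result.
print(sample_variance_or_none([230, 150, 130]))  # 2800
```

Returning None is only one possible design choice; depending on the program, letting the exception propagate may be more appropriate.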

Summary

In this topic, we have covered the basics of the statistics module. Now you know:

  • how to import the module;

  • how to find a mean, median, mode, and quantiles for a dataset;

  • the difference between the usage of statistics.pstdev() and statistics.stdev();

  • the difference between the usage of statistics.pvariance() and statistics.variance().

If you are eager to learn more, read the documentation. Right now, we will proceed to the tasks.

Read more on this topic in Diving into Statistical Programming on Hyperskill Blog.
