A couple of decades ago, mathematical statistics was mostly the domain of economists and financial analysts. Today, it is hard to imagine entrepreneurs, traders, or machine learning (ML) specialists who do not work with it. In this topic, we consider three basic terms of statistics: the mean, the median, and the mode. Each of them is a kind of "average", but they are not identical. We will look at examples and try to understand the difference between them.
Mean
The best-known statistical term is the mean. You can find it everywhere: in the media, in sociological research, even in football players' statistics. Let's look at its definition.
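Suppose that we have an $n$-element sample $X = \{x_1, x_2, \dots, x_n\}$. Then the mean of the sample is defined as the sum of all elements divided by the number of elements in the set:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Often, the mean is denoted as $\bar{x}$ ($x$-bar).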
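The definition translates almost directly into code. Below is a minimal Python sketch; the sample values are made up for illustration, and the helper name `mean` is our own choice rather than something defined in this topic. The standard library function `statistics.mean` should give the same result.

```python
import statistics


def mean(sample):
    """Return the arithmetic mean: the sum of the elements divided by their count."""
    if not sample:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(sample) / len(sample)


# Hypothetical sample, chosen only to illustrate the formula.
sample = [2, 4, 6, 8]

print(mean(sample))             # (2 + 4 + 6 + 8) / 4
print(statistics.mean(sample))  # the standard library computes the same value
```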
Let's look at an example regarding shop revenue.
Imagine that you're the shop owner, and you know how much money the shop earned each day last week.
| Day | Revenue |
|---|---|
| Mon | |
| Tue | |
| Wed | |
| Thu | |
| Fri | |
| Sat | |
| Sun | |
Now, you would like to know the average revenue for those days. This is where the mean comes into play.
We should consider our sample as a set of daily incomes: . Then:
That means that your average daily revenue is equal to . The revenue on Wednesday was average, while revenues on Tuesday, Thursday, and Friday were lower than average. The revenue on other days was higher than the average.
Drawbacks of the mean
However, the mean is not always representative. Let's look at the example below to see why.
Sociologists decided to conduct research and figure out the average weekly salary of factory workers. They interviewed all employees of the plant and received the following results:
- ordinary workers make per week
- administrators make per week
- associate director makes per week
- director makes per week
Then our set is:
And the average salary would be:
This result can't be considered representative because the average salary is more than twice as high as the salary of the ordinary workers, who make up the majority of the plant's employees. That happens because a few people have very high incomes. In statistics, such numbers are called outliers: values that differ significantly from the other observations. Outliers may indicate an experimental error or unusual variability in the measurements. As you can see, they can seriously distort the picture when you compute the mean.
And that's why analysts do not rely exclusively on the mean in data analysis.
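To see the effect of an outlier in isolation, here is a small Python sketch. The salaries below are hypothetical and are not the figures from the survey; they are chosen only so that one value is far larger than the rest.

```python
import statistics

# Hypothetical weekly salaries (not the survey figures):
# ten ordinary workers and one director.
workers = [500] * 10
director = [10_000]

mean_workers = statistics.mean(workers)
mean_everyone = statistics.mean(workers + director)

print(mean_workers)             # mean without the outlier
print(round(mean_everyone, 2))  # mean with the outlier included
```

With these made-up numbers, a single large value pulls the mean to more than twice what ten of the eleven people actually earn.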
Median
In some cases, the outlier problem can be solved by using the sample median instead of the mean.
What is it? To find it, we sort the sample in either ascending or descending order. The value located in the middle of the sorted sample is called the median.
Imagine that we have a 9-element set . Then, we arrange it in increasing order and find the value in the middle:
When the sample has an even number of elements, we take the arithmetic mean of the two middle elements. This value is the median:
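Both cases, odd and even sample sizes, can be sketched in a few lines of Python. The helper name `median` and the sample values are assumptions made for illustration; `statistics.median` from the standard library follows the same rule.

```python
import statistics


def median(sample):
    """Return the middle value of the sorted sample.

    For an even number of elements, return the mean of the two middle values.
    """
    if not sample:
        raise ValueError("the median of an empty sample is undefined")
    ordered = sorted(sample)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


# Hypothetical samples, chosen only for illustration.
odd_sample = [7, 1, 5, 3, 9]   # sorted: 1, 3, 5, 7, 9
even_sample = [4, 1, 3, 2]     # sorted: 1, 2, 3, 4

print(median(odd_sample))
print(median(even_sample))
```

Note that sorting is done on a copy (`sorted`), so the original sample is left unchanged.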
In the example with salaries, the median value is equal to , which represents the average salary way better than the mean value. However, the median is not a silver bullet either, and here's the reason why.
Mode
Now imagine that you're a car manufacturer and you would like to know which color is the most popular among consumers. You need this information because a new batch of cars will be produced soon, and you have to decide what colors to paint them.
To find the answer, you monitor a random highway for an hour and count the number of cars of each color. You get the following results:
| Color | Number of cars |
|---|---|
| White | |
| Black | |
| Blue | |
| Red | |
| Gray | |
| Others | |
Analyzing this sample with the mean or the median would not be informative because we are exploring a non-numeric parameter: the most popular color of a car.
To handle such cases, we should construct a sample using colors: The most common element in the sample is called the mode – here it is "black". And as a car manufacturer, you are interested in that parameter.
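Counting occurrences is exactly what `collections.Counter` from the Python standard library does, so the mode of a non-numeric sample can be sketched as follows. The list of colors is hypothetical: it does not reproduce the counts observed on the highway, only the idea.

```python
from collections import Counter

# Hypothetical observations; the real highway counts are not reproduced here.
colors = ["white", "black", "blue", "black", "red", "black", "gray", "white"]

# Count how many times each color occurs.
counts = Counter(colors)

# most_common(1) returns a list with the single most frequent (element, count) pair.
mode_color, mode_count = counts.most_common(1)[0]
print(mode_color, mode_count)
```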
The key difference between the mode and the mean or the median is the ability to characterize non-numeric sets. But that doesn't mean that the mode isn't used with samples consisting of numbers.
Numeric samples also have modes. Look at the set . Now, group it by the same elements to make it easier to count them:
The most common elements here are and – they both occur three times in the set. Both of them are the modes of .
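When several values are tied for the highest count, `statistics.multimode` (available since Python 3.8) returns all of them. The sample below is hypothetical and is used only to show a tie between two values.

```python
import statistics

# Hypothetical numeric sample with two values tied for the highest count.
sample = [3, 1, 3, 7, 1, 3, 1, 5]

# multimode returns every value that shares the highest count.
modes = statistics.multimode(sample)
print(sorted(modes))
```

By contrast, `statistics.mode` returns a single value, so `multimode` is the safer choice when a sample may have more than one mode.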
Conclusion
As you can see, the mean, the median, and the mode are all characteristics of a sample. Which of them is more informative depends on the data, and that is one of the reasons why statistics is not such a straightforward science. Let's review the key points:
- The (arithmetic) mean is the simplest and most widespread characteristic of a numeric sample. Its weakness is sensitivity to outliers: their presence in the sample can greatly alter the value of the mean, which is highly undesirable.
- The median also applies only to numeric samples, since we need to sort the sample to find it. It can help us overcome the outlier problem, and in such cases it is more informative than the mean.
- The mode can be used for both numeric and non-numeric samples. It is reasonable to use it when we would like to know not the general characteristics, but the most popular element(s) of a sample. Don't forget that a sample may have several modes!
The rest will come with experience!