Mean, median, and mode

A couple of decades ago, mathematical statistics was mostly the domain of economists and financial analysts. Today, it is hard to imagine entrepreneurs, traders, or machine learning (ML) specialists who do not use it. In this topic, we consider three basic terms of statistics: the mean, the median, and the mode. Each of them is a kind of "average", but they are not identical. We will look at examples and try to understand the difference between them.

Mean

The most well-known statistical term is the mean. You can find it anywhere: in the media, in sociological research, or even in football players' statistics. Let's look at its definition.

Suppose that we have an \(n\)-element sample \(X = \{x_1, x_2, \ldots, x_n\}\). Then, the mean of the sample \(X\) is defined as the sum of all elements divided by the number of elements in the set:

\[ \bar x = \frac{x_1 + x_2 + \ldots + x_n}{n} \]

Often, the mean is denoted as \(\bar{X}\) (\(x\)-bar).

Strictly speaking, this quantity is called the arithmetic mean, because there are several types of mean in mathematical statistics. Usually, the context makes it clear which kind of mean is meant.

Let's look at an example regarding shop revenue.

Imagine that you're the shop owner, and you know how much money the shop earned each day last week.

Day   Revenue
Mon   $1000
Tue   $800
Wed   $900
Thu   $700
Fri   $600
Sat   $1100
Sun   $1200

Now, you would like to know the average revenue for those \(7\) days. This is where the mean comes into play.

We should consider our sample as a set of daily revenues: \(X = \{1000, 800, 900, 700, 600, 1100, 1200\}\). Then:

\[ \bar x = \frac{1000 + 800 + 900 + 700 + 600 + 1100 + 1200}{7} = \frac{6300}{7} = 900 \]

That means that your average daily revenue is $900. The revenue on Wednesday was exactly average, while the revenues on Tuesday, Thursday, and Friday were below average. On the other days, the revenue was above average.
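As a quick check of the arithmetic, here is a minimal Python sketch. The dictionary `revenue` and the helper `arithmetic_mean` are illustrative names introduced for this example; the numbers are the ones from the table.

```python
from statistics import mean

# Daily revenue in dollars for the week, taken from the table in the text.
revenue = {
    "Mon": 1000, "Tue": 800, "Wed": 900, "Thu": 700,
    "Fri": 600, "Sat": 1100, "Sun": 1200,
}

def arithmetic_mean(values):
    """Sum of all elements divided by the number of elements."""
    values = list(values)
    if not values:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / len(values)

average = arithmetic_mean(revenue.values())

# Days whose revenue was below or above the weekly average.
below = [day for day, value in revenue.items() if value < average]
above = [day for day, value in revenue.items() if value > average]

print(average)  # 900.0
print(below)    # ['Tue', 'Thu', 'Fri']
print(above)    # ['Mon', 'Sat', 'Sun']

# The standard library gives the same value.
print(mean(list(revenue.values())) == average)  # True
```

The hand-written helper only mirrors the definition; in practice `statistics.mean` from the standard library does the same job.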

Drawbacks of the mean

However, the mean does not always represent the data well. Let's look at the example below to see why.

Sociologists decided to conduct research and figure out the average weekly salary of factory workers. They interviewed all employees of the plant and received the following results:

  • 40 ordinary workers make $100 per week
  • 3 administrators make $500 per week
  • 1 associate director makes $1000 per week
  • 1 director makes $3000 per week

Then our set is:

\[ X = \{ \underbrace{100, \ldots, 100}_{40 \text{ times}}, 500, 500, 500, 1000, 3000 \} \]

And the average salary would be:

\[ \bar x = \frac{100 \cdot 40 + 500 \cdot 3 + 1000 + 3000}{45} = \frac{9500}{45} \approx 211 \]

This result cannot be considered representative, because the average salary is more than twice as high as the salary of the ordinary workers, who make up the majority of the plant's employees. That happens because we have \(5\) people with very high incomes. In terms of statistics, these \(5\) numbers could be considered outliers: values that differ significantly from other observations. In general, outliers may indicate an experimental error or unusual variability in measurements. As you can see, outliers can seriously distort the picture when you evaluate the mean.

And that's why analysts do not rely exclusively on the mean in data analysis.
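The effect of the five high salaries can be sketched in a few lines of Python. The variable names are hypothetical, and dropping the five high values is done purely for illustration here; it is not a recommended rule for detecting outliers.

```python
from statistics import mean

# Weekly salaries from the factory example: 40 workers, 3 administrators,
# 1 associate director, 1 director.
salaries = [100] * 40 + [500] * 3 + [1000] + [3000]

mean_all = mean(salaries)

# Illustration only: keep just the ordinary workers to see how much
# the five high salaries pull the mean upwards.
ordinary_only = [s for s in salaries if s == 100]
mean_ordinary = mean(ordinary_only)

# How many employees earn less than the "average" salary?
below_mean = sum(1 for s in salaries if s < mean_all)

print(round(mean_all, 2))  # 211.11
print(mean_ordinary)       # 100
print(below_mean, "of", len(salaries), "employees earn less than the mean")
```

In this sample, 40 of the 45 employees earn less than the mean, which is why the mean alone gives a misleading picture.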

Median

In some cases, the problem of outliers can be addressed by considering the sample median instead of the mean.

What is it? To find it, we should sort our set in either ascending or descending order. The number located in the middle of the sorted set is called the median.

The median divides the sorted set into two halves: every element of one half is no greater than the median, and every element of the other half is no less than the median.

Imagine that we have a 9-element set \(X = \{4, 7, 16, 9, 12, 5, 11, 21, 0\}\). Then, we arrange it in increasing order and find the value in the middle:

\[ \{ \underbrace{0, 4, 5, 7}_{4 \text{ elements}} \;\bigg|\; \underbrace{9}_{\textbf{middle = median}} \;\bigg|\; \underbrace{11, 12, 16, 21}_{4 \text{ elements}} \} \]

When the sample has an even number of elements, we take the arithmetic mean of the two elements in the middle. This result is the median. For instance, if we add the element \(24\) to the set above:

\[ \{ \overbrace{0, 4, 5, 7}^{4 \text{ elements}} \;\bigg|\; \underbrace{9, 11}_{\downarrow} \;\bigg|\; \overbrace{12, 16, 21, 24}^{4 \text{ elements}} \} \]

\[ \frac{9 + 11}{2} = \underbrace{10}_{\textbf{median}} \]
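The odd and even cases can be sketched in Python as follows. The function name `sample_median` is an illustrative choice; the standard library's `statistics.median` is shown next to it as a cross-check.

```python
from statistics import median

def sample_median(values):
    """Return the median of a non-empty numeric sample."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd size: the single element in the middle.
        return ordered[middle]
    # Even size: the arithmetic mean of the two middle elements.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_sample = [4, 7, 16, 9, 12, 5, 11, 21, 0]
even_sample = odd_sample + [24]

print(sample_median(odd_sample))   # 9
print(sample_median(even_sample))  # 10.0

# Cross-check against the standard library.
print(median(odd_sample) == sample_median(odd_sample))    # True
print(median(even_sample) == sample_median(even_sample))  # True
```

Sorting a copy with `sorted` leaves the original sample untouched, which is convenient when the same data is reused for other statistics.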

In the example with salaries, the median is $100, which represents the typical salary much better than the mean does. However, the median is not a silver bullet either, and here's why.

Mode

Now, imagine that you're a car manufacturer and you would like to know which color is the most popular among consumers. You need this information because a new batch of cars will be produced soon, and you need to decide what colors to paint them.

To find the answer, you monitor a random highway for an hour and count the number of cars of each color. You get the following results:

Color    Cars
White    293
Black    565
Blue     154
Red      135
Gray     321
Others   98

An analysis of the set \(X = \{293, 565, 154, 135, 321, 98\}\) involving the mean or the median would not be informative, because we are exploring a non-numeric parameter: the most popular color of a car.

To handle such cases, we should construct a sample \(X\) using colors:

\[ X = \{ \underbrace{\text{white}, \ldots, \text{white}}_{293}, \underbrace{\text{black}, \ldots, \text{black}}_{565}, \underbrace{\text{blue}, \ldots, \text{blue}}_{154}, \underbrace{\text{red}, \ldots, \text{red}}_{135}, \underbrace{\text{gray}, \ldots, \text{gray}}_{321}, \underbrace{\text{other}, \ldots, \text{other}}_{98} \} \]

The most common element in the sample is called the mode; here it is "black". As a car manufacturer, this is the parameter you are interested in.

The key difference between the mode and the mean or the median is the ability to characterize non-numeric sets. But that doesn't mean that the mode isn't used with samples consisting of numbers.

Numeric samples also have modes. Look at the set \(X = \{1, 2, 3, 4, 2, 3, 3, 2, 4\}\). Now, group equal elements together to make it easier to count them:

\[ X = \{1, \underbrace{2, 2, 2}_{3}, \underbrace{3, 3, 3}_{3}, 4, 4\} \]

The most common elements here are \(2\) and \(3\): they both occur three times in the set. Both of them are modes of \(X\).

In contrast with the median and the mean, a set can have more than one mode. If a sample consists of unique elements only, each of these elements is a mode!
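Finding all modes amounts to counting occurrences and keeping every element that reaches the highest count. Below is a minimal Python sketch; the helper `modes` and the dictionary `color_counts` are illustrative names, and `statistics.multimode` (available since Python 3.8) is used only as a cross-check.

```python
from collections import Counter
from statistics import multimode

def modes(sample):
    """Return every element that occurs most often, in order of first appearance."""
    counts = Counter(sample)
    if not counts:
        return []
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

# Car colors observed on the highway, from the table in the text.
color_counts = {"white": 293, "black": 565, "blue": 154,
                "red": 135, "gray": 321, "other": 98}
# Expand the counts into the actual sample: one entry per observed car.
cars = [color for color, n in color_counts.items() for _ in range(n)]

numbers = [1, 2, 3, 4, 2, 3, 3, 2, 4]
unique = [5, 8, 13]

print(modes(cars))     # ['black']
print(modes(numbers))  # [2, 3]
print(modes(unique))   # [5, 8, 13]

# Cross-check against the standard library.
print(multimode(numbers) == modes(numbers))  # True
```

Note that the function works equally well for strings and numbers, because it only compares elements for equality and never sorts or adds them.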

Conclusion

As you have seen, the mean, the median, and the mode each characterize a sample in their own way. Which of them is more informative depends on the data. That is one of the reasons why statistics is not such a straightforward science. Let's review the key points:

  • The (arithmetic) mean is the simplest and most widespread characteristic of a numeric sample. However, it is sensitive to outliers: their presence in the sample can greatly alter the value of the mean, which is highly undesirable.
  • The median is also a characteristic of numeric sets only, since we need to sort the sample to find it. It is much less affected by outliers, so when they are present it can be more informative than the mean.
  • The mode can be used for both numeric and non-numeric sets. It is reasonable to use it when we would like to know not a general characteristic, but the most popular element(s) of a set. Don't forget that a sample may have several modes!

The rest will come with experience!
