There are different ways to analyze a set of quantitative data. Today, we're going to talk about a very helpful way to characterize distributions of random variables: quartiles. The notion of distribution was discussed earlier in the topic "Histograms and distributions".
Definition
Quartiles divide a collection of data points into four equal-sized parts, each containing 25% of the total sample size. The first (lower) quartile $Q_1$ cuts off the 25% of units with the smallest values, and the third (upper) quartile $Q_3$ cuts off the 25% of units with the largest values; the second quartile $Q_2$ is the median. To calculate the quartiles, we first find the median of the sample, which is the second quartile. The median splits the sample into two equal parts, and the medians of these parts are the first and third quartiles. Sometimes the zeroth and fourth quartiles are also used; they are simply the minimum and maximum values, respectively.
Quartiles can only be calculated when the data is ordered from the smallest to the largest.
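The procedure described above (find the median, then the medians of the two halves) can be sketched in Python. This is a minimal illustration rather than the only possible convention: the function name `quartiles_by_halves` is hypothetical, and for an odd number of values the sketch assumes that the median is included in both halves.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves approach.

    Assumption: when the number of values is odd, the middle value
    is included in both the lower and the upper half.
    """
    data = sorted(values)            # quartiles require ordered data
    n = len(data)
    lower_half = data[:(n + 1) // 2]
    upper_half = data[n // 2:]
    return median(lower_half), median(data), median(upper_half)


# Hypothetical six-item sample: Q1 is the 2nd item, Q3 is the 5th item.
print(quartiles_by_halves([10, 20, 30, 40, 50, 60]))  # (20, 35.0, 50)
```

With six items, the lower half holds the first three values and the upper half the last three, so their medians are the second and the fifth items of the sorted sample.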
For example, if the sample consists of 6 items, then the second item is taken as the first quartile of the sample, and the fifth item is taken as the third quartile. Let's take a look at the graph below:
Here we see that there is no actual median represented by a specific value, so we have to calculate it ourselves as the mean of the two middle values, the third and the fourth. Luckily, single elements correspond to the first and third quartiles here: these are the second and the fifth elements. But what if, as with the median, no single element of the distribution corresponds to them? Then we apply the following formulas:
To find the index of the element that is the first quartile (counting from 1): $\frac{n + 3}{4}$, where $n$ stands for the number of elements in the distribution.
To find the index of the element that is the third quartile (counting from 1): $\frac{3n + 1}{4}$, where $n$ stands for the number of elements in the distribution.
Summing up, we have the following table:
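Quartile | How to find it |
$Q_0$ | Minimum value |
$Q_1$ | Element at index $\frac{n + 3}{4}$ |
$Q_2$ | Median |
$Q_3$ | Element at index $\frac{3n + 1}{4}$ |
$Q_4$ | Maximum value |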
Note:
If the number of values is odd, the median equals the middle value of a sorted list of values.
If the number of values is even, the median equals the sum of the middle pair of values divided by two.
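Imagine we have a sorted sample with only five elements. The task is to find all five quartiles. It's quite obvious that $Q_0$ is the first element (the minimum) and $Q_4$ is the fifth element (the maximum). The number of values is odd, so the median is the middle, third element, which gives $Q_2$. After applying the formulas for the remaining two quartiles, we get that the second element is the first quartile $Q_1$ and the fourth element is the third quartile $Q_3$. Note that in some cases the element index you get from the formulas above turns out to be a fraction; the quartile then lies between two neighboring elements and is found by linear interpolation.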
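The index-based calculation can be sketched in Python as follows. The sketch assumes the position formulas $\frac{n + 3}{4}$ and $\frac{3n + 1}{4}$ (counted from 1), which reproduce the five-element case above (positions 2 and 4), and it handles fractional positions with linear interpolation. The helper names `value_at_position` and `five_quartiles` and the sample values are hypothetical.

```python
def value_at_position(sorted_data, position):
    """Value at a 1-based, possibly fractional, position.

    A fractional position is resolved by linear interpolation
    between the two neighboring elements.
    """
    lower_index = int(position)            # whole part of the position
    fraction = position - lower_index      # how far to move toward the next element
    lower_value = sorted_data[lower_index - 1]
    if fraction == 0:
        return lower_value
    upper_value = sorted_data[lower_index]
    return lower_value + fraction * (upper_value - lower_value)


def five_quartiles(values):
    """Return [Q0, Q1, Q2, Q3, Q4] for a non-empty sample."""
    data = sorted(values)
    n = len(data)
    return [
        data[0],                                    # Q0: minimum
        value_at_position(data, (n + 3) / 4),       # Q1 (assumed formula)
        value_at_position(data, (n + 1) / 2),       # Q2: median
        value_at_position(data, (3 * n + 1) / 4),   # Q3 (assumed formula)
        data[-1],                                   # Q4: maximum
    ]


# Hypothetical five-element sample: positions 2 and 4 give Q1 and Q3.
print(five_quartiles([20, 3, 12, 8, 7]))  # [3, 7, 8, 12, 20]
```

For a sample of ten values the same function lands on positions 3.25 and 7.75, so both quartiles are interpolated between neighbors.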
Other quantiles
Quantiles are values that divide a population into a certain number of parts with an equal number of elements. We have just talked about quartiles, which divide a ranked series into four equal parts, but there are other types of quantiles. The most famous quantile is the median, which divides the population into two equal parts. In addition to the median, there are deciles and percentiles.
Deciles are values that divide the ranked series into ten equal parts. The first decile cuts off the lowest 10% of the population, and the ninth decile cuts off the lowest 90%. Thus, nine deciles are distinguished.
Percentiles, coming from the word "percent", divide the ranked series into 100 equal parts. Accordingly, the median is the 50th percentile, and the first and third quartiles are the 25th and 75th percentiles, respectively. In general, any quantile can be expressed as a percentile, which is why the two terms are often used interchangeably.
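The relationship between quartiles, deciles, and percentiles can be checked with Python's standard library. This is a small sketch, assuming Python 3.8 or newer (where `statistics.quantiles` is available) and the `"inclusive"` method, which interpolates linearly between neighboring values; the sample itself is invented for illustration.

```python
from statistics import median, quantiles

# Hypothetical sample, invented for illustration only.
data = [2, 4, 5, 7, 8, 9, 11, 12, 14, 30]

# quantiles() returns the n - 1 cut points that split the data into n parts.
quartile_cuts = quantiles(data, n=4, method="inclusive")      # Q1, Q2, Q3
decile_cuts = quantiles(data, n=10, method="inclusive")       # 9 deciles
percentile_cuts = quantiles(data, n=100, method="inclusive")  # 99 percentiles

print(len(decile_cuts))                         # 9
print(percentile_cuts[49] == median(data))      # True: 50th percentile is the median
print(percentile_cuts[24] == quartile_cuts[0])  # True: 25th percentile is Q1
print(percentile_cuts[74] == quartile_cuts[2])  # True: 75th percentile is Q3
```

Other interpolation conventions exist, so different tools may report slightly different cut points for the same data.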
Let's say a souvenir company wants to know its production rate. In order to do that, we need to find quartiles, the first decile, and the ninth decile. The list below shows the number of souvenirs made by each worker on a given day:
First of all, we sort the values and count them:
Next, we find the quartiles. The median divides our sample into two halves with the same number of values in each one (the median is indicated by a blank space). The first quartile $Q_1$ is the median of the lower half, and the third quartile $Q_3$ is the median of the upper half. Finally, we find the deciles: the first decile $D_1$ is the value that cuts off the lowest 10% of the sorted values, and the ninth decile $D_9$ is the value that cuts off the lowest 90%.
Range and interquartile range
Range (R) is the difference between the maximum and minimum values of the variation series: $R = x_{\max} - x_{\min}$.
The interquartile range (IQR) is a measure of the variability of a sample. It is defined as the difference between the upper and lower quartiles: $IQR = Q_3 - Q_1$.
Range and interquartile range both measure the spread in a data set. Looking at the spread lets us see how much the data varies. The range is a quick way to get an idea of the spread. It takes longer to find the IQR, but it often gives us more useful information. For example, unlike the range, the IQR is not distorted by a few extreme values, and together with the quartiles it helps to reveal outliers and the skewness of the data.
Let's see how the IQR is used to detect outliers. Outliers are usually defined as observations that fall below the lower fence, $Q_1 - 1.5 \cdot IQR$, or above the upper fence, $Q_3 + 1.5 \cdot IQR$.
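A minimal Python sketch of this outlier check is shown below. It assumes the common rule with fences placed 1.5 IQRs away from the quartiles, and it takes the quartiles from `statistics.quantiles` with the `"inclusive"` method (Python 3.8 or newer). The function name `find_outliers`, its parameter `k`, and the sample are hypothetical.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                       # interquartile range
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers


# Hypothetical sample, invented for illustration only.
sample = [2, 4, 5, 7, 8, 9, 11, 12, 14, 30]

print(max(sample) - min(sample))  # 28: the range is inflated by the value 30
print(find_outliers(sample))      # (-3.875, 21.125, [30])
```

The single large value stretches the range to 28, while the IQR stays at 6.25, which is why the fences built from the IQR flag that value as an outlier.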
Example
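It's time to practice. Let's assume we have a sample of numeric values. Our task is to find all the quartiles of the sample.
First, we have to sort the data in ascending order.
From the sorted data, the zeroth and fourth quartiles become apparent: $Q_0$ is the first (smallest) value and $Q_4$ is the last (largest) one.
Now, we'll calculate the median value, which is the second quartile $Q_2$.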
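Next, we should find the index of the element that would be the first quartile. The index turns out to be a fraction between 3 and 4, so the first quartile lies between the third element $x_3$ and the fourth element $x_4$. We find its exact value using linear interpolation: $Q_1 = x_3 + (i - 3)(x_4 - x_3)$, where $i$ is the computed index.
For the third quartile, we repeat almost the same set of calculations: we compute the index, take the two neighboring elements, and interpolate between them to get $Q_3$.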
Now, we need to find the interquartile range: $IQR = Q_3 - Q_1$.
Let's check whether our data has outliers. The lower fence is $Q_1 - 1.5 \cdot IQR$, and the upper fence is $Q_3 + 1.5 \cdot IQR$. In our case, there aren't any observations that lie below the lower fence. However, one observation lies above the upper fence, so it is the outlier of our sample.
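To make the steps concrete, here is the same procedure carried out on a hypothetical ten-value sample. The values are invented for illustration, and the index formulas $\frac{n + 3}{4}$ and $\frac{3n + 1}{4}$ (counted from 1) are an assumption; with $n = 10$ the first-quartile index falls between the third and fourth elements, as in the walkthrough above.

```latex
\begin{aligned}
&\text{sorted sample: } 2,\ 4,\ 5,\ 7,\ 8,\ 9,\ 11,\ 12,\ 14,\ 30 \qquad (n = 10)\\
&Q_0 = 2, \qquad Q_4 = 30\\
&Q_2 = \frac{8 + 9}{2} = 8.5\\
&i_1 = \frac{n + 3}{4} = 3.25 \;\Rightarrow\; Q_1 = x_3 + 0.25\,(x_4 - x_3) = 5 + 0.25 \cdot 2 = 5.5\\
&i_3 = \frac{3n + 1}{4} = 7.75 \;\Rightarrow\; Q_3 = x_7 + 0.75\,(x_8 - x_7) = 11 + 0.75 \cdot 1 = 11.75\\
&IQR = Q_3 - Q_1 = 11.75 - 5.5 = 6.25\\
&\text{lower fence} = Q_1 - 1.5 \cdot IQR = 5.5 - 9.375 = -3.875\\
&\text{upper fence} = Q_3 + 1.5 \cdot IQR = 11.75 + 9.375 = 21.125\\
&30 > 21.125 \;\Rightarrow\; 30 \text{ is an outlier}
\end{aligned}
```

No value lies below the lower fence, and exactly one value lies above the upper fence, so this invented sample has a single outlier.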
Conclusion
In this topic, we have discussed the idea of quartiles and interquartile ranges, how to calculate them, and what other types of quantiles exist.