Today we are going to cover the basics of one of the most popular plots used in statistics: the box plot. We will also take a look at how to draw and customize it using the matplotlib library in Python.
A box plot (also known as a box-and-whisker plot) is a convenient way to visualize the distribution of numerical data using quartiles. Box plots are widespread in descriptive statistics; they allow you to quickly explore one or more datasets. One of the advantages of box plots is that they are very concise, which makes them especially useful when you want to compare distributions across many groups or datasets.
Box plot metrics
The summary metrics of a box plot are:
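The first and third quartiles ($Q_1$ and $Q_3$) that correspond to the 25th and 75th percentiles;
An interquartile range ($IQR = Q_3 - Q_1$) indicates the range of values from $Q_1$ to $Q_3$;
A mean, the arithmetic average of all values;
A median, the middle value of the data set;
A minimum value excluding outliers. It is bounded by $Q_1 - 1.5 \cdot IQR$: the lower whisker ends at the smallest observation that is not below this value;
A maximum value excluding outliers. It is bounded by $Q_3 + 1.5 \cdot IQR$: the upper whisker ends at the largest observation that is not above this value;
The outliers, or the observations that fall outside the $1.5 \cdot IQR$ rule. They are displayed as single points in line with whiskers.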
The straight lines coming out of the box are whiskers. They indicate how far the data spreads (its dispersion) beyond the first and third quartiles.
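The metrics listed above can be computed by hand. The following is a minimal sketch, assuming NumPy's default (linear interpolation) percentile definition; the sample values are made up for illustration, and other quartile conventions may give slightly different numbers.

```python
import numpy as np

# Hypothetical sample: ten ordinary values plus one obvious outlier
values = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 40], dtype=float)

# Quartiles and the interquartile range
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences given by the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that lie inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is treated as an outlier
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"whiskers: [{whisker_low}, {whisker_high}]")
print(f"outliers: {outliers}")
```

For this sample, the quartiles are 4.5 and 9.5, so the IQR is 5.0 and the fences sit at -3.0 and 17.0. The whiskers stop at 2.0 and 12.0, the extreme observations inside the fences, and 40.0 is the only outlier.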
Creating a box plot
The basic matplotlib syntax for plotting a box plot looks like this:
plt.boxplot(data)
where data is an array of data values that you want to plot.
There is a great number of optional arguments. In this topic, however, we are going to cover only the most essential ones:
vert, if False, produces a horizontal box plot;
tick_labels is a sequence of strings that sets a label for each dataset (this parameter was called labels before matplotlib 3.9);
showmeans, if True, displays the mean values as a triangle on the box (set to False by default);
meanline, if True, together with showmeans=True displays the mean as a line;
boxprops, medianprops, meanprops, whiskerprops, capprops, and flierprops allow us to change the properties of the box, median, mean, whiskers, caps, and outliers, respectively.
You can find more box plot parameters in the Official Matplotlib Documentation.
Let's create a simple box plot:
# import the required libraries
import matplotlib.pyplot as plt
import numpy as np
# set the numpy seed for results reproducibility
np.random.seed(14)
# generate data
data = np.random.normal(0, 4, size=200)
# create a boxplot
plt.boxplot(data)
plt.show()
Take a look at how we use NumPy random sampling. In particular, we have drawn 200 samples from a normal (Gaussian) distribution by passing np.random.normal the desired mean, the standard deviation, and the number of samples we want.
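If you want the numbers behind the picture, one option is the boxplot_stats helper from matplotlib.cbook, which computes the statistics for each dataset. The sketch below reuses the data from the example above; the choice of which keys to print is ours.

```python
import numpy as np
from matplotlib import cbook

np.random.seed(14)
data = np.random.normal(0, 4, size=200)

# boxplot_stats returns a list with one dictionary per dataset
stats = cbook.boxplot_stats(data)[0]

# Quartiles, whisker ends and the mean of the sample
for key in ("q1", "med", "q3", "iqr", "whislo", "whishi", "mean"):
    print(f"{key:>7}: {stats[key]:.2f}")

# 'fliers' holds the observations outside the whiskers (the outliers)
print("outliers:", len(stats["fliers"]))
```

Here whislo and whishi are the whisker ends, that is, the most extreme observations that still satisfy the 1.5 * IQR rule.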
Horizontal box plot
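To create a horizontal box plot, set the vert argument to False (matplotlib 3.10 introduced orientation='horizontal' as the replacement for vert=False, so check which spelling your version expects):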
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(14)
data = np.random.normal(10, 60, size=200)
plt.boxplot(data, vert=False)
plt.show()
Multiple box plots
Let's go further and plot several box plots side by side:
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(14)
data_1 = np.random.normal(50, 40, 200)
data_2 = np.random.normal(60, 30, 200)
data_3 = np.random.normal(70, 20, 200)
data_4 = np.random.normal(80, 10, 200)
data = [data_1, data_2, data_3, data_4]
plt.figure(figsize=(10, 7))
plt.boxplot(data)
plt.show()
As you can see, plotting several box plots is very straightforward. You only need to pass a list of arrays as the first argument of plt.boxplot; each array becomes a separate box.
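A list of arrays is not the only accepted input: when the data is a 2D array, a box is drawn for each column. The following sketch assumes a hypothetical table with three columns. The Agg backend is selected only so that the snippet also runs without a display; drop that line and add plt.show() in an interactive session.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove when working interactively
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(14)

# 200 rows, 3 columns: each column is treated as a separate dataset
table = np.random.normal(loc=[50, 60, 70], scale=[40, 30, 20], size=(200, 3))

fig, ax = plt.subplots(figsize=(10, 7))
result = ax.boxplot(table)

# One box per column, two whiskers per box
print(len(result["boxes"]), len(result["whiskers"]))
```

A list of 1D arrays is the more flexible form, since the arrays are allowed to have different lengths.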
Box plot labels
To label each box plot, pass a list of strings to the tick_labels argument (named labels before matplotlib 3.9). To make the code a bit more readable, it is advised to create a separate list of labels and then pass it to tick_labels. Labeling the axes and giving the plot a title is a standard matplotlib procedure: you pass a string to the plt.xlabel, plt.ylabel, or plt.title function, with an optional fontsize argument:
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(14)
data_1 = np.random.normal(50, 40, 200)
data_2 = np.random.normal(60, 30, 200)
data_3 = np.random.normal(70, 20, 200)
data_4 = np.random.normal(80, 10, 200)
data = [data_1, data_2, data_3, data_4]
plt.figure(figsize=(10, 7))
labels = ['first', 'second', 'third', 'fourth']
plt.boxplot(data, tick_labels=labels)
plt.ylabel('Values')
plt.xlabel('Data sets')
plt.title('Multiple box plot example', fontsize=14)
plt.show()
Our box plots look much nicer, but they are still quite plain and boring to look at, don't you agree? Let's try and fix that!
Box plot colors
To fill a box with color, set patch_artist=True. The reason is that, by default, matplotlib draws every part of a box plot with Line2D artists: plt.boxplot returns a dictionary that maps each part of the plot (boxes, whiskers, caps, medians, fliers, and means) to a list of Line2D objects. A line has no edgecolor or facecolor property; it has a single color.
If patch_artist is set to True, the boxes are drawn as Patch artists rather than simple lines. "Patch" is a name inherited from MATLAB: it is a 2D patch of color on the figure, such as a rectangle, a circle, or a polygon. There are different artist types in matplotlib; you can read more about each in the Official Documentation. With patches in place, boxprops accepts facecolor and edgecolor, and you can tweak the other parts of the plot through whiskerprops, capprops, medianprops, and similar arguments. Each of these arguments accepts a dictionary whose keys are property names and whose values are the settings to apply.
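The difference between the two modes can be checked directly by inspecting the objects that boxplot returns. This is a small sketch under the assumption of a headless run (hence the Agg backend); the variable names are ours.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove when working interactively
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.lines import Line2D
from matplotlib.patches import PathPatch

np.random.seed(14)
data = np.random.normal(50, 40, size=200)

fig, (ax_lines, ax_patches) = plt.subplots(1, 2)
as_lines = ax_lines.boxplot(data)                          # default: boxes are lines
as_patches = ax_patches.boxplot(data, patch_artist=True)   # boxes are patches

# Keys of the returned dictionary: one entry per part of the plot
print(sorted(as_lines))

# The type of the box artist depends on patch_artist
print(type(as_lines["boxes"][0]).__name__)
print(type(as_patches["boxes"][0]).__name__)

# Only the patch version can be filled
as_patches["boxes"][0].set_facecolor("lightblue")
```

The whiskers, caps, medians, and fliers remain Line2D objects in both cases; only the boxes change type.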
Again, to make the code easier to comprehend, it is advised to create separate variables for each dictionary of parameters:
import matplotlib.pyplot as plt
import numpy as np
boxprops = {'facecolor': 'lightblue', 'edgecolor': 'teal', 'linewidth': 2.0}
whiskerprops = {'color': 'green', 'linewidth': 1.5}
capprops = {'color': 'orange', 'linewidth': 1.5}
medianprops = {'color': 'black', 'linewidth': 2}
np.random.seed(14)
data = np.random.normal(50, 40, size=200)
plt.boxplot(data,
            patch_artist=True,
            boxprops=boxprops,
            whiskerprops=whiskerprops,
            capprops=capprops,
            medianprops=medianprops)
plt.show()
Our final box plot may have questionable color choices, but it lets us see how flexible our parameter choice is for each part of a plot. Take a moment to go through the code and see how each parameter is reflected in the box plot.
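The boxprops dictionary applies the same style to every box. When several datasets are plotted and each box should get its own fill, one common approach is to loop over the patches in the returned dictionary. The sketch below assumes a made-up palette and three hypothetical datasets; the Agg backend is there only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove when working interactively
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import to_rgba

np.random.seed(14)
data = [np.random.normal(mean, 20, 200) for mean in (50, 60, 70)]
fill_colors = ["lightblue", "lightgreen", "wheat"]  # hypothetical palette

fig, ax = plt.subplots(figsize=(10, 7))
result = ax.boxplot(data, patch_artist=True)

# Each entry of result['boxes'] is a patch, so it has separate face and edge colors
for box, fill in zip(result["boxes"], fill_colors):
    box.set_facecolor(fill)
    box.set_edgecolor("teal")
```

Without patch_artist=True the boxes would be Line2D objects, and the set_facecolor call would fail.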
Here is a list of some properties of a box plot that you can customize:
boxprops = {'color': 'b', 'facecolor': 'none', 'linestyle': '-', 'linewidth': 1.0}
medianprops = {'color': 'b', 'linestyle': '-', 'linewidth': 1.0}
whiskerprops = {'color': 'b', 'linestyle': '-', 'linewidth': 1.0}
capprops = {'color': 'b', 'linestyle': '-', 'linewidth': 1.0}
flierprops = {'color': 'b', 'marker': 'o',
              'markerfacecolor': 'none', 'markeredgecolor': 'k'}
meanprops = {'color': 'b', 'linestyle': '-', 'linewidth': 1.0,
             'marker': '^', 'markerfacecolor': 'g', 'markeredgecolor': 'k'}
You may have noticed that many properties share the same parameters, such as color and linewidth. If you do not need a distinct color or width for each part of the box plot (and writing a separate dictionary for every part is tedious), you can make use of this similarity and set the property values in a for loop:
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(1)
data = np.random.normal(50, 80, size=200)
plt.figure(figsize=(10, 7))
plot = plt.boxplot(data, patch_artist=True, showmeans=True)
edge_color = 'green'
fill_color = 'lightgreen'
marker_color = 'orange'
# lines and box edges share one color and width
for prop in ['boxes', 'whiskers', 'means', 'medians', 'caps']:
    plt.setp(plot[prop], color=edge_color, linewidth=1.5)
# only the boxes (patches) can be filled
for prop in ['boxes']:
    plt.setp(plot[prop], facecolor=fill_color)
# outliers and means are drawn as markers
for prop in ['fliers', 'means']:
    plt.setp(plot[prop], markerfacecolor=marker_color,
             markeredgecolor=marker_color)
plt.show()
In the code above, we use plt.setp(item, properties) to set the properties of the boxes, whiskers, means, medians, caps, and fliers.
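To make the role of plt.setp more concrete, the sketch below sets the same properties in two ways: with one plt.setp call on a list of artists, and by calling the setter methods on each artist in turn. It is an illustration under the assumption of a headless run (Agg backend), not a description of how plt.setp is implemented.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove when working interactively
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)
data = np.random.normal(50, 80, size=200)

fig, ax = plt.subplots()
plot = ax.boxplot(data)

# One call updates every whisker in the list
plt.setp(plot["whiskers"], color="green", linewidth=1.5)

# The same result for the caps, written as explicit setter calls
for cap in plot["caps"]:
    cap.set_color("green")
    cap.set_linewidth(1.5)

print([whisker.get_linewidth() for whisker in plot["whiskers"]])
print([cap.get_linewidth() for cap in plot["caps"]])
```

Both lists contain 1.5 for every artist, so the choice between the two styles is mostly a matter of brevity.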
Conclusion
In this topic, we have covered the basics of box plots and their properties. We have also used the matplotlib library to create simple box plots and tried to customize them. However, there are plenty of other parameters that you can change to customize your box plot. You can use the official matplotlib documentation to gain a deeper understanding of how you can design a box plot.