Visualizing your data is an important part of data analysis. There are a lot of plots to choose from (scatter plots, violin plots, histograms, and more), so how do you know which one you need? Don't worry; this topic will help you with that. We'll focus on ways to visualize numeric data (also called quantitative data), that is, data presented in the form of numbers. Let's start!
Getting started
As always, let's start with the import statements: pandas to deal with the data and matplotlib to create plots.
import matplotlib.pyplot as plt
import pandas as pd
In our examples, we'll use a dataset called Running and Heart Rate Data by Max Candocia. He tracked his running time, speed, distance, heart rate, and so on. It's a big one, so we'll work only with some of the columns:
# load the dataset from csv
data = pd.read_csv('run.csv')
# leave only the columns we need
data = data[['heart_rate', 'total_running_time', 'distance', 'speed']]
The shortened version of the dataset looks like this:
Let's also do some basic preprocessing: remove the rows with missing values (NaN, which is how pandas marks a missing number). Some of the functions we use in this topic won't work if there are missing values in an array. Fortunately, pandas provides a df.dropna() method:
data.dropna(inplace=True)
inplace=True means that the method modifies our dataset directly instead of returning a modified copy.
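To make the effect of inplace concrete, here is a minimal sketch on a tiny made-up DataFrame. The name toy and its values are assumptions for illustration and are not taken from run.csv. It compares the default behavior, which returns a new DataFrame, with inplace=True, which changes the existing one and returns None.

```python
import numpy as np
import pandas as pd

# a tiny made-up DataFrame with two missing values (illustrative only)
toy = pd.DataFrame({
    'heart_rate': [120.0, np.nan, 150.0, 160.0],
    'speed': [8.5, 9.0, np.nan, 11.0],
})

# default: dropna() returns a new DataFrame and leaves toy untouched
cleaned = toy.dropna()
rows_before = len(toy)        # toy still has all of its rows
rows_cleaned = len(cleaned)   # only rows without any NaN remain

# inplace=True: toy itself is modified and the call returns None
result = toy.dropna(inplace=True)
rows_after = len(toy)

print(rows_before, rows_cleaned, rows_after, result)
```

Because the in-place call returns None, writing data = data.dropna(inplace=True) would replace the dataset with None; use either the assignment or inplace=True, not both.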
We're ready to create some plots! In this topic, we'll show you examples of both univariate (with only one variable) and bivariate (with two variables) distributions. Let's begin!
Histogram
Histograms show a univariate distribution; they're widely used for the representation of numeric data. A histogram is a bar plot at the heart; the values are grouped into bins, and the height of bars represents data entries in that bin. This way histograms allow us to see the shape of a distribution.
Let's try to plot the distance variable from our dataset. Our first step is to pass it to the plt.hist() function:
plt.hist(data['distance'])
We've got a histogram now, but it's not very pretty:
How can we make it better? First, let's use the bins argument to change the number of data bins and edgecolor to make edges between bins visible. Then we can change the x-ticks with plt.xticks() and set a title to our plot with plt.title(). Here's the code:
# create a histogram
plt.hist(data['distance'], bins=25, edgecolor='white')
# change the x-ticks
xticks = range(0, 45, 5)
plt.xticks(xticks)
# set a title
plt.title('Running distance')
The bins argument specifies the number of bins (if it's an int) or the edges of bins (if it's a list). There's no universal recipe to determine the number of bins; it's different for each case. So just experiment with this parameter and see which number is best for your data.
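The two forms of bins can be compared without drawing anything. The sketch below uses numpy.histogram, which matplotlib's hist relies on for the binning; the distances list is made up for illustration and is not the running dataset.

```python
import numpy as np

# made-up distances in km (illustrative only)
distances = [1.2, 2.5, 3.1, 4.8, 6.0, 7.4, 12.9, 18.3, 39.5]

# an int asks for that many equal-width bins between the min and max value
counts_int, edges_int = np.histogram(distances, bins=4)

# a list gives the bin edges explicitly: 0-5, 5-10, 10-20, 20-40
counts_edges, edges_edges = np.histogram(distances, bins=[0, 5, 10, 20, 40])

print(counts_int, edges_int)   # 4 counts and 5 edges
print(counts_edges)            # one count per explicit bin
```

Explicit edges are handy when the bins should line up with round numbers or when unequal bin widths make sense for the data.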
Now our histogram is prettier and easier to interpret:
So what do we see here? First, let's look at the modality, that is, the number of peaks. Our data is unimodal, meaning there's only one peak (between 0 and 5). If you see two peaks, the distribution is bimodal; if there are more, it's multimodal. If there are no peaks at all and the bins are all about the same height, the distribution is uniform.
We can also assess how skewed (asymmetrical) our distribution is. In our case, it's skewed to the right, which is called a positive skew. It means that the tail of larger values stretches in the positive direction: there are many more values between 0 and 10 than between 10 and 40. If the tail stretches to the left instead, that's called a negative skew. If the peak is right in the center and both sides mirror each other, that's a symmetrical distribution.
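Skewness can also be checked numerically. The sketch below computes the moment coefficient of skewness (the third central moment divided by the second central moment raised to the power 1.5) with the standard library only. The helper name skewness and both samples are assumptions made up for illustration.

```python
from statistics import mean, median

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    avg = mean(values)
    m2 = sum((v - avg) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - avg) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# made-up samples (illustrative only)
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 12, 25]
symmetric = [1, 2, 3, 4, 5, 6, 7]

print(skewness(right_skewed))                      # positive: long tail of large values
print(skewness(symmetric))                         # zero for a perfectly symmetric sample
print(mean(right_skewed), median(right_skewed))    # the tail pulls the mean above the median
```

A positive result corresponds to a right (positive) skew and a negative result to a left (negative) skew.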
Another important thing to look for is outliers. These are values that are very different from the rest of the data, either much smaller or much larger. Can you see that tiny bin near 40? That's an outlier. However, it's hard to spot because histograms aren't very good at showing outliers. Box plots are perfect for that! Let's move on to them.
Box plot
A box plot (also called a box-and-whisker plot) shows a univariate distribution based on five numbers: minimum, first quartile, second quartile (median), third quartile, and maximum. Quartiles are the three values that divide an ordered list of numbers into four equal parts.
To understand it better, let's look at how they are calculated. We take all the values in our array (running distances in our example) and sort them from smallest to largest. Then we divide this list into four equal parts; the points where we divide are the quartiles: the first quartile ($Q_1$) marks the first 25% of the data, the second quartile ($Q_2$, the median) marks 50%, and the third ($Q_3$) marks 75%. Next, we calculate the interquartile range ($IQR$), the difference between the third and first quartiles: $IQR = Q_3 - Q_1$. Now, we can calculate the minimum and maximum values. The minimum is $Q_1 - 1.5 \cdot IQR$ and the maximum is $Q_3 + 1.5 \cdot IQR$. All data points falling outside the range between the minimum and maximum are outliers.
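The calculation above can be sketched in a few lines. This example assumes the common 1.5 × IQR rule (which matches matplotlib's default whis=1.5) and NumPy's default linear interpolation for percentiles; the distances array is made up for illustration.

```python
import numpy as np

# made-up distances in km (illustrative only)
distances = np.array([1, 2, 2, 3, 3, 4, 5, 6, 8, 30])

# quartiles via linear interpolation (NumPy's default method)
q1, q2, q3 = np.percentile(distances, [25, 50, 75])
iqr = q3 - q1

# the box-plot "minimum" and "maximum" under the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# everything outside [lower, upper] counts as an outlier
outliers = distances[(distances < lower) | (distances > upper)]

print(q1, q2, q3, iqr)
print(lower, upper, outliers)
```

Note that matplotlib draws each whisker to the most extreme data point that still lies inside these bounds, so the caps on a plot may sit closer to the box than the computed lower and upper values.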
Now, let's create a box plot. The function we need is plt.boxplot(). We'll set the vert argument to False, so that our box plot is horizontal. We'll also add plt.xticks() and plt.title() as in the histogram example. Let's do that:
# create a horizontal box plot
plt.boxplot(data['distance'], vert=False)
# change the x-ticks
xticks = range(0, 45, 5)
plt.xticks(xticks)
# set a title
plt.title('Running distance')
We've got our box plot:
The box shows us the interquartile range (the distance from the first to the third quartile), and the orange line shows the median (the second quartile, the middle value in an ordered list of numbers). Then there are lines (the whiskers) on both sides of the box. The caps on them indicate the minimum and maximum values. All the circles are outliers.
Does this plot show us anything that a histogram doesn't? Yes, it shows the outliers more clearly. With a histogram, it seemed like there were no values in the range of 30 to 40; here, we see that there are some. We can also see that most values are concentrated between 0 and 18. Do you remember what we call that? Right, skewness. Both histograms and box plots are useful to see whether a distribution is skewed. If it is symmetrical, the box will be in the center. Here, the box is on the left side, so the distribution is positively skewed.
Another thing that would be nice to see is the mean. It's the average of the dataset: the sum of all values divided by the number of values. Fortunately, we can show it on a box plot: add showmeans=True to plt.boxplot(). This gives you a little triangle marking the mean. If you want it to be a line instead, add meanline=True as well. Here's the code:
plt.boxplot(data['distance'], vert=False, showmeans=True, meanline=True)
Now, the plot looks like this:
The dashed green line shows the mean. We see that the mean is larger than the median, which makes sense since there are some large values (the circles on the plot) that pull it in this direction.
To sum it up: a histogram lets you see the shape of a distribution. A box plot lets you see the quartiles, median, mean, and outliers. Is there a way to combine all of that in one plot? Yes! It's called a violin plot.
Violin plot
A violin plot is used for visualizing the distribution of numeric data. It's a combination of a box plot and a kernel density plot (which is, in turn, very similar to a smoothed histogram). Violin plots have all the features of a box plot and also show the shape of a distribution. You may have noticed that the box plot doesn't show us how many peaks there are in the data. A violin plot does!
The function to create a violin plot is plt.violinplot(). Let's create one:
# create a violin plot
plt.violinplot(data['distance'])
# change the y-ticks
yticks = range(0, 45, 5)
plt.yticks(yticks)
# set a title
plt.title('Running distance')
Our violin plot is vertical; that's why we set y-ticks instead of x-ticks. If you want to make it horizontal, use vert=False, just like with the box plot.
Here's our plot:
The blue shape is widest between 0 and 5, which shows us that there's one peak there (compare with the histogram). The caps on the vertical line indicate the minimum and maximum values. Note that they aren't calculated the box plot way; here, they're just the smallest and the largest values in the data.
Wait, we've said that the violin plot has all the same features as the box plot. Then where are the mean, median, and quartiles? Don't worry; we just need to specify that we want them. It's done with showmeans=True and showmedians=True inside plt.violinplot(). If we want quartiles as well, we need the quantiles argument, which takes a list (in our case, [0.25, 0.75] for the first and third quartiles).
By default, all these lines are blue, which makes it hard to distinguish them. plt.violinplot() returns a dictionary of the elements it has drawn, so let's save it into the violin variable and set the line colors with calls like violin['cmeans'].set_color(). We now have the following code:
# create a violin plot with means, medians, and two quantiles
violin = plt.violinplot(data['distance'], showmeans=True, showmedians=True, quantiles=[0.25, 0.75])
# set colors to the lines
violin['cmeans'].set_color('green')
violin['cmedians'].set_color('orange')
violin['cquantiles'].set_color('black')
# change the y-ticks
yticks = range(0, 45, 5)
plt.yticks(yticks)
# set a title
plt.title('Running distance')
Our plot has become much more informative:
Now, the black lines are the first and the third quartiles (the edges of the box in the box plot), the orange line is the median, and the green line is the mean. So here you have it: a box plot and a histogram combined!
It's possible to plot several box plots or violin plots on one graph to compare several arrays of data.
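As a minimal sketch of such a comparison, the example below passes a list of two arrays to boxplot() and violinplot(), which draws one box or violin per array. The two samples and their names (morning_runs, evening_runs) are made up for illustration.

```python
import matplotlib.pyplot as plt

# two made-up arrays to compare (illustrative only)
morning_runs = [2, 3, 3, 4, 5, 5, 6, 8]
evening_runs = [4, 5, 6, 6, 7, 9, 10, 15]

# one figure with two side-by-side axes
fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(8, 4))

# passing a list of arrays draws one box (or violin) per array
boxes = ax_box.boxplot([morning_runs, evening_runs])
violins = ax_violin.violinplot([morning_runs, evening_runs], showmedians=True)

# by default the plots sit at x positions 1 and 2, so we label those ticks
for ax in (ax_box, ax_violin):
    ax.set_xticks([1, 2])
    ax.set_xticklabels(['morning', 'evening'])

ax_box.set_title('Box plots')
ax_violin.set_title('Violin plots')
```

Placing the plots on a shared scale like this makes it easy to compare medians, spread, and outliers between the groups at a glance.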
Now, let's move on to another visualization tool for numeric data – a scatter plot.
Scatter plot
The plots we've seen so far all showed a univariate distribution. But hey, there's usually more than one column in a dataset! So now we've finally come to bivariate distributions. The scatter plot shows the relationship between two numeric variables by plotting individual values as dots.
Let's plot the data on running speed and heart rate. The common practice is to put the explanatory variable on the x-axis and the response variable on the y-axis. We suspect that speed may be affecting the heart rate, not the other way around. So in our case, speed is the explanatory variable, and heart rate is the response.
Now let's use plt.scatter() and see what we get:
# create a scatter plot
plt.scatter(x=data['speed'], y=data['heart_rate'])
# set a title and labels to the axes
plt.title('Running speed and heart rate')
plt.xlabel('Running speed')
plt.ylabel('Heart rate of the runner')
Our plot looks like this:
How about some styling? We have a big dataset, so the plot is quite crowded. Changing the style of the dots will make it clearer. Let's use the s argument to set the size of the markers (dots), color to change their fill color, and edgecolor to set a different color for their edges. Here's the code for that:
plt.scatter(x=data['speed'], y=data['heart_rate'], s=15, color='white', edgecolor='tab:blue')
Let's take a look at the plot now:
We can see the dots more clearly now, so let's try to interpret the plot. There's little scatter here (most of the dots are packed tightly together), which suggests a strong relationship. If the dots were widely scattered, a relationship would be less likely.
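One common way to put a number on how tightly the dots follow a line is Pearson's correlation coefficient. The sketch below computes it with numpy.corrcoef on made-up speed and heart-rate values; these are illustrative and not taken from the dataset. Keep in mind that it only measures linear relationships.

```python
import numpy as np

# made-up speed / heart-rate pairs (illustrative only)
speed = np.array([6.0, 7.5, 8.0, 9.5, 10.0, 11.5, 12.0])
heart_rate = np.array([118, 127, 131, 142, 147, 158, 160])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r = np.corrcoef(speed, heart_rate)[0, 1]

print(round(r, 3))  # close to 1: the dots lie almost on a rising line
```

Values near 1 or -1 indicate a tight linear pattern, and values near 0 indicate little or no linear relationship. With a DataFrame, data['speed'].corr(data['heart_rate']) computes the same coefficient.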
Scatter plots also show us the outliers – individual dots far from others. There are many outliers on our plot: for example, values with heart rate above 200 or speed above 30.
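Once a scatter plot suggests thresholds like these, the matching rows can be pulled out with a boolean mask. This is a minimal sketch on a made-up DataFrame called toy (an assumption for illustration); only the thresholds of 200 and 30 come from the text above.

```python
import pandas as pd

# made-up rows (illustrative only); the thresholds come from the text above
toy = pd.DataFrame({
    'speed': [9.0, 10.5, 31.0, 11.0, 12.0],
    'heart_rate': [140, 150, 155, 205, 160],
})

# rows where either variable crosses its threshold
suspicious = toy[(toy['heart_rate'] > 200) | (toy['speed'] > 30)]

print(suspicious)
```

Whether such rows are measurement errors or real but unusual runs is a separate question; the mask only helps you find them.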
Scatter plots are good for determining whether there's any relationship between the variables and for seeing the outliers.
Conclusion
In this topic, we've learned the basics of numeric data visualization with matplotlib: some of the statistical terminology (like skewness, modality, median, mean, and so on), how to create different kinds of plots, and how to interpret the results. Let's quickly sum it up:
Histograms show the shape of a distribution: whether it's symmetrical or not and how many peaks it has;
Box plots show minimum, maximum, median, quartiles, and outliers (values which are very different from others);
Violin plots show the same things as box plots plus the shape of a distribution;
Scatter plots show the relationship between two variables and are useful for outlier detection.
It's time to practice!