
Matplotlib histogram


You've been exposed to some matplotlib, right? Do you know how to create histograms? No? This topic can help you build up your skills in data visualization.

A histogram is a graphical display of data that organizes groups of data points into ranges. These ranges are represented by bars. It resembles a bar chart, but it's not quite the same. The key difference is that you use a bar chart for categorical data representation, while a histogram displays only numerical data. You've seen a lot of histograms before. Now, it's time to learn to create your own!

Creating a simple histogram

First of all, we need to import matplotlib.pyplot into our code:

import matplotlib.pyplot as plt

This is how we can create a histogram:

plt.hist(x)

Where x is an array of values that we want to plot. Let's say you've decided to plot the height of your friends. You have two friends who are 163 cm tall, one who is 164 cm, and so on. Now, let's create a very simple histogram with only 8 values:

data = [163, 163, 164, 170, 170, 172, 173, 190]
plt.hist(data)

Note that the array of values is the only required argument. We just pass it to the function to obtain a simple histogram:

Create a simple histogram with matplotlib

Though our plot is valid, we can still improve it. What does it represent? To make it clearer, we need to add a title and axis labels with plt.title(), plt.xlabel(), and plt.ylabel(). You can also change the fill color with the color argument and add a border between adjacent bars with edgecolor. Let's have a look at the following example:

plt.hist(data, color="orange", edgecolor="white")
plt.title("My friends' height")
plt.ylabel("Number of people")
plt.xlabel("Height in cm")
plt.show()

Change the color and edgecolor of a histogram

Changing bins

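Everyone can see what kind of data our histogram represents, but where does each group start and end? By default, plt.hist() splits the interval between the smallest and the largest value into 10 equal-width bins, so the edges rarely fall on round numbers. The first group probably starts at 163 and continues to somewhere around 166, the second one may be from 167 to 173, and so on. This ambiguity is bad when you want to present your data clearly and concisely.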
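One way to remove the guesswork is to look at what plt.hist() returns: the bar heights, the bin edges, and the drawn bars. The sketch below is only an illustration. The variable names counts and edges are our own, and the Agg backend is selected just so the code runs without opening a window.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook or GUI session
import matplotlib.pyplot as plt

data = [163, 163, 164, 170, 170, 172, 173, 190]

# plt.hist() returns the bar heights, the bin edges, and the bar objects
counts, edges, _ = plt.hist(data)

print(len(edges))                    # number of edges: one more than the number of bins
print(edges[0], edges[-1])           # the edges span the smallest and the largest value
print(counts.astype(int).tolist())   # how many values fall into each bin
plt.close()
```

With the default settings, most of the bins are empty, which is why the plot looks like a few isolated bars.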

To deal with it, we need to set the bins argument. Let's say we want to divide our data values into 3 groups: from 160 to 170 cm, from 170 to 180, and from 180 to 190. To do that, we need to pass a list of the bin edges to the bins argument:

bins = [160, 170, 180, 190]
plt.hist(data, bins=bins, color="orange", edgecolor='white')

Here's what we get:

Change the bins of a histogram

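In this histogram, we can clearly see that you have three friends between 160 and 170 cm, four between 170 and 180, and only one between 180 and 190. Each bin includes its left edge but not its right one, so the two friends who are exactly 170 cm tall fall into the second bin. The last bin is the exception: it includes both edges, which is why the friend who is 190 cm tall is counted.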
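The same numbers can be read from the return value of plt.hist() instead of from the picture. This is a minimal sketch: the loop and the variable names are ours, and the Agg backend is an assumption made only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in an interactive session
import matplotlib.pyplot as plt

data = [163, 163, 164, 170, 170, 172, 173, 190]
bins = [160, 170, 180, 190]

counts, edges, _ = plt.hist(data, bins=bins, color="orange", edgecolor="white")

# Pair every bin's left and right edge with its count
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f}-{right:.0f} cm: {int(n)} friends")
plt.close()
```

Both 170 cm values land in the 170-180 bin, and the 190 cm value is kept in the last bin.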

The bins argument can take not only a list but also an int. A list defines the bin edges. An int defines the number of equal-width bins. So, if we want to have four bins in our histogram, we simply write bins=4. You can try this on your own and see how it looks. There's also a third option: pass a str as the bins value. It has to be the name of one of the binning strategies (rules used to calculate the bin edges) supported by numpy, such as 'rice', 'scott', 'sqrt', and so on. We won't go into much detail here, but you are welcome to read the official documentation.
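Here is a short sketch of the int and str forms side by side. The variable names are ours, and the Agg backend is an assumption for running without a display. How many bins a strategy such as 'sqrt' picks depends on the data, so the code prints that number rather than assuming it.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in an interactive session
import matplotlib.pyplot as plt

data = [163, 163, 164, 170, 170, 172, 173, 190]

# An int: four equal-width bins between the smallest and the largest value
counts_int, edges_int, _ = plt.hist(data, bins=4)
plt.close()

# A str: numpy chooses the number of bins using the named strategy
counts_str, edges_str, _ = plt.hist(data, bins="sqrt")
plt.close()

print(edges_int.tolist())   # the five edges of the four bins
print(len(counts_str))      # number of bins chosen by the 'sqrt' strategy
```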

Cutting off data

Our imaginary dataset consists of only 8 values, but real-world datasets are much bigger. Sometimes, you just don't need all the data that's in there. In that case, pass a tuple with the start and end values to the range argument. Values outside this range are ignored, and the bins are calculated over the given range instead of the whole dataset. Note that range has no effect if bins is a list of edges. Assume that we only want to see people from 170 to 180 cm in our histogram:

plt.hist(data, color="black", edgecolor='white', range=(170, 180))

Here's the result:

Plot only data within a certain range on a histogram
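To check which values survive the cut, we can again inspect the return value of plt.hist(). This is a sketch with our own variable names; the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in an interactive session
import matplotlib.pyplot as plt

data = [163, 163, 164, 170, 170, 172, 173, 190]

counts, edges, _ = plt.hist(data, color="black", edgecolor="white", range=(170, 180))

print(edges[0], edges[-1])   # the bins now cover only the requested range
print(int(counts.sum()))     # number of values that fall inside the range
plt.close()
```

Only the values 170, 170, 172, and 173 are counted; the three shorter friends and the 190 cm one are left out.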

Plotting multiple datasets

Let's say you have a brother, Andy. He has noted the height of his friends, too, and now you want to plot these two datasets together to compare them. How do you do that?

You need to pass a list of datasets to plt.hist() and, preferably, add plt.legend() to make your plot easier to interpret. The legend works the same way as in other types of plots in matplotlib: just add a label argument inside plt.hist() and there you go!

my_data = [163, 163, 164, 170, 170, 172, 173, 190]
andy_data = [161, 172, 174, 175, 181, 183, 186, 190]
bins = [160, 170, 180, 190]
names = ["my friends", "Andy's friends"]

plt.hist([my_data, andy_data], bins=bins, label=names)
plt.title("Mine and Andy's friends' height")
plt.ylabel("Number of people")
plt.xlabel("Height in cm")

plt.legend()
plt.show()

Here is the resulting plot:

Create a side-by-side histogram with matplotlib

We can see that Andy's friends are generally taller than ours. We have three friends from 160 to 170 cm while he has one. We have only one friend from 180 to 190 cm, while Andy has four.
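These per-group numbers can be checked in code. When plt.hist() receives several datasets, the returned counts can be indexed per dataset, in the same order as the input. A small sketch with our own loop and variable names; the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in an interactive session
import matplotlib.pyplot as plt

my_data = [163, 163, 164, 170, 170, 172, 173, 190]
andy_data = [161, 172, 174, 175, 181, 183, 186, 190]
bins = [160, 170, 180, 190]
names = ["my friends", "Andy's friends"]

counts, edges, _ = plt.hist([my_data, andy_data], bins=bins, label=names)

# One row of counts per dataset, one column per bin
for name, row in zip(names, counts):
    print(name, [int(n) for n in row])
plt.close()
```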

By the way, if blue and orange aren't your favorite colors, you can change them. It works almost the same way as with one dataset: pass a list of color names, one per dataset, to the color argument. For example:

plt.hist([my_data, andy_data], bins=bins, label=names, color=['red', 'green'])

In this example, bars are placed side-by-side. It is the default way to plot multiple datasets. The alternative way is to stack the values on top of each other. You can do that by setting the stacked argument to True. We also add edgecolor for better readability:

plt.hist([my_data, andy_data], bins=bins, label=names, stacked=True, edgecolor='white')

This will result in the following plot:

Create a stacked histogram with matplotlib

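This plot is a bit ambiguous but still interpretable: the bars of the second dataset start where the bars of the first one end, so their heights are harder to read from the axis. We generally recommend sticking to the side-by-side placement of the bars.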
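The sketch below illustrates where the ambiguity comes from. As far as we can tell from matplotlib's behavior, with stacked=True the returned counts are cumulative across datasets, so the last row holds the top of each stacked bar rather than the second dataset's own counts. Treat this as an assumption to verify on your matplotlib version; the variable names are ours, and the Agg backend is used only to run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed in an interactive session
import matplotlib.pyplot as plt

my_data = [163, 163, 164, 170, 170, 172, 173, 190]
andy_data = [161, 172, 174, 175, 181, 183, 186, 190]
bins = [160, 170, 180, 190]

counts, edges, _ = plt.hist([my_data, andy_data], bins=bins, stacked=True)

# Top of each stacked bar: both datasets added together
tops = [int(n) for n in counts[-1]]
# Andy's own counts are the difference between the top and the first layer
andy_only = [int(n) for n in counts[-1] - counts[0]]

print(tops)
print(andy_only)
plt.close()
```

Reading Andy's numbers from a stacked plot requires the same subtraction by eye, which is what makes the side-by-side layout easier to interpret.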

Conclusion

In this topic, we have covered the basics of histogram creation with matplotlib. Now you know how to plot a simple histogram, how to change the number of bins and their edges, how to cut off unnecessary data, and how to visualize several datasets on one plot. If you want to dive into more detail, check out the official matplotlib documentation.
