During exploratory data analysis, you will almost certainly need to plot your data. Charts and graphs are useful for examining feature distributions, detecting outliers and errors in the data, and exploring patterns. You can carry out an initial analysis with pandas visualization tools.
Getting started
Pandas plotting tools rely on the matplotlib library, so you have to install it to avoid errors.
In this topic, we will use the pandas.DataFrame.plot() function.
Its main parameters are the following:
- x: feature to plot along the x-axis;
- y: feature to plot along the y-axis;
- kind: the kind of plot to produce.
You can find a full description in the pandas documentation.
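As a minimal sketch of how these three parameters fit together, the snippet below plots a tiny made-up table. The DataFrame `toy` and its column names are hypothetical and are not part of the World Happiness Report data; the `Agg` backend is chosen only so the sketch can run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
toy = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 69, 80, 92],
})

# x and y select the columns for the axes, kind selects the plot type.
ax = toy.plot(x="height", y="weight", kind="scatter")

# plot() returns a matplotlib Axes object that can be customized further.
ax.set_title("weight vs. height")
```

Because the call returns a matplotlib Axes, any further tweaks (titles, labels, limits) can be done with regular matplotlib methods.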
Let's dive into examples to understand the theory better. We will use the data from the World Happiness Report, specifically the one from 2015.
import pandas as pd
df = pd.read_csv('2015.csv')  # load the data
Histograms
A histogram is a graph of data broken down into numerically ordered groups called bins. Each bar represents one bin, and its height is proportional to the number of values that fall into it.
One uses a histogram to learn the spread of values and their scale (are they measured in tens? or hundreds? or millions?), and to understand the mode of the data (the most frequently occurring value).
Let's first plot the 'Generosity' feature, which describes how generous the people are in a given country.
df.plot(y='Generosity', kind='hist', bins=15)
The bins parameter regulates the number of bins, that is the number of numerically ordered groups in data.
As a result, we get the following plot:
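The counting behind such a plot can be sketched without drawing anything. The example below assumes equal-width bins, which is what you get when bins is an integer, and uses NumPy's `histogram` function on a handful of made-up values (they are not taken from the dataset).

```python
import numpy as np

# Hypothetical values standing in for a numeric column such as 'Generosity'.
values = np.array([0.05, 0.10, 0.12, 0.15, 0.16, 0.20, 0.25, 0.35, 0.45, 0.80])

# Split the range [min, max] into 3 equal-width bins and count values per bin.
counts, edges = np.histogram(values, bins=3)

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"from {left:.2f} to {right:.2f}: {count} values")
```

A larger number of bins gives a more detailed but noisier picture, while a smaller number smooths the shape out.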
At this point, it is important to pause and analyze the graph. Firstly, 'Generosity' is a continuous variable (it can take on an uncountable set of values). Secondly, its values fall between 0 and 0.8. We can also see that the mode of the data is about 0.15.
The maximum 'Generosity' score is about 0.8, and there is only one country with 'Generosity' higher than 0.6. You may wonder which country that is: it is Myanmar, whose 'Happiness Rank' is 129 out of 157. We can interpret the value as the proportion of generous people in a given country: 0 (or 0%) means there are no generous people, and 1 (or 100%) means that all people are generous.
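One way to look up such a row is sketched below. The rows are made up, and the 'Country' column name is an assumption about the file's layout; adjust it to whatever the identifying column is called in your copy of the data.

```python
import pandas as pd

# Hypothetical rows; the real file is assumed to have a 'Country' column.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Generosity": [0.12, 0.71, 0.30],
    "Happiness Rank": [10, 129, 55],
})

# idxmax() returns the index label of the row with the largest value.
top = toy.loc[toy["Generosity"].idxmax()]
print(top["Country"], top["Happiness Rank"])

# Boolean filtering lists every row above a chosen threshold.
print(toy[toy["Generosity"] > 0.6])
```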
You may also want to plot histograms of two or more features on one plot. Let's see how it works on 'Economy (GDP per Capita)', 'Family', which is how people value family in a given country, and 'Health (Life Expectancy)' features.
df.plot(y=['Family', 'Health (Life Expectancy)', 'Economy (GDP per Capita)'],
kind='hist', bins=8, alpha=0.5)
The alpha parameter regulates the transparency of the bins.
The output is:
The 'Family' values go from 0 to 1.5, the 'Health (Life Expectancy)' values go from 0 to 1.05, and the 'Economy (GDP per Capita)' values go from 0 to 1.7. Therefore, 'Health (Life Expectancy)' has the smallest range and 'Economy (GDP per Capita)' has the largest. Looking more closely at the shapes, you can notice that the distribution of 'Economy (GDP per Capita)' is roughly uniform, while the distribution of 'Family' is bell-shaped but left-skewed.
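Observations like these can be cross-checked numerically. The sketch below uses two made-up columns (their names and values are hypothetical) to compare ranges and to compute the sample skewness, where a negative value suggests a longer left tail.

```python
import pandas as pd

# Hypothetical columns: one with a small range, one wider with a long left tail.
toy = pd.DataFrame({
    "narrow": [0.4, 0.5, 0.5, 0.6, 0.6, 0.6, 0.7],
    "wide_left_skewed": [0.1, 0.9, 1.1, 1.2, 1.3, 1.3, 1.4],
})

# Range of each column: largest value minus smallest value.
value_range = toy.max() - toy.min()

# Sample skewness of each column; negative means the left tail is longer.
skewness = toy.skew()

print(value_range)
print(skewness)
```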
An analogue of a histogram for a categorical variable is a bar plot. For instance, let's examine the 'Region' feature.
Here we plot a pandas Series (not a pandas DataFrame as in the previous examples), so we do not need to specify the y parameter. The important thing is that we first have to apply value_counts(), which returns a Series containing the counts of unique values.
df['Region'].value_counts().plot(kind='bar')
Here's how value_counts() transforms the data:
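As a text-only sketch of that transformation, the example below applies value_counts() to a short made-up Series (the region labels are illustrative, not the dataset's actual rows).

```python
import pandas as pd

# Hypothetical categorical data.
regions = pd.Series(["Europe", "Asia", "Europe", "Africa", "Europe", "Asia"])

# Count how many times each unique value occurs, most frequent first.
counts = regions.value_counts()
print(counts)
```

The unique values become the index of the new Series and the counts become its values, which is exactly the shape a bar plot needs.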
And this is the resulting bar plot:
We still have the frequency of each group along the y-axis, but there are the names of the groups along the x-axis.
Let's take a look at the chart. 'Region' is a categorical variable. We can count 10 regions in the dataset and the most common region is Sub-Saharan Africa.
Since the height of each bar is proportional to the frequency, you can also compare groups. For instance, the number of rows with 'Middle East and Northern Africa' in 'Region' is approximately twice the number of rows with 'Southeastern Asia' in 'Region'.
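import matplotlib.pyplot as plt

# draw the bar plot, then save the current figure to a file
df['Region'].value_counts().plot(kind='bar')
# bbox_inches='tight' trims the bounding box so the tick labels are not cut off
plt.savefig('name_of_the_pic.jpg', bbox_inches='tight')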
Scatter plots
A scatter plot helps to identify a trend between two features. The first feature is along the x-axis and the second one is along the y-axis. Usually, a point on a graph represents a row from a dataset. The X coordinate of a point is the value of the first feature and the Y coordinate is the value of the second feature.
Now let's plot 'Happiness Score' vs. 'Economy (GDP per Capita)'.
df.plot(x='Economy (GDP per Capita)', y='Happiness Score', kind='scatter')
Here's the plot:
What can we tell from this plot? Well, there is a clear upward linear trend: the higher 'Economy (GDP per Capita)' is, the higher the 'Happiness Score' is. You can also spot outliers and unusual objects. For instance, there are a couple of countries that have a lower happiness score than other countries with the same GDP per capita.
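If you want a number to go with the visual impression of a linear trend, one common option is the Pearson correlation coefficient. This is a small sketch on made-up values (the column names `gdp` and `score` are hypothetical), not a result computed from the report.

```python
import pandas as pd

# Hypothetical values with a roughly linear upward trend.
toy = pd.DataFrame({
    "gdp": [0.2, 0.5, 0.8, 1.1, 1.4],
    "score": [3.9, 4.6, 5.1, 6.0, 6.8],
})

# Pearson correlation: close to 1 for a strong upward linear trend,
# close to -1 for a strong downward one, and near 0 for no linear trend.
r = toy["gdp"].corr(toy["score"])
print(round(r, 3))
```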
Boxplots
This kind of plot is useful to learn and compare the centers of the features (mean and median), and to identify the outliers. Here is the scheme, which describes how it works.
The scheme is taken from here.
Let's draw boxplots for 'Freedom', which is the freedom to choose what you do in life, and 'Family' features.
df.plot(y=['Family', 'Freedom'], kind='box', showmeans=True)
The showmeans parameter regulates whether or not to show the mean on the boxplot.
We get the following boxplots:
The mean is represented by the green triangle, the median is represented by the green line inside the box, and outliers are represented by the circles.
What can we see? Firstly, the range of the 'Family' feature is greater than the range of the 'Freedom' feature. Secondly, the mean and median of 'Freedom' are similar, but for 'Family' the mean is a bit lower than the median. Finally, 'Family' has outliers, while 'Freedom' does not.
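The elements of a boxplot can also be computed by hand. The sketch below assumes the common convention (matplotlib's default) that points lying more than 1.5 interquartile ranges beyond the box are drawn as outliers; the values are made up.

```python
import pandas as pd

# Hypothetical values with one unusually large observation.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# The box spans the first to the third quartile; the line inside is the median.
q1 = values.quantile(0.25)
median = values.median()
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are shown as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
print(values.mean())  # the mean is pulled towards the outlier
```

Here the mean ends up above the median because of the single large value, which is the mirror image of the 'Family' case, where low outliers pull the mean below the median.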
Summary
Pandas is well suited for initial visual analysis. You can get many common kinds of graphs in one line of code. This way, you learn the general shape of the data, notice patterns, and even get some insights at this early stage.
Here are the main points we've discussed in this topic:
- The main function that we can use is pandas.DataFrame.plot().
- The main parameters regulate the data for the axes and the type of the plot. Additional parameters can make the graph easier to read (for example, the bins and alpha parameters).
- The most common plots are histogram, bar plot, scatter plot, and boxplot.
Feel free to try other types!