Computer scienceData scienceInstrumentsPandasData analysis with pandas

Summarizing numeric columns

8 minutes read

pandas is famous for its analytical tools. When we need to compare datasets or evaluate the data that is grouped by categories, we need to refer to summary statistics. Entry-by-entry data comparison is impractical. Moreover, an analyst should be able to describe myriads of small data discrepancies in more general terms. And central tendency measures can be useful for this purpose.

The dataset

At first, we need to import pandas. In this topic, we are going to use the Palmer penguins dataset. You can import it with one line from GitHub (requires internet connection):

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv')

Column names are self-explanatory:

df.head(3)

Output:

+----+-----------+-----------+------------------+-----------------+---------------------+---------------+--------+
|    | species   | island    |   bill_length_mm |   bill_depth_mm |   flipper_length_mm |   body_mass_g | sex    |
|----+-----------+-----------+------------------+-----------------+---------------------+---------------+--------|
|  0 | Adelie    | Torgersen |             39.1 |            18.7 |                 181 |          3750 | MALE   |
|  1 | Adelie    | Torgersen |             39.5 |            17.4 |                 186 |          3800 | FEMALE |
|  2 | Adelie    | Torgersen |             40.3 |            18   |                 195 |          3250 | FEMALE |
+----+-----------+-----------+------------------+-----------------+---------------------+---------------+--------+

The R Palmer Penguins dataset logo

Central tendency measures

Comparing the datasets often means comparing the average. And by "average" they mean... the mean. The arithmetic mean, to be precise. Let's calculate the mean value for the flipper length:

df.flipper_length_mm.mean()

Output:

200.91520467836258

As you probably know, the mean is not a suitable measure for distributions that are far from normal (roughly speaking, a normal distribution is a symmetrical bell-shaped curve with the mean in the center). The median can be more useful for such cases. Let's find it for the same distribution:

df.flipper_length_mm.median()

Output:

197.0

The flipper length distribution is close to normal in our case (Gaussian).

There can be situations when you need to find the row mean or median. Suppose we came up with the idea to measure penguins with only one value. It can be something like "penguin size", calculated as the mean of all available measurements of a penguin. We will do it by setting the axis argument to columns. Take one experimental penguin (first row) for this experimental measure:

df.head(1).mean(axis='columns', numeric_only = True)

The numeric_only parameter is set to False by default, thus, in order to get the mean value, you have to manually select the valid (numeric, in this case) columns.

Output:

0    997.2
dtype: float64

As for the distribution mode, it makes more sense to find it for categorical features. Let's find the mode for the sex of penguins:

df.sex.mode()

Output:

0    MALE
Name: sex, dtype: object

Our dataset contains more males. Also, note that we have a Series as a result. If this was a multimodal distribution, we would get multiple modes (several rows).

All the methods listed above can be also called from a pandas DataFrame. In that case, they will return the summary statistics for every column in that DataFrame if it is possible. To count only the numeric columns, set the numeric_only argument to True.

DataFrame.describe

The next useful method for evaluating your data is DataFrame.describe(). It generates descriptive statistics by all columns of a DataFrame. This method outputs the following values for numerical type columns:

  • count — overall number of entries (non-unique);

  • mean — arithmetic mean;

  • std — standard deviation;

  • min, max — lowest and highest values from the column;

  • 25%, 75% — values of 1st and 3rd quartile;

  • 50% — median value.

By default, describe() will generate the statistics for numeric columns only.

You can pass 'all' to the include argument in order to look at both numerical and categorical columns simultaneously:

df.describe(include='all')

Output:

+--------+-----------+----------+------------------+-----------------+---------------------+---------------+-------+
|        | species   | island   |   bill_length_mm |   bill_depth_mm |   flipper_length_mm |   body_mass_g | sex   |
|--------+-----------+----------+------------------+-----------------+---------------------+---------------+-------|
| count  | 344       | 344      |        342       |       342       |            342      |       342     | 333   |
| unique | 3         | 3        |        nan       |       nan       |            nan      |       nan     | 2     |
| top    | Adelie    | Biscoe   |        nan       |       nan       |            nan      |       nan     | MALE  |
| freq   | 152       | 168      |        nan       |       nan       |            nan      |       nan     | 168   |
| mean   | nan       | nan      |         43.9219  |        17.1512  |            200.915  |      4201.75  | nan   |
| std    | nan       | nan      |          5.45958 |         1.97479 |             14.0617 |       801.955 | nan   |
| min    | nan       | nan      |         32.1     |        13.1     |            172      |      2700     | nan   |
| 25%    | nan       | nan      |         39.225   |        15.6     |            190      |      3550     | nan   |
| 50%    | nan       | nan      |         44.45    |        17.3     |            197      |      4050     | nan   |
| 75%    | nan       | nan      |         48.5     |        18.7     |            213      |      4750     | nan   |
| max    | nan       | nan      |         59.6     |        21.5     |            231      |      6300     | nan   |
+--------+-----------+----------+------------------+-----------------+---------------------+---------------+-------+

For categorical columns, describe() generates the following metrics:

  • count — overall number of entries (non-unique);

  • unique — unique quantity of entries;

  • top — the most frequent value;

  • freq — how many times the most frequent value appears in the column.

If you use include='all' you get NaNs for numeric statistics in object columns and vice versa. It doesn't mean that pandas can't calculate them, it's just standard DataFrame.describe() behavior. If you want to find them, use other methods. For example, for a number of unique entries in numeric columns, use DataFrame.nunique(), and so on.

Conclusion

In this topic, you learned how to summarize and evaluate your data in a more efficient way. Now you know how to:

  • find the central tendency measures with .mean(), .median(), .mode()

  • print the statistics both for numerical and object columns with describe().

64 learners liked this piece of theory. 2 didn't like it. What about you?
Report a typo