
Summarizing categorical columns


pandas is famous for its analytical tools. When we compare datasets or evaluate data grouped by categories, we rely on summary statistics: comparing entries one by one is impractical, and an analyst should describe a myriad of small data discrepancies in more general terms. Let's take a look at the most basic discrepancies possible – quantity discrepancies.

Counting

At first, we need to import pandas. In this topic, we are going to use the Palmer penguins dataset. You can import it with one line from GitHub (requires internet connection):

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv')

Column names are self-explanatory:

df.head(3)

Output:

+----+-----------+-----------+------------------+-----------------+---------------------+---------------+--------+
|    | species   | island    |   bill_length_mm |   bill_depth_mm |   flipper_length_mm |   body_mass_g | sex    |
|----+-----------+-----------+------------------+-----------------+---------------------+---------------+--------|
|  0 | Adelie    | Torgersen |             39.1 |            18.7 |                 181 |          3750 | MALE   |
|  1 | Adelie    | Torgersen |             39.5 |            17.4 |                 186 |          3800 | FEMALE |
|  2 | Adelie    | Torgersen |             40.3 |            18   |                 195 |          3250 | FEMALE |
+----+-----------+-----------+------------------+-----------------+---------------------+---------------+--------+


Once the dataset is imported, we need to count the following values:

  1. Non-null values,
  2. Missing values,
  3. All values, including the missing values,
  4. The number of unique values.

We can calculate those values not only for the numeric but also for categorical data.

1. Non-nulls

How many non-NA values do we have in the column "sex"? To make it really neat and simple, use the Series.count() method:

df.sex.count()

Output:

333
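The same method extends to the whole frame: DataFrame.count() returns the non-null count for every column at once. Here is a minimal sketch on a made-up three-row table (a stand-in for the penguins data, so the example is self-contained):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the penguins data
df = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'sex': ['MALE', np.nan, 'FEMALE'],
})

print(df.sex.count())  # 2 non-null entries in the "sex" column
print(df.count())      # per-column non-null counts for the whole frame
```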

2. Nulls

To count the missing values in the column "sex", we need a slightly more complex approach. Series.isna() returns a boolean Series in which NaN values are mapped to True and everything else to False.

df.sex.isna()

Output:

0      False
1      False
2      False
3       True
4      False
       ...  
339     True
340    False
341    False
342    False
343    False
Name: sex, Length: 344, dtype: bool

Keep in mind that True and False in Python behave like 1 and 0, respectively. So, to count all the missing values, we need to find the sum of the .isna() output. For this purpose, we'll use Series.sum(). Let's combine those two methods in one line:

df.sex.isna().sum()

Output:

11

The "sex" column contains eleven missing values.
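The same chain works frame-wide: df.isna().sum() gives the number of missing values in every column in one call. A sketch on a toy frame (the data is made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame with one missing value in "sex"
df = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Gentoo'],
    'sex': ['MALE', np.nan, 'FEMALE'],
})

# Per-column missing-value counts in one call
print(df.isna().sum())
```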

3. All values

Next, we need to count the total number of observations. The overall number of cells in a column is the column length. To output it, we will use the DataFrame.shape attribute and take its first element (the number of rows):

df.shape[0]

Output:

344

We have 344 observations in total.

Note that .shape is an attribute, not a method. You cannot call it, so no parentheses are needed.
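The three counts fit together: non-nulls plus nulls always equals the column length, and len(df) is an equivalent way to get it. A quick sanity check on a toy column (made up for the example):

```python
import pandas as pd
import numpy as np

# Toy column: two values present, two missing
df = pd.DataFrame({'sex': ['MALE', np.nan, 'FEMALE', np.nan]})

# len(df) and df.shape[0] both give the number of rows
assert len(df) == df.shape[0] == 4

# Non-nulls plus nulls add up to the column length
assert df.sex.count() + df.sex.isna().sum() == df.shape[0]
```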

4. Unique values

Suppose we want to find the number of unique penguin species that were studied on the Palmer Archipelago. It is similar to COUNT(DISTINCT ...) in SQL. To do it, we will use the Series.nunique() method:

df.species.nunique()

Output:

3

There are three species studied in the dataset.
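nunique() also works on a whole DataFrame, and it skips NaN by default; pass dropna=False to count NaN as its own distinct value. A sketch on toy data (invented for the example):

```python
import pandas as pd
import numpy as np

# Toy frame: three species, two sexes, one missing sex
df = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Chinstrap'],
    'sex': ['MALE', 'FEMALE', np.nan, 'FEMALE'],
})

print(df.species.nunique())          # 3
print(df.sex.nunique())              # 2 – NaN is not counted by default
print(df.sex.nunique(dropna=False))  # 3 – NaN counted as a distinct value
```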

Series.value_counts & Series.unique

You can also easily find the frequency of each unique value. It can be useful for categorical columns, like species. Let's find how many specimens of each species are included in the dataset using .value_counts():

df.species.value_counts()

Output:

Adelie       152
Gentoo       124
Chinstrap     68
Name: species, dtype: int64

This is a very useful method! You don't need pandas grouping or aggregations for a simple frequency table. Note that the result is a Series with the unique values as the index and the counts as the values, so we can, for example, select particular entries from it:

df.species.value_counts().loc[['Adelie', 'Chinstrap']]

Output:

Adelie       152
Chinstrap     68
Name: species, dtype: int64
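Two optional parameters of .value_counts() are worth knowing: normalize=True returns shares instead of raw counts, and dropna=False keeps a separate bucket for NaN. A sketch on a toy Series (invented for the example):

```python
import pandas as pd
import numpy as np

# Toy column with a repeated value and one NaN
s = pd.Series(['Adelie', 'Adelie', 'Gentoo', np.nan], name='species')

print(s.value_counts(normalize=True))  # shares instead of counts (NaN excluded)
print(s.value_counts(dropna=False))    # includes the NaN bucket
```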

But sometimes what we need is just a list of unique values. To achieve it, use .unique():

df.species.unique()

Output:

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

It returns a NumPy array rather than a list, but you can iterate through it or index it like a plain Python list. For example, let's add "The" and "penguin" to each value:

[('The ' + species + ' penguin') for species in df.species.unique()]

Output:

['The Adelie penguin', 'The Chinstrap penguin', 'The Gentoo penguin']
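One detail to keep in mind: .unique() returns values in order of first appearance, not sorted. Sort explicitly when you need alphabetical order, as in this small sketch (toy data invented for the example):

```python
import pandas as pd

# "Gentoo" appears first, so it leads the unsorted result
s = pd.Series(['Gentoo', 'Adelie', 'Gentoo', 'Chinstrap'])

print(list(s.unique()))    # order of first appearance
print(sorted(s.unique()))  # alphabetical order
```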

Neat! Now you are ready to move on to more complex comparison measures.

Conclusion

Now you know how to:

  • count non-null, null, all, and distinct values with .count(), .isna().sum(), the .shape attribute, and .nunique() respectively;
  • find the frequency of each unique value with .value_counts() and list the unique values with .unique().