pandas is famous for its analytical tools. When we need to compare datasets or evaluate the data that is grouped by categories, we need to refer to the summary statistics. Entry-by-entry data comparison is impractical. Moreover, an analyst should describe the myriad of small data discrepancies in more general terms. Let's take a look at the most basic possible discrepancies – quantity discrepancies.
Counting
At first, we need to import pandas. In this topic, we are going to use the Palmer penguins dataset. You can import it with one line from GitHub (requires internet connection):
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv')
Column names are self-explanatory:
df.head(3)
Output:
+----+-----------+-----------+------------------+-----------------+---------------------+---------------+--------+
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|----+-----------+-----------+------------------+-----------------+---------------------+---------------+--------|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18 | 195 | 3250 | FEMALE |
+----+-----------+-----------+------------------+-----------------+---------------------+---------------+--------+
Once the dataset is imported, we need to count the following values:
- Non-null values,
- Missing values,
- All values, including the missing values,
- The number of unique values.
We can calculate those values not only for the numeric but also for categorical data.
1. Non-nulls
How many non-NA values do we have in the column "sex"? To make it really neat and simple, use the Series.count() method:
df.sex.count()
Output:
333
2. Nulls
To count the missing values in the column "sex", we need to take a bit more complex approach. By using Series.isna(), we'll get a boolean object, where NaN values are mapped as True, and everything else as False.
df.sex.isna()
Output:
0 False
1 False
2 False
3 True
4 False
...
339 True
340 False
341 False
342 False
343 False
Name: sex, Length: 344, dtype: bool
Keep in mind that True and False in Python are like 1 and 0, respectively. So, to count all the missing values, we need to find the sum of the .isna() output. For this purpose, we'll use DataFrame.sum(). Let's combine those two methods in one line:
df.sex.isna().sum()
Output:
11
The "sex" column contains eleven missing values.
3. All values
Next, we need to count the total number of observations. The overall number of cells in a column is the column length. To output it, we will use the DataFrame.shape attribute and take its first element (the number of rows):
df.shape[0]
Output:
344
We have 344 observations in total.
.shape is an attribute, not a method. You cannot call it, so no parentheses are needed. 4. Unique values
Suppose we want to find the number of unique penguin species that were studied on the Palmer Archipelago. It is similar to COUNT DISTINCT used in SQL. To do it, we will use the Series.nunique() method:
df.species.nunique()
Output:
3
There are three species. You could have guessed it already by looking at the picture at the start of this topic :)
Series.value_counts & Series.unique
You can also easily find the frequency of each unique value. It can be useful for categorical columns, like species. Let's find how many specimens of each species are included in the dataset using .value_counts():
df.species.value_counts()
Output:
Adelie 152
Gentoo 124
Chinstrap 68
Name: species, dtype: int64
This is a very useful method! You don't need pandas grouping or aggregations. Note that the result is a Series with a unique value as the index and a number as the value. So we can, for example, find particular entries in it:
df.species.value_counts().loc[['Adelie', 'Chinstrap']]
Output:
Adelie 152
Chinstrap 68
Name: species, dtype: int64
But sometimes what we need is just a list of unique values. To achieve it, use .unique():
df.species.unique()
Output:
array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)
It produces the list. Note that it is a NumPy array, but you can iterate through it or index it like a simple Python list. For example, let's add "The" and "penguin" to it:
[('The ' + species + ' penguin') for species in df.species.unique()]
Output:
['The Adelie penguin', 'The Chinstrap penguin', 'The Gentoo penguin']
Neat! Now you are ready to move on to more complex comparing measures.
Conclusion
Now you know how to:
- count null, non-null, distinct, and all values with
.isna().sum(),.count(),.nunique(),and.shape[0]methods respectively; - count the unique values with the
value_counts()method.