Computer scienceData scienceInstrumentsPandasData analysis with pandas

Grouping and aggregating data in pandas

14 minutes read

Aggregation is very important for data analysis. We use it when we need to provide an assessment of our dataset values. By performing aggregation, we transform our data into information.

In this topic, we are going to use the Palmer Penguins dataset as an example. You can import it from GitHub using the following lines (internet connection required):

import pandas as pd
df = pd.read_csv(
'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv')

The title of each column speaks for itself:

df.head(3)

Output:

 	species    island 	bill_length_mm 	bill_depth_mm 	flipper_length_mm 	body_mass_g 	sex
0 	Adelie 	Torgersen 	          39.1 	         18.7 	            181.0 	     3750.0    MALE
1 	Adelie 	Torgersen 	          39.5 	         17.4 	            186.0 	     3800.0  FEMALE
2 	Adelie 	Torgersen 	          40.3 	         18.0 	            195.0 	     3250.0  FEMALE

The R Palmer Penguins data package logo

DataFrame.aggregate

.aggregate() is a pandas DataFrame and Series method that is used for data aggregation. It can be used for one or many functions along any axis. To save some time for Really Important Things (like Machine Learning or Netflix), you may want to use the shorter version of the method — .agg(). Let's have a look at the following examples.

Suppose we want to find the median value of the penguin's body mass. You can do it by applying the .agg() method to the desired series:

df.body_mass_g.agg('median')

Output:

4050.0

Four kilos of black-and-white antarctic cuteness!

Tip: You can pass a function, the function name (as a string), a list of functions (or their names) or a dict to .agg(), for example, 'unique'in df.body_mass_g.agg('unique') will get the unique elements from body_mass_g column, 'nunique' will count the unique elements in the specified column, and 'sum' in df.body_mass_g.agg('sum')will produce the sum of body_mass_g values.

Also, you can pass a function like so (here we pass a Python built-in sum(), but any function that handles a DataFrame can be passed):

df.body_mass_g.agg(sum)

We also could import numpy, run df.body_mass_g.agg(numpy.sum) and obtain the same result.

Another way to do it is by calling the .median() method from the Series:

df.body_mass_g.median()

The difference is that by using the .agg() method, we are also able to apply several aggregating functions to different columns. For example, let's find the shortest and the average penguin bill as well as the longest and the average flipper. To do this, we need to call agg() and create a Python dictionary using the columns as dictionary keys. Also, we need to put the aggregating functions in lists:

df.agg({'bill_length_mm': ['min', 'mean'],
         'flipper_length_mm': ['max', 'mean']
        })

Output:

 	bill_length_mm 	flipper_length_mm
 min 	  32.10000 	              NaN
mean 	  43.92193 	       200.915205
 max 	       NaN 	       231.000000

Tip: The function outputs NaN for the values we have not specified.

It's also possible to aggregate the data with your own functions. The example of the function below outputs the number of missing values. If the set contains no such values, it puts 0 (by default) with the help of the isna() function:

def count_nulls(series, ok_message=0):
    if not series.isna().sum():
        return ok_message
    return len(series) - series.count()

When using this custom function, all the values of the group are passed to the function as a Series object via the series argument. Let's use it in our DataFrame:

df.agg(count_nulls)

Output:

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

If you need to pass a parameter to the agg() function, just list their names and values separated by a comma after the function name:

df.agg(count_nulls, ok_message='Hurray!')

Output:

species              Hurray!
island               Hurray!
bill_length_mm             2
bill_depth_mm              2
flipper_length_mm          2
body_mass_g                2
sex                       11
dtype: object

Tip: If you use non-pandas functions (NumPy or your own) — don't put their names in quotes.

If you want to find the biggest value out of a specific set of features for each penguin (suppose we want to look at the biggest value out of 3 features — bill_length_mm, bill_depth_mm, and flipper_length_mm), apply .agg() over the columns by setting the axis argument to 'columns'.

We can also call the .agg() method to work with a data frame that contains only numeric values. Let's use the Python built-in max() function. It results in a pandas Series:

df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']].agg(max, 
axis='columns')

Output:

0      181.0
1      186.0
2      195.0
3        NaN
4      193.0
       ...  
339      NaN
340    215.0
341    222.0
342    212.0
343    213.0
Length: 344, dtype: float64

DataFrame.groupby

We can get a statistical output from different columns with aggregation, but to understand the data deeper, we need to have a closer look at various slices and combinations of columns. For this purpose, we can use groupby(). It's a very simple tool, especially for those who are already familiar with SQL.

Now let's check the median bill length for females and males. Group all penguins by their sex and aggregate them with one line:

df.groupby(['sex']).agg({'bill_length_mm':'median'})

Output:

       bill_length_mm
   sex 	
FEMALE 	         42.8
  MALE 	         46.8

The df.groupby(['sex']) alone outputs something like "<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002A1E7BE8358>". That's because it's just an object in memory with split values for indicated groups. The interpreter doesn't know what to output unless you directly specify it with .agg().

As you've noticed, the "sex" column contains 11 missing values (some penguins preferred not to share their gender). To include them in our grouping, set the groupby dropna argument to False (pandas 1.1 or higher required):

df.groupby('sex', dropna=False).agg({'bill_length_mm':'median'})

Output:

       bill_length_mm
   sex 	
FEMALE 	         42.8
  MALE 	         46.8
   NaN 	         42.0

Now we can assume that penguins of unknown sex are mostly females.

Next, let's suppose that the penguin's diet varies by island, and it affects the bill length. To check whether it makes sense or not, group our Antarctic friends by island and sex. Keep in mind that when we need to pass several arguments, we need to pass a list:

df.groupby(['island', 'sex']).agg({'bill_length_mm':'median'})

Output:

                  bill_length_mm
island    sex                   
Biscoe    FEMALE            44.9
          MALE              48.5
Dream     FEMALE            42.5
          MALE              49.1
Torgersen FEMALE            37.6
          MALE              41.1

The argument order in the .groupby() argument list determines the grouping order. Groups come first, then subgroups, subgroups of subgroups, and so on.

Tip: As you've seen thus far, groupbywill make the group labels you pass into it into index columns — the as_index parameter is responsible for that, and is set to True by default. For our last example, if we call .index.names, we can see that we have 2 indexes — 'island' and 'sex':

df.groupby(['island', 'sex']).agg({'bill_length_mm':'median'}).index.names

The output will look like this: FrozenList(['island', 'sex'])

In case you want to avoid setting the index(es) to the group labels passed into .groupby(), you have to set as_index to False:

df.groupby(['island', 'sex'], as_index = False).agg({'bill_length_mm':'median'})

Conclusion

In this topic, you have learned how to:

  • calculate the column summary statistics with the help of .agg()

  • separate dataset into groups with the help of .groupby().

These pandas methods can help you find outliers in the data, evaluate a series of observations, and compare groups against the desired statistical feature.

Nevertheless, bear in mind the possible result, when you conduct a complicated aggregation and grouping of real data. One diagram drawn on paper can save hours of fixing the code. If you keep getting errors and wrong answers, it probably has nothing to do with Python usage — your logic might be faulty.

69 learners liked this piece of theory. 2 didn't like it. What about you?
Report a typo