Aggregation is very important for data analysis. We use it when we need to provide an assessment of our dataset values. By performing aggregation, we transform our data into information.
In this topic, we are going to use the Palmer Penguins dataset as an example. You can import it from GitHub using the following lines (internet connection required):
import pandas as pd
df = pd.read_csv(
'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv')The title of each column speaks for itself:
df.head(3)Output:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 MALE
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 FEMALE
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 FEMALEDataFrame.aggregate
.aggregate() is a pandas DataFrame and Series method that is used for data aggregation. It can be used for one or many functions along any axis. To save some time for Really Important Things (like Machine Learning or Netflix), you may want to use the shorter version of the method — .agg(). Let's have a look at the following examples.
Suppose we want to find the median value of the penguin's body mass. You can do it by applying the .agg() method to the desired series:
df.body_mass_g.agg('median')Output:
4050.0Four kilos of black-and-white antarctic cuteness!
Tip: You can pass a function, the function name (as a string), a list of functions (or their names) or a dict to .agg(), for example, 'unique'in df.body_mass_g.agg('unique') will get the unique elements from body_mass_g column, 'nunique' will count the unique elements in the specified column, and 'sum' in df.body_mass_g.agg('sum')will produce the sum of body_mass_g values.
Also, you can pass a function like so (here we pass a Python built-in sum(), but any function that handles a DataFrame can be passed):
df.body_mass_g.agg(sum)We also could import numpy, run df.body_mass_g.agg(numpy.sum) and obtain the same result.
Another way to do it is by calling the .median() method from the Series:
df.body_mass_g.median()The difference is that by using the .agg() method, we are also able to apply several aggregating functions to different columns. For example, let's find the shortest and the average penguin bill as well as the longest and the average flipper. To do this, we need to call agg() and create a Python dictionary using the columns as dictionary keys. Also, we need to put the aggregating functions in lists:
df.agg({'bill_length_mm': ['min', 'mean'],
'flipper_length_mm': ['max', 'mean']
})Output:
bill_length_mm flipper_length_mm
min 32.10000 NaN
mean 43.92193 200.915205
max NaN 231.000000Tip: The function outputs NaN for the values we have not specified.
It's also possible to aggregate the data with your own functions. The example of the function below outputs the number of missing values. If the set contains no such values, it puts 0 (by default) with the help of the isna() function:
def count_nulls(series, ok_message=0):
if not series.isna().sum():
return ok_message
return len(series) - series.count()When using this custom function, all the values of the group are passed to the function as a Series object via the series argument. Let's use it in our DataFrame:
df.agg(count_nulls)Output:
species 0
island 0
bill_length_mm 2
bill_depth_mm 2
flipper_length_mm 2
body_mass_g 2
sex 11
dtype: int64If you need to pass a parameter to the agg() function, just list their names and values separated by a comma after the function name:
df.agg(count_nulls, ok_message='Hurray!')Output:
species Hurray!
island Hurray!
bill_length_mm 2
bill_depth_mm 2
flipper_length_mm 2
body_mass_g 2
sex 11
dtype: objectTip: If you use non-pandas functions (NumPy or your own) — don't put their names in quotes.
If you want to find the biggest value out of a specific set of features for each penguin (suppose we want to look at the biggest value out of 3 features — bill_length_mm, bill_depth_mm, and flipper_length_mm), apply .agg() over the columns by setting the axis argument to 'columns'.
We can also call the .agg() method to work with a data frame that contains only numeric values. Let's use the Python built-in max() function. It results in a pandas Series:
df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']].agg(max,
axis='columns')Output:
0 181.0
1 186.0
2 195.0
3 NaN
4 193.0
...
339 NaN
340 215.0
341 222.0
342 212.0
343 213.0
Length: 344, dtype: float64DataFrame.groupby
We can get a statistical output from different columns with aggregation, but to understand the data deeper, we need to have a closer look at various slices and combinations of columns. For this purpose, we can use groupby(). It's a very simple tool, especially for those who are already familiar with SQL.
Now let's check the median bill length for females and males. Group all penguins by their sex and aggregate them with one line:
df.groupby(['sex']).agg({'bill_length_mm':'median'})Output:
bill_length_mm
sex
FEMALE 42.8
MALE 46.8
The df.groupby(['sex']) alone outputs something like "<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002A1E7BE8358>". That's because it's just an object in memory with split values for indicated groups. The interpreter doesn't know what to output unless you directly specify it with .agg().
As you've noticed, the "sex" column contains 11 missing values (some penguins preferred not to share their gender). To include them in our grouping, set the groupby dropna argument to False (pandas 1.1 or higher required):
df.groupby('sex', dropna=False).agg({'bill_length_mm':'median'})Output:
bill_length_mm
sex
FEMALE 42.8
MALE 46.8
NaN 42.0Now we can assume that penguins of unknown sex are mostly females.
Next, let's suppose that the penguin's diet varies by island, and it affects the bill length. To check whether it makes sense or not, group our Antarctic friends by island and sex. Keep in mind that when we need to pass several arguments, we need to pass a list:
df.groupby(['island', 'sex']).agg({'bill_length_mm':'median'})Output:
bill_length_mm
island sex
Biscoe FEMALE 44.9
MALE 48.5
Dream FEMALE 42.5
MALE 49.1
Torgersen FEMALE 37.6
MALE 41.1The argument order in the .groupby() argument list determines the grouping order. Groups come first, then subgroups, subgroups of subgroups, and so on.
Tip: As you've seen thus far, groupbywill make the group labels you pass into it into index columns — the as_index parameter is responsible for that, and is set to True by default. For our last example, if we call .index.names, we can see that we have 2 indexes — 'island' and 'sex':
df.groupby(['island', 'sex']).agg({'bill_length_mm':'median'}).index.namesThe output will look like this: FrozenList(['island', 'sex'])
In case you want to avoid setting the index(es) to the group labels passed into .groupby(), you have to set as_index to False:
df.groupby(['island', 'sex'], as_index = False).agg({'bill_length_mm':'median'})
Conclusion
In this topic, you have learned how to:
calculate the column summary statistics with the help of
.agg()separate dataset into groups with the help of
.groupby().
These pandas methods can help you find outliers in the data, evaluate a series of observations, and compare groups against the desired statistical feature.
Nevertheless, bear in mind the possible result, when you conduct a complicated aggregation and grouping of real data. One diagram drawn on paper can save hours of fixing the code. If you keep getting errors and wrong answers, it probably has nothing to do with Python usage — your logic might be faulty.