Pandas dataframes are one of the most common ways to hold tabular data in Python. In most projects, though, the data has to be changed before it is useful. In this topic, you will learn how to transform pandas dataframes with the .apply() method.
When to use .apply()
There are many cases in which you might need to change your data, and the .apply() method is often a convenient choice. Here are some of them:
Transforming data: Apply various functions (e.g., root, sum, division, or custom formulas) to the values in the dataframe, clean the data (e.g., remove specific characters or replace missing values), or perform text preprocessing (e.g., tokenization, stemming, or lemmatization).
Feature engineering: Create new features by applying a function to columns.
Aggregating data: Aggregate data by applying a function to groups of rows or columns (summary statistics such as mean, median, or mode).
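As a rough illustration of the aggregation case, the sketch below reduces each numeric column of a small, made-up frame to one number, and then aggregates groups of rows through groupby. The sales frame, its column names, and the value_range helper are assumptions made for this illustration only.

```python
import pandas as pd

# Hypothetical sample data, used only for this sketch
sales = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South'],
    'Units': [10, 20, 30, 50],
    'Price': [2.0, 4.0, 3.0, 5.0],
})

def value_range(col):
    # Reduce a whole column to one number: max minus min
    return col.max() - col.min()

# With the default axis, the function receives each column as a Series
ranges = sales[['Units', 'Price']].apply(value_range)
print(ranges)

# Aggregate groups of rows: mean Units per Region
units_by_region = sales.groupby('Region')['Units'].apply(lambda s: s.mean())
print(units_by_region)
```

Both results are Series: the first is indexed by column name, the second by region.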
.apply() on rows and columns
Before we start, let's create a pandas dataframe to use for all our future functions:
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Bob', 'Mary', 'Ivan'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Moscow'],
    'Age': [32, 25, 47, 19, 45],
    'Income': [55000, 72000, 89000, 41000, 45000]
})

   Name         City  Age  Income
0  John     New York   32   55000
1  Jane  Los Angeles   25   72000
2   Bob      Chicago   47   89000
3  Mary      Houston   19   41000
4  Ivan       Moscow   45   45000

The shape of our frame is (5, 4), as there are 5 rows and 4 columns. The .apply() method can be used on both columns and rows.
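To make the difference between the two directions concrete, here is a minimal sketch on a separate, purely numeric frame. The frame nums and the helper describe are hypothetical names chosen for this illustration; they are not part of the running example.

```python
import pandas as pd

# Hypothetical numeric frame, separate from df above
nums = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

def describe(series):
    # The function receives a Series either way; only its contents differ
    return f"{series.name}: {series.tolist()}"

# axis=0 (default): called once per column, the Series holds that column's values
per_column = nums.apply(describe)

# axis=1: called once per row, the Series holds that row's values
per_row = nums.apply(describe, axis=1)

print(per_column)
print(per_row)
```

In the column-wise call the Series name is the column label, while in the row-wise call it is the row's index label.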
To apply a function to each row, set axis=1. The function then receives one row at a time as a Series. You can use this both to change existing values and to create new columns. Let's first change all rows:
def change_row(row):
    row['Name'] = row['Name'].upper()
    row['City'] = row['City'].lower()
    row['Age'] = row['Age'] + 10
    row['Income'] = row['Income'] * 1.1
    return row

df = df.apply(change_row, axis=1)
print(df)

   Name         City  Age   Income
0  JOHN     new york   42  60500.0
1  JANE  los angeles   35  79200.0
2   BOB      chicago   57  97900.0
3  MARY      houston   29  45100.0
4  IVAN       moscow   55  49500.0

Note that axis=0 is the default value of the axis parameter, so you don't need to set it every time. With axis=0, the function is applied to each column in the dataframe.
The change_row function modified every row. You can also use the values in each row to build a new column. For instance, let's create a tax column:
def add_tax(row):
    if row['Income'] > 60000:
        tax = row['Income'] * 0.1
    else:
        tax = row['Income'] * 0.05
    return tax

df['Tax'] = df.apply(add_tax, axis=1)
print(df)

   Name         City  Age   Income     Tax
0  JOHN     new york   42  60500.0  6050.0
1  JANE  los angeles   35  79200.0  7920.0
2   BOB      chicago   57  97900.0  9790.0
3  MARY      houston   29  45100.0  2255.0
4  IVAN       moscow   55  49500.0  2475.0

You can also call .apply() on a single column, so the function doesn't need to select the column itself. For instance, we can add _Smith to all names in the Name column:
def add_suffix(value, suffix):
    # On a single column, .apply() passes one element at a time,
    # so value is a single name rather than the whole column
    return value + suffix

# Apply the function to a single column
df['Name'] = df['Name'].apply(add_suffix, suffix='_Smith')
print(df)

         Name         City  Age   Income     Tax
0  JOHN_Smith     new york   42  60500.0  6050.0
1  JANE_Smith  los angeles   35  79200.0  7920.0
2   BOB_Smith      chicago   57  97900.0  9790.0
3  MARY_Smith      houston   29  45100.0  2255.0
4  IVAN_Smith       moscow   55  49500.0  2475.0

You can also change several columns at once:
def add_value(number):
    return number + 100

# Here, axis = 0 by default - thus, the function is applied to each column
df[["Income", "Tax"]] = df[["Income", "Tax"]].apply(add_value)
print(df)

         Name         City  Age   Income     Tax
0  JOHN_Smith     new york   42  60600.0  6150.0
1  JANE_Smith  los angeles   35  79300.0  8020.0
2   BOB_Smith      chicago   57  98000.0  9890.0
3  MARY_Smith      houston   29  45200.0  2355.0
4  IVAN_Smith       moscow   55  49600.0  2575.0

Types of functions to use in .apply()
In the examples above, we have created additional functions to be applied to rows and columns. In practice, you can use the .apply() method with a wide range of different functions. Here are some of them:
Built-in functions: you can use Python built-in functions that accept a sequence of values, such as max(), min(), len(), sum(), and so on. For example, you could use the max() function to find the maximum value in each row:

df_nums = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})
df_nums['MaxValue'] = df_nums.apply(max, axis=1)

   A  B  C  MaxValue
0  1  4  7         7
1  2  5  8         8
2  3  6  9         9

Lambda functions: you can define a lambda function on the fly and pass it to the .apply() method. For example, the earlier _Smith suffix example can be written with only a lambda function:
df['Name'] = df['Name'].apply(lambda x: x + '_Smith')

               Name         City  Age   Income     Tax
0  JOHN_Smith_Smith     new york   42  60600.0  6150.0
1  JANE_Smith_Smith  los angeles   35  79300.0  8020.0
2   BOB_Smith_Smith      chicago   57  98000.0  9890.0
3  MARY_Smith_Smith      houston   29  45200.0  2355.0
4  IVAN_Smith_Smith       moscow   55  49600.0  2575.0

The suffix now appears twice because the names already carried it from the earlier example.

Custom functions: you can define your own functions and pass them to the .apply() method. We did this in the earlier row and column transformations. Here is one more example:
def calc_new_income(row):
    return row['Income'] - row['Tax'] - 100

df['Income_new'] = df.apply(calc_new_income, axis=1)

               Name         City  Age   Income     Tax  Income_new
0  JOHN_Smith_Smith     new york   42  60600.0  6150.0     54350.0
1  JANE_Smith_Smith  los angeles   35  79300.0  8020.0     71180.0
2   BOB_Smith_Smith      chicago   57  98000.0  9890.0     88010.0
3  MARY_Smith_Smith      houston   29  45200.0  2355.0     42745.0
4  IVAN_Smith_Smith       moscow   55  49600.0  2575.0     46925.0

NumPy functions: you can also use functions from the NumPy library. NumPy provides a wide range of mathematical functions, such as sin(), cos(), and sqrt(), and you can pass them to .apply() just like built-in functions:
import numpy as np

df['Tax_sqrt'] = df['Tax'].apply(np.sqrt)

               Name         City  Age   Income     Tax  Income_new   Tax_sqrt
0  JOHN_Smith_Smith     new york   42  60600.0  6150.0     54350.0  78.421936
1  JANE_Smith_Smith  los angeles   35  79300.0  8020.0     71180.0  89.554453
2   BOB_Smith_Smith      chicago   57  98000.0  9890.0     88010.0  99.448479
3  MARY_Smith_Smith      houston   29  45200.0  2355.0     42745.0  48.528342
4  IVAN_Smith_Smith       moscow   55  49600.0  2575.0     46925.0  50.744458

Pandas functions: you can use other pandas functions in the .apply() method, such as isnull(), to_numeric(), and more.
df['null_tax'] = df['Tax'].apply(pd.isnull)

      Tax  null_tax
0  6150.0     False
1  8020.0     False
2  9890.0     False
3  2355.0     False
4  2575.0     False

The result type parameter of .apply()
Another important parameter of the .apply() method is result_type. It can be set to 'expand', 'reduce', or 'broadcast'; the default is None. Each option controls the form of the value that .apply() returns.
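As a quick preview of how these options change the output, the sketch below applies the same list-returning function to a small hypothetical frame with each setting. The scores frame and the min_max helper are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical frame used only for this comparison
scores = pd.DataFrame({'Math': [90, 60, 75], 'Art': [70, 80, 85]})

def min_max(row):
    # Return a list-like result so the effect of result_type is visible
    return [row.min(), row.max()]

# Default (None): list-like results are kept as a Series of lists
default_result = scores.apply(min_max, axis=1)

# 'expand': each list is spread across columns, giving a DataFrame
expanded = scores.apply(min_max, axis=1, result_type='expand')

# 'reduce': a Series is returned rather than expanding the lists
reduced = scores.apply(min_max, axis=1, result_type='reduce')

# 'broadcast': results are fitted back into the original shape and labels
broadcasted = scores.apply(min_max, axis=1, result_type='broadcast')

print(type(default_result).__name__, type(expanded).__name__, type(reduced).__name__)
print(broadcasted)
```

The function returns two values per row and the frame has two columns, which is what lets 'broadcast' fit the results back under the original column labels.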
'expand' turns list-like results into columns, so .apply() returns a DataFrame. That means we can create a whole new dataframe like in the example below.
def calculate_tax(income, tax):
    # 'Tax rate' here is income divided by tax, expressed as a percentage
    return pd.Series({'Tax rate': (income / tax) * 100, 'Tax rank': 10000 - tax})

result_tax = df[['Income', 'Tax']].apply(lambda x: calculate_tax(*x), axis=1, result_type='expand')

      Tax rate  Tax rank
0   985.365854    3850.0
1   988.778055    1980.0
2   990.899899     110.0
3  1919.320594    7645.0
4  1926.213592    7425.0

'reduce' returns a Series where possible instead of expanding list-like results; it is the opposite of 'expand'. In the example below, each row is reduced to a single value.
def sum_row(row):
    return row['Income'] + row['Tax']

result_income_sum = df.apply(sum_row, result_type='reduce', axis=1)

0     66750.0
1     87320.0
2    107890.0
3     47555.0
4     52175.0

'broadcast' returns the frame of the original shape with the function applied to the rows or columns. This means that the result of the function is repeated for each element of the corresponding row or column. For example, we can set the mean tax and income for everyone.
def mean_of_column(col):
    return col.mean()

result = df[['Income', 'Tax']].apply(mean_of_column, result_type='broadcast')

    Income     Tax
0  66540.0  5798.0
1  66540.0  5798.0
2  66540.0  5798.0
3  66540.0  5798.0
4  66540.0  5798.0

.apply(), .applymap(), .map(), and np.broadcast()
Apart from .apply(), pandas also has other methods such as .applymap() and .map(), and NumPy has the broadcast() function. Let's look at the differences between them.
.apply() is used to apply a function to each row or column of a DataFrame, or to each value of a Series. The result is a Series or DataFrame, depending on what the function returns.
.applymap() is used to apply a function element-wise to each value in a DataFrame. The result is a DataFrame of the same shape. Since pandas 2.1, .applymap() is deprecated in favor of DataFrame.map().
.map() is used to apply a function element-wise to each value in a Series. The result is a Series of the same shape.
np.broadcast() does not apply a function. It takes several arrays and returns an object that describes how they broadcast against each other, including the resulting shape. It resembles .apply() with result_type='broadcast' only in that both stretch values to fit a common shape.
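The sketch below contrasts these tools on a small hypothetical frame. The frame, its values, and the variable names are assumptions for illustration. Because .applymap() is deprecated in recent pandas releases, the sketch uses DataFrame.map() when it is available and falls back to .applymap() otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this comparison
frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# .apply(): the function sees a whole column (or row) at a time
column_totals = frame.apply(sum)                # one value per column

# Element-wise on a DataFrame: DataFrame.map() in pandas 2.1+, .applymap() before that
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
squared = elementwise(lambda v: v ** 2)         # same shape as frame

# .map(): element-wise on a single Series
labels = frame['A'].map(lambda v: f'item_{v}')  # same length as the column

# np.broadcast(): describes how arrays broadcast together; it applies no function
b = np.broadcast(np.array([[1], [2], [3]]), np.array([10, 20]))

print(column_totals.tolist(), squared.shape, labels.tolist(), b.shape)
```

Only .apply() changes the shape here: it collapses each column to one value, while the element-wise calls keep the input shape.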
Check the progress of .apply() with the tqdm library
In our examples so far, we have used small frames. However, sometimes we need to perform time-consuming operations. To view the progress of the results, you can use the tqdm library. Here is how you install it:
pip install tqdm

You can use tqdm inside a custom function, or you can track the progress of a whole .apply() call. For the latter, import the library, call tqdm.pandas(), and change .apply() to .progress_apply(). The broadcast example from the previous section will look like this:

from tqdm import tqdm

tqdm.pandas()
result = df[['Income', 'Tax']].progress_apply(mean_of_column, result_type='broadcast')
# A progress bar like the one below appears; the frame has two columns, so the bar counts to 2
# 100%|██████████| 2/2 [00:00<00:00, 1981.72it/s]

Conclusion
In this topic, we discussed the pandas .apply() method and its various parameters. The .apply() method allows you to apply a function to a DataFrame or Series in order to transform or manipulate the data. It's a flexible tool you can use with a wide range of input functions and parameters to control the output format.