DataFrame.apply()

The pandas DataFrame is one of the most commonly used data structures in data science. However, raw data rarely comes in the form we need, so it usually has to be transformed first. In this topic, you will learn how to transform pandas dataframes with the .apply() method.

When to use .apply()

There are many cases when you might need to change your data, and the .apply() method is a flexible choice for it. Here are some of them:

Transforming data: Apply various functions (e.g., square root, sum, division, or custom formulas) to each element in the dataframe, clean the data (e.g., remove specific characters or replace missing values), or perform text preprocessing (e.g., tokenization, stemming, or lemmatization).
Feature engineering: Create new features by applying a function to columns.
Aggregating data: Aggregate data by applying a function to groups of rows or columns (summary statistics such as mean, median, or mode).

.apply() on rows and columns

Before we start, let's create a pandas dataframe to use in the examples that follow:

import pandas as pd

df = pd.DataFrame({
   'Name': ['John', 'Jane', 'Bob', 'Mary', 'Ivan'],
   'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Moscow'],
   'Age': [32, 25, 47, 19, 45],
   'Income': [55000, 72000, 89000, 41000, 45000]
})

   Name         City  Age  Income
0  John     New York   32   55000
1  Jane  Los Angeles   25   72000
2   Bob      Chicago   47   89000
3  Mary      Houston   19   41000
4  Ivan       Moscow   45   45000

The shape of our frame is (5, 4), as there are 5 rows and 4 columns. The .apply() method can be used on both columns and rows.

To use .apply() on rows, set axis = 1: the function then receives each row as a Series. This way, you can modify rows or create new columns. Let's first change all rows:

def change_row(row):
    row['Name'] = row['Name'].upper()
    row['City'] = row['City'].lower()
    row['Age'] = row['Age'] + 10
    row['Income'] = row['Income'] * 1.1
    return row

df = df.apply(change_row, axis=1)
print(df)

   Name         City  Age   Income
0  JOHN     new york   42  60500.0
1  JANE  los angeles   35  79200.0
2   BOB      chicago   57  97900.0
3  MARY      houston   29  45100.0
4  IVAN       moscow   55  49500.0

Note that axis = 0 is the default value of the axis parameter, so you don't need to set it explicitly. With axis = 0, the function is applied to each column in the dataframe.
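To make the difference between the two axis values concrete, here is a minimal sketch. The scores dataframe and its column names are made up for this illustration and are not part of the running example.

```python
import pandas as pd

# Hypothetical data used only for this illustration
scores = pd.DataFrame({"math": [90, 70, 80], "physics": [60, 85, 75]})

# axis=0 (the default): the function receives each column as a Series
col_totals = scores.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series
row_totals = scores.apply(lambda row: row.sum(), axis=1)

print(col_totals)  # one value per column, indexed by column name
print(row_totals)  # one value per row, indexed by the row index
```

The first result is indexed by column names and the second by the row index, which is a quick way to check which axis a call actually used.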

In the change_row function above, we changed every row. You can also use a row-wise function to create a new column. For instance, let's create a tax column.

def add_tax(row):
    if row['Income'] > 60000:
        tax = row['Income'] * 0.1
    else:
        tax = row['Income'] * 0.05
    return tax

df['Tax'] = df.apply(add_tax, axis=1)
print(df)

   Name         City  Age   Income     Tax
0  JOHN     new york   42  60500.0  6050.0
1  JANE  los angeles   35  79200.0  7920.0
2   BOB      chicago   57  97900.0  9790.0
3  MARY      houston   29  45100.0  2255.0
4  IVAN       moscow   55  49500.0  2475.0

You can also call .apply() on a single column. In this case, the function receives each value of the column, so it doesn't need to refer to the column by name. For instance, we can add _Smith to all names in the Name column.

def add_suffix(value, suffix):
    return value + suffix

# Apply the function to a single column
df['Name'] = df['Name'].apply(add_suffix, suffix='_Smith')
print(df)

         Name         City  Age   Income     Tax
0  JOHN_Smith     new york   42  60500.0  6050.0
1  JANE_Smith  los angeles   35  79200.0  7920.0
2   BOB_Smith      chicago   57  97900.0  9790.0
3  MARY_Smith      houston   29  45100.0  2255.0
4  IVAN_Smith       moscow   55  49500.0  2475.0
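Extra arguments can reach the applied function either as keyword arguments, as above, or positionally through the args parameter. The sketch below shows both on a small Series; the decorate function and the sample names are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical sample data
names = pd.Series(["John", "Jane", "Bob"])

def decorate(value, prefix, suffix=""):
    # value is a single element of the Series
    return f"{prefix}{value}{suffix}"

# prefix is passed positionally via args, suffix as a keyword argument
decorated = names.apply(decorate, args=("Dr. ",), suffix=" Jr.")
print(decorated)
```

Note that args must be a tuple, so a single extra argument needs a trailing comma.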

You can also change several columns at once:

def add_value(number):
    return number + 100


# Here, axis = 0 by default - thus, the function is applied to each column
df[["Income", "Tax"]] = df[["Income", "Tax"]].apply(add_value)
print(df)

         Name         City  Age   Income     Tax
0  JOHN_Smith     new york   42  60600.0  6150.0
1  JANE_Smith  los angeles   35  79300.0  8020.0
2   BOB_Smith      chicago   57  98000.0  9890.0
3  MARY_Smith      houston   29  45200.0  2355.0
4  IVAN_Smith       moscow   55  49600.0  2575.0

Types of functions to use in .apply()

In the examples above, we have created additional functions to be applied to rows and columns. In practice, you can use the .apply() method with a wide range of different functions. Here are some of them:

Built-in functions: you can use any built-in function in Python, such as max(), min(), len(), sum(), and so on. For example, you could use the max() function to find the maximum value in a row:

df_nums = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

df_nums['MaxValue'] = df_nums.apply(max, axis=1)

    A  B  C  MaxValue
 0  1  4  7         7
 1  2  5  8         8
 2  3  6  9         9

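Lambda functions: you can define a lambda function on the fly and pass it to the .apply() method. For example, the earlier _Smith suffix example can be rewritten with a lambda function alone. Since the Name column already carries the suffix at this point, the output below shows it twice.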

df['Name'] = df['Name'].apply(lambda x: x+'_Smith')

                Name         City  Age   Income     Tax
 0  JOHN_Smith_Smith     new york   42  60600.0  6150.0
 1  JANE_Smith_Smith  los angeles   35  79300.0  8020.0
 2   BOB_Smith_Smith      chicago   57  98000.0  9890.0
 3  MARY_Smith_Smith      houston   29  45200.0  2355.0
 4  IVAN_Smith_Smith       moscow   55  49600.0  2575.0

Custom functions: you can define your own functions and pass them to the .apply() method, as we did in the row and column transformations above. Here is one more example:

def calc_new_income(row):
    return row['Income'] - row['Tax'] - 100

df['Income_new'] = df.apply(calc_new_income, axis=1)

                Name         City  Age   Income     Tax  Income_new
 0  JOHN_Smith_Smith     new york   42  60600.0  6150.0     54350.0
 1  JANE_Smith_Smith  los angeles   35  79300.0  8020.0     71180.0
 2   BOB_Smith_Smith      chicago   57  98000.0  9890.0     88010.0
 3  MARY_Smith_Smith      houston   29  45200.0  2355.0     42745.0
 4  IVAN_Smith_Smith       moscow   55  49600.0  2575.0     46925.0

NumPy functions: you can also use any function from the NumPy library. NumPy provides a wide range of mathematical functions, such as sin(), cos(), and sqrt(), and you can pass them to .apply() in the same way as built-in functions.

import numpy as np

df['Tax_sqrt'] = df['Tax'].apply(np.sqrt)

               Name         City  Age   Income     Tax  Income_new   Tax_sqrt
0  JOHN_Smith_Smith     new york   42  60600.0  6150.0     54350.0  78.421936
1  JANE_Smith_Smith  los angeles   35  79300.0  8020.0     71180.0  89.554453
2   BOB_Smith_Smith      chicago   57  98000.0  9890.0     88010.0  99.448479
3  MARY_Smith_Smith      houston   29  45200.0  2355.0     42745.0  48.528342
4  IVAN_Smith_Smith       moscow   55  49600.0  2575.0     46925.0  50.744458

Pandas functions: you can use other pandas functions in the .apply() method, such as isnull(), to_numeric(), and more.

df['null_tax'] = df['Tax'].apply(pd.isnull)

      Tax  null_tax
0  6150.0     False
1  8020.0     False
2  9890.0     False
3  2355.0     False
4  2575.0     False
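The text also mentions to_numeric(). As a minimal sketch of how it could be combined with .apply(), the example below converts string columns to numbers column by column. The raw dataframe and its values are invented for the illustration; errors='coerce' is a keyword argument of pd.to_numeric that .apply() forwards to it.

```python
import pandas as pd

# Hypothetical raw data where numbers arrived as strings
raw = pd.DataFrame({
    "Age": ["32", "25", "unknown"],
    "Income": ["55000", "n/a", "89000"],
})

# Each column is passed to pd.to_numeric; values that cannot be parsed become NaN
clean = raw.apply(pd.to_numeric, errors="coerce")

print(clean)
print(clean.dtypes)
```

After the conversion, both columns have a numeric dtype, and the unparseable entries show up as missing values that can be handled separately.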

The result type parameter of .apply()

Another important parameter of the .apply() method is result_type. It can be set to 'expand', 'reduce', or 'broadcast' (the default is None). Each option controls the form of the result that .apply() returns.

'expand' turns list-like results of the applied function into columns, so the result is a DataFrame. That means we can create a whole new dataframe like in the example below. Note that a function that returns a Series is expanded into columns even without this option.

def calculate_tax(income, tax):
    # The tax rate is the share of income paid as tax, in percent
    return pd.Series({'Tax rate': (tax / income) * 100, 'Tax rank': 10000 - tax})

result_tax = df[['Income', 'Tax']].apply(lambda x: calculate_tax(*x), axis=1, result_type='expand')

     Tax rate  Tax rank
 0  10.148515    3850.0
 1  10.113493    1980.0
 2  10.091837     110.0
 3   5.210177    7645.0
 4   5.191532    7425.0

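'reduce' is the opposite of 'expand': it returns a Series if possible rather than expanding list-like results into columns. If the function already returns a single value per row, as in the example below, the result is a Series either way. You can store the result as a new column.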

def sum_row(row):
    return row['Income'] + row['Tax']

result_income_sum = df.apply(sum_row, result_type='reduce', axis=1)


 0     66750.0
 1     87320.0
 2    107890.0
 3     47555.0
 4     52175.0
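The difference between 'expand' and 'reduce' only becomes visible when the function returns a list-like value. The following sketch, with a hypothetical min_max function and a small numeric frame, compares the default behavior with both options.

```python
import pandas as pd

nums = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

def min_max(row):
    # Returns a plain list, i.e. a list-like result
    return [row.min(), row.max()]

as_default = nums.apply(min_max, axis=1)                        # Series of lists
as_expand = nums.apply(min_max, axis=1, result_type="expand")   # DataFrame with columns 0 and 1
as_reduce = nums.apply(min_max, axis=1, result_type="reduce")   # Series of lists

print(as_expand)
print(as_reduce)
```

With 'expand', each list element gets its own column, while 'reduce' keeps one list per row in a single Series.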

'broadcast' returns the frame of the original shape with the function applied to the rows or columns. This means that the result of the function is repeated for each element of the corresponding row or column. For example, we can set the mean tax and income for everyone.

def mean_of_column(col):
    return col.mean()

result = df[['Income', 'Tax']].apply(mean_of_column, result_type='broadcast')

     Income     Tax
 0  66540.0  5798.0
 1  66540.0  5798.0
 2  66540.0  5798.0
 3  66540.0  5798.0
 4  66540.0  5798.0
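The same option can be used row-wise. The sketch below is an illustration on made-up data: it repeats the maximum of each row across that row. The marks frame is hypothetical, and it uses float columns on purpose so that the broadcast values keep the same dtype as the original frame.

```python
import pandas as pd

# Hypothetical data; floats are used so the result dtype matches the input
marks = pd.DataFrame({
    "test_1": [3.0, 1.0],
    "test_2": [4.0, 2.0],
    "test_3": [5.0, 4.0],
})

# Each row's maximum is repeated for every column of that row
row_best = marks.apply(lambda row: row.max(), axis=1, result_type="broadcast")

print(row_best)
```

The result keeps the original index and column names, and only the values are replaced.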

.apply(), .applymap(), .map(), and np.broadcast()

Apart from .apply(), pandas also has the .applymap() and .map() methods, and NumPy has the broadcast() function. Let's look at the differences between them.

.apply() on a DataFrame passes each column (axis = 0) or each row (axis = 1) to the function; on a Series, it passes each value. The result is a Series or a DataFrame, depending on what the function returns.
.applymap() applies a function element-wise to each value in a DataFrame. The result is a DataFrame of the same shape. Since pandas 2.1, .applymap() is deprecated in favor of DataFrame.map().
.map() applies a function element-wise to each value in a Series (it also accepts a dictionary or a Series as a mapping). The result is a Series of the same shape.
np.broadcast() does not apply a function at all: it takes several arrays and returns an object that describes how they broadcast against each other, including the resulting shape. It shares the idea of broadcasting with result_type = 'broadcast', but it is not a counterpart of .apply(). To repeat an array's values up to a given shape in NumPy, use np.broadcast_to().
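The comparison above can be sketched in a few lines. The df_small frame is made up for the illustration, and the hasattr check is only an assumption-free way to pick the element-wise DataFrame method that exists in the installed pandas version.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Series.map: element-wise on a single column
doubled_a = df_small["A"].map(lambda v: v * 2)

# DataFrame.apply: the function receives a whole column (axis=0 by default)
col_range = df_small.apply(lambda col: col.max() - col.min())

# Element-wise on the whole frame: DataFrame.map in pandas 2.1+, applymap before that
elementwise = df_small.map if hasattr(pd.DataFrame, "map") else df_small.applymap
squared = elementwise(lambda v: v ** 2)

# np.broadcast only describes how arrays combine; it applies no function
b = np.broadcast(np.array([[1], [2], [3]]), np.array([10, 20]))

print(doubled_a.tolist())
print(col_range.to_dict())
print(squared)
print(b.shape)
```

Here a column of three values and a row of two values broadcast to a 3 by 2 shape, which is all that np.broadcast reports.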

Check the progress of .apply() with the tqdm library

In our examples so far, we have used small frames. However, on large dataframes, .apply() can take a long time. To track its progress, you can use the tqdm library. Here is how to install it:

pip install tqdm

You can call tqdm inside a custom function, but it is simpler to track the progress of the whole .apply() call. To do this, import the library, call tqdm.pandas(), and replace .apply() with .progress_apply(). The mean_of_column example from the previous section will look like this:

from tqdm import tqdm

tqdm.pandas()

result = df[['Income', 'Tax']].progress_apply(mean_of_column, result_type='broadcast')

# A progress bar like the one below appears. The frame has two columns, so the bar counts to 2.

# 100%|██████████| 2/2 [00:00<00:00, 1981.72it/s]

Conclusion

In this topic, we discussed the pandas .apply() method and its various parameters. The .apply() method allows you to apply a function to a DataFrame or Series in order to transform or manipulate the data. It's a flexible tool you can use with a wide range of input functions and parameters to control the output format.