Computer scienceData scienceInstrumentsPandasData analysis with pandas

Crosstab in pandas

11 minutes read

Pandas.crosstab() computes a frequency table, which shows the number of observations for each variable combination for two or more variables. It shows the distribution of observations across different categories and the relationship between two categorical variables.

Pandas.crosstab

The functionality of the crosstab() function is very similar to the .pivot_table() method. If you look into its source code, you can see the .pivot_table() there. The two main differences are:

By default, if you don't pass the values of parameters values and aggfunc, the crosstab() builds a frequency (aka a contingency) table.
crosstab() is more suitable for creating a table from various indexed iterables, while .pivot_table() is for creating a table from the existing dataframes.

Let's see how it works with the Auto-mpg dataset.

import pandas as pd

df = pd.read_csv("auto-mpg.csv")

The first rows are as follows:

+----+-------+-------------+----------------+--------------+----------+----------------+--------------+----------+---------------------------+
|    |   mpg |   cylinders |   displacement |   horsepower |   weight |   acceleration |   model year |   origin | car name                  |
|----+-------+-------------+----------------+--------------+----------+----------------+--------------+----------+---------------------------|
|  0 |    18 |           8 |            307 |          130 |     3504 |           12   |           70 |        1 | chevrolet chevelle malibu |
|  1 |    15 |           8 |            350 |          165 |     3693 |           11.5 |           70 |        1 | buick skylark 320         |
|  2 |    18 |           8 |            318 |          150 |     3436 |           11   |           70 |        1 | plymouth satellite        |
|  3 |    16 |           8 |            304 |          150 |     3433 |           12   |           70 |        1 | amc rebel sst             |
|  4 |    17 |           8 |            302 |          140 |     3449 |           10.5 |           70 |        1 | ford torino               |
+----+-------+-------------+----------------+--------------+----------+----------------+--------------+----------+---------------------------+

In general, crosstab() not necessarily feeds the Pandas series. It can compute any explicitly indexed sequences (e.g. Python dictionaries or NumPy arrays, or any array-like types).

Let's find how many times in what year with how many cylinders the engines were used. First, index corresponds to the values to group by in the rows, and columns are the values to group by in the columns of the future frequency table. crosstab() returns a new dataframe, we can save it in a new variable and access the individual elements with either .loc or .iloc as:

ct = pd.crosstab(index=df.cylinders, columns=df.model_year)
value_iloc = cross.iloc[3, 0] # we access the value of the 4th row and the 1st column 
value_loc = cross.loc[6, 70]  # we access the same element - but 6 corresponds to the 'cylinders' and 70 corresponds to the 'model_year'

print(ct)
print(f'cylinders = 6; model_year = 70 with iloc: {value_iloc}')
print(f'cylinders = 6; model_year = 70 with loc: {value_loc}')

Output:

model_year  70  71  72  73  74  75  76  77  78  79  80  81  82
cylinders                                                     
3            0   0   1   1   0   0   0   1   0   0   1   0   0
4            7  13  14  11  15  12  15  14  17  12  25  21  28
5            0   0   0   0   0   0   0   0   1   1   1   0   0
6            4   8   0   8   7  12  10   5  12   6   2   7   3
8           18   7  13  20   5   6   9   8   6  10   0   1   0

cylinders = 6; model_year = 70 with iloc: 4
cylinders = 6; model_year = 70 with loc: 4

Now it's easy to find that the production of the 6-cylinder engines was very popular between 1975 and 1978. For the 8-cylinder — between 1970 and 1973.

Achieving the same result with .pivot_table() only is going to be a tall order:

df.pivot_table(index = 'cylinders', 
               columns='model_year', 
               values='car_name', 
               aggfunc='count').fillna(0).astype(int)

Output:

model_year  70  71  72  73  74  75  76  77  78  79  80  81  82
cylinders                                                     
3            0   0   1   1   0   0   0   1   0   0   1   0   0
4            7  13  14  11  15  12  15  14  17  12  25  21  28
5            0   0   0   0   0   0   0   0   1   1   1   0   0
6            4   8   0   8   7  12  10   5  12   6   2   7   3
8           18   7  13  20   5   6   9   8   6  10   0   1   0

A few words about the parameters. When analyzing a cross table, it is convenient to see the total by columns and rows. For this, the argument margins=True is used and the default totals are marked as All:

pd.crosstab(index=df.cylinders, columns=df.model_year, margins=True)

Output:

model_year  70  71  72  73  74  75  76  77  78  79  80  81  82  All
cylinders                                                          
3            0   0   1   1   0   0   0   1   0   0   1   0   0    4
4            7  13  14  11  15  12  15  14  17  12  25  21  28  204
5            0   0   0   0   0   0   0   0   1   1   1   0   0    3
6            4   8   0   8   7  12  10   5  12   6   2   7   3   84
8           18   7  13  20   5   6   9   8   6  10   0   1   0  103
All         29  28  28  40  27  30  34  28  36  29  29  29  31  398

Also, it often happens that you need to know not the absolute, but the relative frequency of objects (in fractions of the total number). Add the argument normalize=True and round to the second decimal place:

pd.crosstab(index=df.cylinders, columns=df.model_year, margins=True, normalize=True).round(2)

Output:

model_year    70    71    72    73    74    75    76    77    78    79    80    81    82   All 
cylinders                                                                      
3           0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.01
4           0.02  0.03  0.04  0.03  0.04  0.03  0.04  0.04  0.04  0.03  0.06  0.05  0.07  0.51 
5           0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.01 
6           0.01  0.02  0.00  0.02  0.02  0.03  0.03  0.01  0.03  0.02  0.01  0.02  0.01  0.21 
8           0.05  0.02  0.03  0.05  0.01  0.02  0.02  0.02  0.02  0.03  0.00  0.00  0.00  0.26 
All         0.07  0.07  0.07  0.10  0.07  0.08  0.09  0.07  0.09  0.07  0.07  0.07  0.08  1.00

As you can see from the bottom-right cell, 100% here is the total number of engines by all years and cylinders. It's not very informative. If we explore the cylinder number by production years, it'll be much more interesting to see the part from the total output of each particular engine recorded for all years. To show the total for each index value (each number of cylinders), we should reset the argument normalize toindex (when normalize is set to index, the sum of each row is 1):

pd.crosstab(index=df.cylinders, columns=df.model_year, margins=True, normalize='index').round(2)

Output:

model_year    70    71    72    73    74    75    76    77    78    79    80    81    82
cylinders                                                                      
3           0.00  0.00  0.25  0.25  0.00  0.00  0.00  0.25  0.00  0.00  0.25  0.00  0.00   
4           0.03  0.06  0.07  0.05  0.07  0.06  0.07  0.07  0.08  0.06  0.12  0.10  0.14 
5           0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.33  0.33  0.33  0.00  0.00 
6           0.05  0.10  0.00  0.10  0.08  0.14  0.12  0.06  0.14  0.07  0.02  0.08  0.04 
8           0.17  0.07  0.13  0.19  0.05  0.06  0.09  0.08  0.06  0.10  0.00  0.01  0.00 
All         0.07  0.07  0.07  0.10  0.07  0.08  0.09  0.07  0.09  0.07  0.07  0.07  0.08

The sum of the columns disappeared, but the sum of the rows remains. Look at the number in each cell. It indicates the engines with a specific cylinder number (a specific row) by a specific year of production.

If you need the opposite, set normalize='columns'.

You can also perform aggregations by specifying the aggfunc and the values parameters, but it's fully analogous to the way df.pivot_table() does it.

Conclusion

In today's topic, we have covered contingency tables, and how to create and adjust them. Here are some key points:

create contingency tables with crosstab();
add margins to the tables for convenience by using margins=True;
applying relative and absolute frequencies to contingency tables with normalize=True.

How did you like the theory?

Report a typo

Crosstab in pandas

Pandas.crosstab

Conclusion

Related topics