Pandas.crosstab() computes a frequency table, which shows the number of observations for each variable combination for two or more variables. It shows the distribution of observations across different categories and the relationship between two categorical variables.
Pandas.crosstab
The functionality of the crosstab() function is very similar to the .pivot_table() method. If you look into its source code, you can see the .pivot_table() there. The two main differences are:
- By default, if you don't pass the values of parameters
valuesandaggfunc, thecrosstab()builds a frequency (aka a contingency) table. crosstab()is more suitable for creating a table from various indexed iterables, while.pivot_table()is for creating a table from the existing dataframes.
Let's see how it works with the Auto-mpg dataset.
import pandas as pd
df = pd.read_csv("auto-mpg.csv")
The first rows are as follows:
+----+-------+-------------+----------------+--------------+----------+----------------+--------------+----------+---------------------------+
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|----+-------+-------------+----------------+--------------+----------+----------------+--------------+----------+---------------------------|
| 0 | 18 | 8 | 307 | 130 | 3504 | 12 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18 | 8 | 318 | 150 | 3436 | 11 | 70 | 1 | plymouth satellite |
| 3 | 16 | 8 | 304 | 150 | 3433 | 12 | 70 | 1 | amc rebel sst |
| 4 | 17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
+----+-------+-------------+----------------+--------------+----------+----------------+--------------+----------+---------------------------+
In general, crosstab() not necessarily feeds the Pandas series. It can compute any explicitly indexed sequences (e.g. Python dictionaries or NumPy arrays, or any array-like types).
Let's find how many times in what year with how many cylinders the engines were used. First, index corresponds to the values to group by in the rows, and columns are the values to group by in the columns of the future frequency table. crosstab() returns a new dataframe, we can save it in a new variable and access the individual elements with either .loc or .iloc as:
ct = pd.crosstab(index=df.cylinders, columns=df.model_year)
value_iloc = cross.iloc[3, 0] # we access the value of the 4th row and the 1st column
value_loc = cross.loc[6, 70] # we access the same element - but 6 corresponds to the 'cylinders' and 70 corresponds to the 'model_year'
print(ct)
print(f'cylinders = 6; model_year = 70 with iloc: {value_iloc}')
print(f'cylinders = 6; model_year = 70 with loc: {value_loc}')
Output:
model_year 70 71 72 73 74 75 76 77 78 79 80 81 82
cylinders
3 0 0 1 1 0 0 0 1 0 0 1 0 0
4 7 13 14 11 15 12 15 14 17 12 25 21 28
5 0 0 0 0 0 0 0 0 1 1 1 0 0
6 4 8 0 8 7 12 10 5 12 6 2 7 3
8 18 7 13 20 5 6 9 8 6 10 0 1 0
cylinders = 6; model_year = 70 with iloc: 4
cylinders = 6; model_year = 70 with loc: 4
Now it's easy to find that the production of the 6-cylinder engines was very popular between 1975 and 1978. For the 8-cylinder — between 1970 and 1973.
Achieving the same result with .pivot_table() only is going to be a tall order:
df.pivot_table(index = 'cylinders',
columns='model_year',
values='car_name',
aggfunc='count').fillna(0).astype(int)
Output:
model_year 70 71 72 73 74 75 76 77 78 79 80 81 82
cylinders
3 0 0 1 1 0 0 0 1 0 0 1 0 0
4 7 13 14 11 15 12 15 14 17 12 25 21 28
5 0 0 0 0 0 0 0 0 1 1 1 0 0
6 4 8 0 8 7 12 10 5 12 6 2 7 3
8 18 7 13 20 5 6 9 8 6 10 0 1 0
A few words about the parameters. When analyzing a cross table, it is convenient to see the total by columns and rows. For this, the argument margins=True is used and the default totals are marked as All:
pd.crosstab(index=df.cylinders, columns=df.model_year, margins=True)
Output:
model_year 70 71 72 73 74 75 76 77 78 79 80 81 82 All
cylinders
3 0 0 1 1 0 0 0 1 0 0 1 0 0 4
4 7 13 14 11 15 12 15 14 17 12 25 21 28 204
5 0 0 0 0 0 0 0 0 1 1 1 0 0 3
6 4 8 0 8 7 12 10 5 12 6 2 7 3 84
8 18 7 13 20 5 6 9 8 6 10 0 1 0 103
All 29 28 28 40 27 30 34 28 36 29 29 29 31 398
Also, it often happens that you need to know not the absolute, but the relative frequency of objects (in fractions of the total number). Add the argument normalize=True and round to the second decimal place:
pd.crosstab(index=df.cylinders, columns=df.model_year, margins=True, normalize=True).round(2)
Output:
model_year 70 71 72 73 74 75 76 77 78 79 80 81 82 All
cylinders
3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01
4 0.02 0.03 0.04 0.03 0.04 0.03 0.04 0.04 0.04 0.03 0.06 0.05 0.07 0.51
5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01
6 0.01 0.02 0.00 0.02 0.02 0.03 0.03 0.01 0.03 0.02 0.01 0.02 0.01 0.21
8 0.05 0.02 0.03 0.05 0.01 0.02 0.02 0.02 0.02 0.03 0.00 0.00 0.00 0.26
All 0.07 0.07 0.07 0.10 0.07 0.08 0.09 0.07 0.09 0.07 0.07 0.07 0.08 1.00
As you can see from the bottom-right cell, 100% here is the total number of engines by all years and cylinders. It's not very informative. If we explore the cylinder number by production years, it'll be much more interesting to see the part from the total output of each particular engine recorded for all years. To show the total for each index value (each number of cylinders), we should reset the argument normalize toindex (when normalize is set to index, the sum of each row is 1):
pd.crosstab(index=df.cylinders, columns=df.model_year, margins=True, normalize='index').round(2)
Output:
model_year 70 71 72 73 74 75 76 77 78 79 80 81 82
cylinders
3 0.00 0.00 0.25 0.25 0.00 0.00 0.00 0.25 0.00 0.00 0.25 0.00 0.00
4 0.03 0.06 0.07 0.05 0.07 0.06 0.07 0.07 0.08 0.06 0.12 0.10 0.14
5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.33 0.33 0.00 0.00
6 0.05 0.10 0.00 0.10 0.08 0.14 0.12 0.06 0.14 0.07 0.02 0.08 0.04
8 0.17 0.07 0.13 0.19 0.05 0.06 0.09 0.08 0.06 0.10 0.00 0.01 0.00
All 0.07 0.07 0.07 0.10 0.07 0.08 0.09 0.07 0.09 0.07 0.07 0.07 0.08
The sum of the columns disappeared, but the sum of the rows remains. Look at the number in each cell. It indicates the engines with a specific cylinder number (a specific row) by a specific year of production.
normalize='columns'.You can also perform aggregations by specifying the aggfunc and the values parameters, but it's fully analogous to the way df.pivot_table() does it.
Conclusion
In today's topic, we have covered contingency tables, and how to create and adjust them. Here are some key points:
- create contingency tables with
crosstab(); - add margins to the tables for convenience by using
margins=True; - applying relative and absolute frequencies to contingency tables with
normalize=True.