
Reshaping and Pivot Tables


Data reshaping can simplify computing and analyzing data. With Pandas, it's easy to turn categorical columns into rows and vice versa.

Automatically collected data (for example, a log of user actions or an aircraft flight recorder) usually has a flat structure, also known as "stacked" or "long". It often has an ordinal index, and every column reflects an attribute of the described object or process. That's why the word "features" is used in Machine Learning: the columns contain feature values.
This data is easy to process, but hard to evaluate at first glance.

The structural difference between unstacked and stacked dataframes

Let's start with the most basic reshaping. When you need to quickly transform your data into a readable (and office-management favorite) format, use the .pivot() method.

DataFrame.pivot

Let's consider the simple example below. The flights dataset shows how many people traveled by air each month from 1949 to 1960. It can be imported from GitHub in one line (an internet connection is required):

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/flights.csv')
df.head()

Output:

+----+--------+----------+--------------+
|    |   year | month    |   passengers |
|----+--------+----------+--------------|
|  0 |   1949 | January  |          112 |
|  1 |   1949 | February |          118 |
|  2 |   1949 | March    |          132 |
|  3 |   1949 | April    |          129 |
|  4 |   1949 | May      |          121 |
+----+--------+----------+--------------+

This dataset contains 144 rows, which is quite difficult to take in at once. We can apply the following:

df.pivot(index='year', columns='month', values='passengers')

Output:

+--------+---------+----------+------------+------------+-----------+--------+--------+---------+-------+------------+-----------+-------------+
|  month |   April |   August |   December |   February |   January |   July |   June |   March |   May |   November |   October |   September |
|   year |         |          |            |            |           |        |        |         |       |            |           |             |
|--------+---------+----------+------------+------------+-----------+--------+--------+---------+-------+------------+-----------+-------------|
|   1949 |     129 |      148 |        118 |        118 |       112 |    148 |    135 |     132 |   121 |        104 |       119 |         136 |
|   1950 |     135 |      170 |        140 |        126 |       115 |    170 |    149 |     141 |   125 |        114 |       133 |         158 |
|   1951 |     163 |      199 |        166 |        150 |       145 |    199 |    178 |     178 |   172 |        146 |       162 |         184 |
|   1952 |     181 |      242 |        194 |        180 |       171 |    230 |    218 |     193 |   183 |        172 |       191 |         209 |
|   1953 |     235 |      272 |        201 |        196 |       196 |    264 |    243 |     236 |   229 |        180 |       211 |         237 |
|   1954 |     227 |      293 |        229 |        188 |       204 |    302 |    264 |     235 |   234 |        203 |       229 |         259 |
|   1955 |     269 |      347 |        278 |        233 |       242 |    364 |    315 |     267 |   270 |        237 |       274 |         312 |
|   1956 |     313 |      405 |        306 |        277 |       284 |    413 |    374 |     317 |   318 |        271 |       306 |         355 |
|   1957 |     348 |      467 |        336 |        301 |       315 |    465 |    422 |     356 |   355 |        305 |       347 |         404 |
|   1958 |     348 |      505 |        337 |        318 |       340 |    491 |    435 |     362 |   363 |        310 |       359 |         404 |
|   1959 |     396 |      559 |        405 |        342 |       360 |    548 |    472 |     406 |   420 |        362 |       407 |         463 |
|   1960 |     461 |      606 |        432 |        391 |       417 |    622 |    535 |     419 |   472 |        390 |       461 |         508 |
+--------+---------+----------+------------+------------+-----------+--------+--------+---------+-------+------------+-----------+-------------+

Now we can see them all together and get some insights! For example, summertime flights were popular from the very beginning, and their number grew faster than in the other months.
As you can see, there are a few changes:

  1. The index and columns now have names: "year" and "month".
  2. The "year" values now form the index.
  3. The "month" values have become the columns.
  4. Each passenger count is now located at the intersection of its year and month.

All the methods discussed here are also available as top-level Pandas functions. In that case, the dataframe to be processed is passed as the first argument (data). The following call produces the same result:

pd.pivot(df, index='year', columns='month', values='passengers')

However, if you need to build a pivot table with aggregation, for example, when there are duplicates across the row and column dimensions (one index/columns pair corresponds to more than one value), you should use .pivot_table().
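To see the difference, here is a minimal sketch with a made-up toy dataframe (the sales, city, year, and amount names are invented purely for illustration):

import pandas as pd

# A toy dataframe where the (city, year) pair ('Oslo', 2020) occurs twice
sales = pd.DataFrame({'city':   ['Oslo', 'Oslo', 'Bergen'],
                      'year':   [2020,   2020,   2020],
                      'amount': [10,     20,     30]})

# sales.pivot(index='city', columns='year', values='amount')
# would raise a ValueError here because ('Oslo', 2020) appears twice

# .pivot_table() aggregates the duplicates instead (mean by default): Oslo -> 15
sales.pivot_table(index='city', columns='year', values='amount', aggfunc='mean')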

DataFrame.pivot_table

This method allows you to build pivot tables with data aggregation. If every index/columns pair corresponds to only one value, the results are similar to the .pivot() method. As an example, let's take a slightly more complex dataset called "mpg" (miles per gallon) that contains cars from the past century and their specs.

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv')
df.head()

Output:

+----+-------+-------------+----------------+--------------+----------+----------------+--------------+----------+---------------------------+
|    |   mpg |   cylinders |   displacement |   horsepower |   weight |   acceleration |   model_year | origin   | name                      |
|----+-------+-------------+----------------+--------------+----------+----------------+--------------+----------+---------------------------|
|  0 |    18 |           8 |            307 |          130 |     3504 |           12   |           70 | usa      | chevrolet chevelle malibu |
|  1 |    15 |           8 |            350 |          165 |     3693 |           11.5 |           70 | usa      | buick skylark 320         |
|  2 |    18 |           8 |            318 |          150 |     3436 |           11   |           70 | usa      | plymouth satellite        |
|  3 |    16 |           8 |            304 |          150 |     3433 |           12   |           70 | usa      | amc rebel sst             |
|  4 |    17 |           8 |            302 |          140 |     3449 |           10.5 |           70 | usa      | ford torino               |
+----+-------+-------------+----------------+--------------+----------+----------------+--------------+----------+---------------------------+

Suppose we have to find the mean horsepower of the cars by country of origin and model year. .pivot_table() takes an aggregation function in the aggfunc argument; by default, it computes the mean. We only need to specify the index, columns, and values, and we'll round the results to 1 decimal place for neatness:

df.pivot_table(index='origin', columns='model_year', values='horsepower').round(1)

Output:

+----------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+------+
|model_year|    70 |    71 |    72 |    73 |    74 |    75 |    76 |    77 |    78 |    79 |   80 |   81 |   82 |
|origin    |       |       |       |       |       |       |       |       |       |       |      |      |      |
|----------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+------|
| europe   |  86.2 |  74   |  79.6 |  81.9 |  74.2 |  89.5 |  87.6 |  81   |  99.2 |  72   | 66.8 | 76.7 | 63   |
| japan    |  91.5 |  79.2 |  93.8 |  98.5 |  72.5 |  80.2 |  76.5 |  81.2 |  79.2 |  65   | 78.8 | 78.3 | 74   |
| usa      | 167   | 119.8 | 138.8 | 146.6 | 112.1 | 108.7 | 110.5 | 118.4 | 107.3 | 109.4 | 88.8 | 84.5 | 86.9 |
+----------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+------+

The changes are similar to the .pivot() case, except that at the intersection of an origin country and a model year, we get the mean horsepower for that country and year.
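The aggregation is not limited to the mean. As a quick sketch (using the same mpg dataframe loaded above), you could, for example, count how many car models of each origin appear per year by passing aggfunc='count':

# Count the cars from each origin per model year;
# any column without missing values works as `values` when counting
df.pivot_table(index='origin', columns='model_year',
               values='name', aggfunc='count')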

You may be wondering: "Why on earth do I need those pivot tables when I could do grouping and aggregation without them?!"
To show you why, let's try to answer the following question: "In which year(s) was the median horsepower of Japanese cars lower than that of European ones?" Without a pivot table, the solution could look like this:

# Median horsepower of Japanese cars by model year
japan = df.query("origin == 'japan'")\
          .groupby(['model_year'])\
          .agg({'horsepower': 'median'})
# The same for European cars, merged with the Japanese medians by year
merged_median_hp = df.query("origin == 'europe'")\
                     .groupby(['model_year'])\
                     .agg({'horsepower': 'median'})\
                     .merge(japan,
                            on='model_year',
                            suffixes=['_europe', '_japan'])
# Keep only the years where the European median exceeds the Japanese one
merged_median_hp.loc[merged_median_hp.horsepower_europe > merged_median_hp.horsepower_japan]

Output:

+--------------+---------------------+--------------------+
|              |   horsepower_europe |   horsepower_japan |
|   model_year |                     |                    |
|--------------+---------------------+--------------------|
|           74 |                75   |               66   |
|           75 |                91.5 |               85.5 |
|           76 |                84.5 |               72.5 |
|           77 |                78   |               72.5 |
|           78 |               109   |               82.5 |
|           79 |                71   |               65   |
|           81 |                76   |               71   |
+--------------+---------------------+--------------------+

Here, a separate dataframe is created with the Japanese cars grouped by year; then the European cars are grouped by year and merged with the Japanese ones into another dataframe. After that, the required rows are selected with .loc. But we could have done it with .pivot_table() instead:

# One pivot table replaces the query/groupby/merge chain above
pivot_median_hp = df.pivot_table(index='model_year',
                                 columns='origin',
                                 values='horsepower',
                                 aggfunc='median')
pivot_median_hp.loc[pivot_median_hp.europe > pivot_median_hp.japan,
                    ['europe', 'japan']]

Output:

+--------------+----------+---------+
|       origin |   europe |   japan |
|   model_year |          |         |
|--------------+----------+---------|
|           74 |     75   |    66   |
|           75 |     91.5 |    85.5 |
|           76 |     84.5 |    72.5 |
|           77 |     78   |    72.5 |
|           78 |    109   |    82.5 |
|           79 |     71   |    65   |
|           81 |     76   |    71   |
+--------------+----------+---------+

Pivot and locate. That's brilliant.

Tip: You can create a multi-level index and multi-level columns by passing lists of column names to the index and columns arguments, as in the sketch below.
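For instance, a minimal sketch that groups the rows by both origin and cylinders (both are existing columns of the mpg dataset loaded above):

# Rows are indexed by (origin, cylinders) pairs; columns are model years
df.pivot_table(index=['origin', 'cylinders'],
               columns='model_year',
               values='mpg',
               aggfunc='mean').round(1)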

DataFrame.melt

If our data is pivoted and we want to make it flat, we can use the .melt() method. Let's create a sample wide_df:

wide_df = df.pivot_table(index='origin', columns='model_year', values='horsepower').round(2)
wide_df.reset_index(inplace=True)
wide_df

Output:

+----+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-------+-------+-------+
|    | origin   |     70 |     71 |     72 |     73 |     74 |     75 |     76 |     77 |     78 |     79 |    80 |    81 |    82 |
|----+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-------+-------+-------|
|  0 | europe   |  86.2  |  74    |  79.6  |  81.86 |  74.17 |  89.5  |  87.62 |  81    |  99.17 |  72    | 66.75 | 76.67 | 63    |
|  1 | japan    |  91.5  |  79.25 |  93.8  |  98.5  |  72.5  |  80.25 |  76.5  |  81.17 |  79.25 |  65    | 78.85 | 78.33 | 74    |
|  2 | usa      | 166.95 | 119.84 | 138.78 | 146.62 | 112.14 | 108.7  | 110.5  | 118.39 | 107.27 | 109.43 | 88.83 | 84.54 | 86.95 |
+----+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-------+-------+-------+

Let's transform all the "year" columns into one by calling .melt() with suitable parameters. For the id_vars argument, we set the name of the column that contains an identifier; in our case, it is "origin". We could also set the value_vars argument to the list of columns to unpivot, but by default .melt() takes all the remaining columns, so we simply omit this argument. Then we output the first 10 rows of the resulting dataframe:

wide_df.melt(id_vars='origin').head(10)

Output:

+----+----------+--------------+---------+
|    | origin   |   model_year |   value |
|----+----------+--------------+---------|
|  0 | europe   |           70 |   86.2  |
|  1 | japan    |           70 |   91.5  |
|  2 | usa      |           70 |  166.95 |
|  3 | europe   |           71 |   74    |
|  4 | japan    |           71 |   79.25 |
|  5 | usa      |           71 |  119.84 |
|  6 | europe   |           72 |   79.6  |
|  7 | japan    |           72 |   93.8  |
|  8 | usa      |           72 |  138.78 |
|  9 | europe   |           73 |   81.86 |
+----+----------+--------------+---------+

If we want to take, for example, only the last three columns of the original dataframe and reshape them into one, we pass their names to value_vars:

wide_df.melt(id_vars='origin', value_vars=wide_df.columns[-3:])

Output:

+----+----------+--------------+---------+
|    | origin   |   model_year |   value |
|----+----------+--------------+---------|
|  0 | europe   |           80 |   66.75 |
|  1 | japan    |           80 |   78.85 |
|  2 | usa      |           80 |   88.83 |
|  3 | europe   |           81 |   76.67 |
|  4 | japan    |           81 |   78.33 |
|  5 | usa      |           81 |   84.54 |
|  6 | europe   |           82 |   63    |
|  7 | japan    |           82 |   74    |
|  8 | usa      |           82 |   86.95 |
+----+----------+--------------+---------+

We can use .melt() for collecting and storing data from manually entered tables. Spreadsheets in "wide" format are easier to fill in and can be evaluated at a glance, but for analyzing that data with Pandas, it is better to reshape it.
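As a rough sketch with made-up numbers (the report, product, and quarter names are invented for illustration), a hand-filled wide table can be melted into a long one that is ready for grouping and plotting:

import pandas as pd

# A "wide" table as it might be typed into a spreadsheet
report = pd.DataFrame({'product': ['apples', 'pears'],
                       'Q1': [10, 7],
                       'Q2': [12, 9],
                       'Q3': [15, 8]})

# One row per (product, quarter) pair
long_report = report.melt(id_vars='product', var_name='quarter', value_name='sales')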

Summary

As a result, now you know how to:

  • reshape and evaluate data with pivot tables with .pivot() and .pivot_table(),
  • transform data back to "stacked" format with .melt().

You can quickly evaluate big tables with categorical data, find joint feature frequencies, and transform manually entered tables into a format that is easy to compute with. Every time you get the feeling that you're writing too much code to answer a simple question, ask yourself: "Maybe the shape of my data is awkward and I need to change it?"
