Computer scienceData scienceInstrumentsPandasData analysis with pandas

MultiIndex dataframes

5 minutes read

Indexes in pandas DataFrame allow for selecting, filtering, and aggregating data. DataFrames are typically two-dimensional structures, but if the data has a complex structure with multiple levels of categories for each row, multi-index can be used to organize data operations.

This topic covers methods for creating the multi-index and performing operations on it.

The initial steps

Multi-index, also known as the hierarchical index, creates structured indexes consisting of multiple levels. This approach enables the storage and manipulation of data with varying dimensions — where the values of some columns are subcategories of another column's values.

Hierarchical indexes can be used when there are multiple records for the same date, or when there are multiple landmarks within each location. Let's look at two examples that show the usage of hierarchical indexes for dates and locations.

Example with time:

import pandas as pd

df_time = pd.DataFrame(
    {
        "Date": ["2023-08-01", "2023-08-01", "2023-08-02", "2023-08-02"],
        "Product": ["A", "B", "A", "B"],
        "Store": ["Store 1", "Store 1", "Store 2", "Store 2"],
        "Sales": [100, 150, 200, 120],
    }
)

multi_index = pd.MultiIndex.from_frame(df_time[["Product", "Store"]])
df_time.set_index(multi_index, inplace=True)
df_time.drop(["Product", "Store"], axis=1, inplace=True)

Here, each row label holds 2 values: 'Product' and 'Store' (products belonging to different stores). Each unique combination of 'Product' and 'Store' identifies the row:

                Date	        Sales
Product	Store		
A	Store 1	2023-08-01	100
B	Store 1	2023-08-01	150
A	Store 2	2023-08-02	200
B	Store 2	2023-08-02	120

Example with location:

df_location = pd.DataFrame(
    {
        "City": [
            "Paris",
            "Paris",
            "Madrid",
            "Madrid",
            "Rome",
            "Rome",
            "Berlin",
            "Berlin",
        ],
        "Landmark": [
            "Eiffel Tower",
            "Louvre Museum",
            "Prado Museum",
            "Puerta del Sol Square",
            "Colosseum",
            "Pantheon",
            "Brandenburg Gate",
            "Berlin History Museum",
        ],
        "Visits": [1000000, 800000, 600000, 400000, 1200000, 1000000, 700000, 500000],
    }
)

multi_index = pd.MultiIndex.from_frame(df_location[["City", "Landmark"]])
df_location.set_index(multi_index, inplace=True)
df_location.drop(["City", "Landmark"], axis=1, inplace=True)

In this case, there are multiple landmarks within each location (e.g., both the Eiffel Tower and Louvre are in Paris), and 'City' and 'Landmark' make the multi-index:

		                Visits
City	Landmark	
Paris	Eiffel Tower	        1000000
        Louvre Museum	        800000
Madrid	Prado Museum	        600000
        Puerta del Sol Square	400000
Rome	Colosseum	        1200000
        Pantheon	        1000000
Berlin	Brandenburg Gate	700000
        Berlin History Museum	500000

Put simply, the multi-index is a way to store multiple labels (or levels) within each index.

Creating a multi-index

There are several methods for creating multi-indexes in pandas. The choice of the method depends on the present data structure.

Array-like structures are collections of data that can be organized as one-dimensional or multidimensional arrays (or similar structures) with element access by index. For example, python lists, NumPy arrays, and pd.Series are all considered to be array-like structures.

We can use several methods to create a multi-index: MultiIndex.from_arrays(), MultiIndex.from_tuples(), MultiIndex.from_product(), MultiIndex.from_frame().

The following data will be used to demonstrate the examples:

list_places = [
    ["Paris", "Paris", "Madrid", "Madrid", "Rome", "Rome", "Berlin", "Berlin"],
    [
        "Eiffel Tower",
        "Louvre Museum",
        "Prado Museum",
        "Puerta del Sol Square",
        "Colosseum",
        "Pantheon",
        "Brandenburg Gate",
        "Berlin History Museum",
    ],
]

visits = [1000000, 800000, 600000, 400000, 1200000, 1000000, 700000, 500000]
  • MultiIndex.from_arrays() creates a multi-index from lists of values for each level.

index = pd.MultiIndex.from_arrays(list_places, names=("City", "Landmark"))
dataframe = pd.DataFrame({"Visits": visits}, index=index)
		                Visits
City	Landmark	
Paris	Eiffel Tower	        1000000
        Louvre Museum	        800000
Madrid	Prado Museum	        600000
        Puerta del Sol Square	400000
Rome	Colosseum	        1200000
        Pantheon	        1000000
Berlin	Brandenburg Gate	700000
        Berlin History Museum	500000
  • MultiIndex.from_tuples()creates a multi-index from a list of tuples.

tuples = list(zip(*list_places))
index = pd.MultiIndex.from_tuples(tuples, names=["City", "Landmark"])
dataframe = pd.DataFrame({"Visits": visits}, index=index)
		                Visits
City	Landmark	
Paris	Eiffel Tower	        1000000
        Louvre Museum	        800000
Madrid	Prado Museum	        600000
        Puerta del Sol Square	400000
Rome	Colosseum	        1200000
        Pantheon	        1000000
Berlin	Brandenburg Gate	700000
        Berlin History Museum	500000
  • MultiIndex.from_frame() creates a multi-index from the DataFrame.

index = pd.MultiIndex.from_frame(
    pd.DataFrame({"City": list_places[0], "Landmark": list_places[1]})
)
dataframe = pd.DataFrame({"Visits": visits}, index=index)
		                Visits
City	Landmark	
Paris	Eiffel Tower	        1000000
        Louvre Museum	        800000
Madrid	Prado Museum	        600000
        Puerta del Sol Square	400000
Rome	Colosseum	        1200000
        Pantheon	        1000000
Berlin	Brandenburg Gate	700000
        Berlin History Museum	500000
  • MultiIndex.from_product() creates a multi-index from the cartesian product of multiple iterables.

.from_product() creates all possible combinations of values from the given sequences. We use a smaller amount of data for clarity:

index = pd.MultiIndex.from_product(
    [["Paris", "Madrid"], ["Eiffel Tower", "Prado Museum"]], names=["City", "Place"]
)
dataframe = pd.DataFrame({"Visits": [1000000, 600000, 1000000, 600000]}, index=index)
		        Visits
City	Place	
Paris	Eiffel Tower    1000000
        Prado Museum	600000
Madrid	Eiffel Tower	1000000
        Prado Museum	600000

After creating the multi-index, let's look at the methods for working with them.

Retrieving the values

Multi-index allows you to retrieve values at a specific index level. We can access the names of the indexes with the .names attribute of the index:

dataframe.index.names

And we get FrozenList(['City', 'Place']), where 'City' level has an index position of 0, and 'Place' level has an index position of 1.

  • .get_level_values() method returns a vector of labels with a location at each level. You can pass either the index position or the level name as a string.

Let's take the index variable from the previous example and access the 0th position:

index.get_level_values(0)

Which is equivalent to

index.get_level_values('City')

Output:

Index(['Paris', 'Paris', 'Madrid', 'Madrid'], dtype='object', name='City')

Similarly, we can access the 'Place' level, indexed as 1:

index.get_level_values(1) # ~ index.get_level_values('Place')

Output:

Index(['Eiffel Tower', 'Prado Museum', 'Eiffel Tower', 'Prado Museum'], dtype='object', name='Place')
  • remove_unused_levels() allows you to remove unused levels. In this example, the first two levels have been removed.

index[2:].remove_unused_levels()

Output:

MultiIndex([('Madrid', 'Eiffel Tower'),
            ('Madrid', 'Prado Museum')],
           names=['City', 'Place'])

Index operations

The multi-index can be manipulated via the .set_index() and the .reset_index() methods.

  • .reset_index() reset the multi-index to a regular pandas index:

dataframe_reset_index = dataframe.reset_index()
     City         Place   Visits
0   Paris  Eiffel Tower  1000000
1   Paris  Prado Museum   600000
2  Madrid  Eiffel Tower  1000000
3  Madrid  Prado Museum   600000
  • .set_index() sets an index using one or more DataFrame columns:

dataframe_reset_index.set_index(['City','Place'])
                      Visits
City   Place                
Paris  Eiffel Tower  1000000
       Prado Museum   600000
Madrid Eiffel Tower  1000000
       Prado Museum   600000

Performing basic queries

Multi-index DataFrames can be queried like the regular-index DataFrames.

.query() calls columns using a boolean expression. A query can be with multiple indexes and multiple conditions.

index = pd.MultiIndex.from_frame(pd.DataFrame({'City':list_places[0], 'Landmark': list_places[1]}))
dataframe = pd.DataFrame({'Visits': visits}, index=index) # re-establish one of the examples

dataframe.query("City == 'Madrid' and Landmark == 'Puerta del Sol Square'")
		                Visits
City	Landmark	
Madrid	Puerta del Sol Square	400000

A similar operation can be done using .loc() or .iloc().

.loc() is a method that enables you to retrieve data in a DataFrame using labels for both rows and columns. .iloc(), on the other hand, allows extraction by index position. Both of them can be used to extract the data from a multi-index DataFrame. Let's look at the .loc example:

dataframe.loc[
    (dataframe["Visits"] > 800000)
    & (dataframe.index.get_level_values("City") != "Paris")
]

Output:

                        Visits
City	Landmark	
Rome	Colosseum	1200000
        Pantheon	1000000

Request execution time

Let's measure the execution time of queries to the DataFrame with and without a multi-index (note that the execution time might vary, depending on the hardware):

import numpy as np
import timeit

index = pd.MultiIndex.from_tuples([(i, j) for i in range(1000) for j in range(1000)])
data = pd.DataFrame(np.random.rand(1000000, 2), index=index)


def query_data():
    return data.loc[(500,)]


# Query execution time with multi-index
start_time_multi = timeit.default_timer()
query_data()
end_time_multi = timeit.default_timer()
time_multi = end_time_multi - start_time_multi

data_reset = data.reset_index()

# Query execution time without multi-index
start_time_reset = timeit.default_timer()
data_reset.loc[data_reset[0] == 500]
end_time_reset = timeit.default_timer()
time_reset = end_time_reset - start_time_reset

print("Query execution time with multi-index:", time_multi)
print("Query execution time without multi-index:", time_reset)
Query execution time with multi-index: 0.005945717000031436
Query execution time without multi-index: 0.002980844000376237

Multi-index doesn't inherently provide more efficient querying compared to single-index. In general, the main benefits of a multi-index are organizational and conceptual rather than performance-related. They are useful for representing hierarchical data structures more intuitively.

Conclusion

Multi-indexes and related methods provide a set of tools for processing complex data and allow you to analyze, filter, and organize information more efficiently, adapting to specific requirements and data structures. As a result, you are now familiar with the following:

  • Multi-index provides a tool to organize data with multiple levels of indexing, simplifying work with nested data structures.

  • Methods for creating the multi-index include .from_tuples(), .from_arrays(), .from_frame(), and .from_product(), allowing you to choose the method to create a multi-index based on the underlying data structure.

  • .get_level_values() helps to extract the values at a specific index level and .remove_unused_levels() removes the unused levels.

  • The multi-index can be manipulated with .set_index(), which sets columns as the index, and .reset_index(), which converts the index into columns.

  • .query() , .loc() and .iloc() provide methods for data selection. .query() works based on string expressions, .loc() allows flexible access to data using index labels, and .iloc() gives access based on the index positions.

How did you like the theory?
Report a typo