Indexes in pandas DataFrame allow for selecting, filtering, and aggregating data. DataFrames are typically two-dimensional structures, but if the data has a complex structure with multiple levels of categories for each row, multi-index can be used to organize data operations.
This topic covers methods for creating the multi-index and performing operations on it.
The initial steps
Multi-index, also known as the hierarchical index, creates structured indexes consisting of multiple levels. This approach enables the storage and manipulation of data with varying dimensions — where the values of some columns are subcategories of another column's values.
Hierarchical indexes can be used when there are multiple records for the same date, or when there are multiple landmarks within each location. Let's look at two examples that show the usage of hierarchical indexes for dates and locations.
Example with time:
import pandas as pd
df_time = pd.DataFrame(
{
"Date": ["2023-08-01", "2023-08-01", "2023-08-02", "2023-08-02"],
"Product": ["A", "B", "A", "B"],
"Store": ["Store 1", "Store 1", "Store 2", "Store 2"],
"Sales": [100, 150, 200, 120],
}
)
multi_index = pd.MultiIndex.from_frame(df_time[["Product", "Store"]])
df_time.set_index(multi_index, inplace=True)
df_time.drop(["Product", "Store"], axis=1, inplace=True)Here, each row label holds 2 values: 'Product' and 'Store' (products belonging to different stores). Each unique combination of 'Product' and 'Store' identifies the row:
Date Sales
Product Store
A Store 1 2023-08-01 100
B Store 1 2023-08-01 150
A Store 2 2023-08-02 200
B Store 2 2023-08-02 120Example with location:
df_location = pd.DataFrame(
{
"City": [
"Paris",
"Paris",
"Madrid",
"Madrid",
"Rome",
"Rome",
"Berlin",
"Berlin",
],
"Landmark": [
"Eiffel Tower",
"Louvre Museum",
"Prado Museum",
"Puerta del Sol Square",
"Colosseum",
"Pantheon",
"Brandenburg Gate",
"Berlin History Museum",
],
"Visits": [1000000, 800000, 600000, 400000, 1200000, 1000000, 700000, 500000],
}
)
multi_index = pd.MultiIndex.from_frame(df_location[["City", "Landmark"]])
df_location.set_index(multi_index, inplace=True)
df_location.drop(["City", "Landmark"], axis=1, inplace=True)In this case, there are multiple landmarks within each location (e.g., both the Eiffel Tower and Louvre are in Paris), and 'City' and 'Landmark' make the multi-index:
Visits
City Landmark
Paris Eiffel Tower 1000000
Louvre Museum 800000
Madrid Prado Museum 600000
Puerta del Sol Square 400000
Rome Colosseum 1200000
Pantheon 1000000
Berlin Brandenburg Gate 700000
Berlin History Museum 500000Put simply, the multi-index is a way to store multiple labels (or levels) within each index.
Creating a multi-index
There are several methods for creating multi-indexes in pandas. The choice of the method depends on the present data structure.
Array-like structures are collections of data that can be organized as one-dimensional or multidimensional arrays (or similar structures) with element access by index. For example, python lists, NumPy arrays, and pd.Series are all considered to be array-like structures.
We can use several methods to create a multi-index: MultiIndex.from_arrays(), MultiIndex.from_tuples(), MultiIndex.from_product(), MultiIndex.from_frame().
The following data will be used to demonstrate the examples:
list_places = [
["Paris", "Paris", "Madrid", "Madrid", "Rome", "Rome", "Berlin", "Berlin"],
[
"Eiffel Tower",
"Louvre Museum",
"Prado Museum",
"Puerta del Sol Square",
"Colosseum",
"Pantheon",
"Brandenburg Gate",
"Berlin History Museum",
],
]
visits = [1000000, 800000, 600000, 400000, 1200000, 1000000, 700000, 500000]MultiIndex.from_arrays()creates a multi-index from lists of values for each level.
index = pd.MultiIndex.from_arrays(list_places, names=("City", "Landmark"))
dataframe = pd.DataFrame({"Visits": visits}, index=index) Visits
City Landmark
Paris Eiffel Tower 1000000
Louvre Museum 800000
Madrid Prado Museum 600000
Puerta del Sol Square 400000
Rome Colosseum 1200000
Pantheon 1000000
Berlin Brandenburg Gate 700000
Berlin History Museum 500000MultiIndex.from_tuples()creates a multi-index from a list of tuples.
tuples = list(zip(*list_places))
index = pd.MultiIndex.from_tuples(tuples, names=["City", "Landmark"])
dataframe = pd.DataFrame({"Visits": visits}, index=index) Visits
City Landmark
Paris Eiffel Tower 1000000
Louvre Museum 800000
Madrid Prado Museum 600000
Puerta del Sol Square 400000
Rome Colosseum 1200000
Pantheon 1000000
Berlin Brandenburg Gate 700000
Berlin History Museum 500000MultiIndex.from_frame()creates a multi-index from theDataFrame.
index = pd.MultiIndex.from_frame(
pd.DataFrame({"City": list_places[0], "Landmark": list_places[1]})
)
dataframe = pd.DataFrame({"Visits": visits}, index=index) Visits
City Landmark
Paris Eiffel Tower 1000000
Louvre Museum 800000
Madrid Prado Museum 600000
Puerta del Sol Square 400000
Rome Colosseum 1200000
Pantheon 1000000
Berlin Brandenburg Gate 700000
Berlin History Museum 500000MultiIndex.from_product()creates a multi-index from the cartesian product of multiple iterables.
.from_product() creates all possible combinations of values from the given sequences. We use a smaller amount of data for clarity:
index = pd.MultiIndex.from_product(
[["Paris", "Madrid"], ["Eiffel Tower", "Prado Museum"]], names=["City", "Place"]
)
dataframe = pd.DataFrame({"Visits": [1000000, 600000, 1000000, 600000]}, index=index) Visits
City Place
Paris Eiffel Tower 1000000
Prado Museum 600000
Madrid Eiffel Tower 1000000
Prado Museum 600000After creating the multi-index, let's look at the methods for working with them.
Retrieving the values
Multi-index allows you to retrieve values at a specific index level. We can access the names of the indexes with the .names attribute of the index:
dataframe.index.namesAnd we get FrozenList(['City', 'Place']), where 'City' level has an index position of 0, and 'Place' level has an index position of 1.
.get_level_values()method returns a vector of labels with a location at each level. You can pass either the index position or the level name as a string.
Let's take the index variable from the previous example and access the 0th position:
index.get_level_values(0)Which is equivalent to
index.get_level_values('City')Output:
Index(['Paris', 'Paris', 'Madrid', 'Madrid'], dtype='object', name='City')Similarly, we can access the 'Place' level, indexed as 1:
index.get_level_values(1) # ~ index.get_level_values('Place')Output:
Index(['Eiffel Tower', 'Prado Museum', 'Eiffel Tower', 'Prado Museum'], dtype='object', name='Place')remove_unused_levels()allows you to remove unused levels. In this example, the first two levels have been removed.
index[2:].remove_unused_levels()Output:
MultiIndex([('Madrid', 'Eiffel Tower'),
('Madrid', 'Prado Museum')],
names=['City', 'Place'])Index operations
The multi-index can be manipulated via the .set_index() and the .reset_index() methods.
.reset_index()reset the multi-index to a regular pandas index:
dataframe_reset_index = dataframe.reset_index() City Place Visits
0 Paris Eiffel Tower 1000000
1 Paris Prado Museum 600000
2 Madrid Eiffel Tower 1000000
3 Madrid Prado Museum 600000.set_index()sets an index using one or moreDataFramecolumns:
dataframe_reset_index.set_index(['City','Place']) Visits
City Place
Paris Eiffel Tower 1000000
Prado Museum 600000
Madrid Eiffel Tower 1000000
Prado Museum 600000Performing basic queries
Multi-index DataFrames can be queried like the regular-index DataFrames.
.query() calls columns using a boolean expression. A query can be with multiple indexes and multiple conditions.
index = pd.MultiIndex.from_frame(pd.DataFrame({'City':list_places[0], 'Landmark': list_places[1]}))
dataframe = pd.DataFrame({'Visits': visits}, index=index) # re-establish one of the examples
dataframe.query("City == 'Madrid' and Landmark == 'Puerta del Sol Square'") Visits
City Landmark
Madrid Puerta del Sol Square 400000A similar operation can be done using .loc() or .iloc().
.loc() is a method that enables you to retrieve data in a DataFrame using labels for both rows and columns. .iloc(), on the other hand, allows extraction by index position. Both of them can be used to extract the data from a multi-index DataFrame. Let's look at the .loc example:
dataframe.loc[
(dataframe["Visits"] > 800000)
& (dataframe.index.get_level_values("City") != "Paris")
]Output:
Visits
City Landmark
Rome Colosseum 1200000
Pantheon 1000000Request execution time
Let's measure the execution time of queries to the DataFrame with and without a multi-index (note that the execution time might vary, depending on the hardware):
import numpy as np
import timeit
index = pd.MultiIndex.from_tuples([(i, j) for i in range(1000) for j in range(1000)])
data = pd.DataFrame(np.random.rand(1000000, 2), index=index)
def query_data():
return data.loc[(500,)]
# Query execution time with multi-index
start_time_multi = timeit.default_timer()
query_data()
end_time_multi = timeit.default_timer()
time_multi = end_time_multi - start_time_multi
data_reset = data.reset_index()
# Query execution time without multi-index
start_time_reset = timeit.default_timer()
data_reset.loc[data_reset[0] == 500]
end_time_reset = timeit.default_timer()
time_reset = end_time_reset - start_time_reset
print("Query execution time with multi-index:", time_multi)
print("Query execution time without multi-index:", time_reset)Query execution time with multi-index: 0.005945717000031436
Query execution time without multi-index: 0.002980844000376237Multi-index doesn't inherently provide more efficient querying compared to single-index. In general, the main benefits of a multi-index are organizational and conceptual rather than performance-related. They are useful for representing hierarchical data structures more intuitively.
Conclusion
Multi-indexes and related methods provide a set of tools for processing complex data and allow you to analyze, filter, and organize information more efficiently, adapting to specific requirements and data structures. As a result, you are now familiar with the following:
Multi-index provides a tool to organize data with multiple levels of indexing, simplifying work with nested data structures.
Methods for creating the multi-index include
.from_tuples(),.from_arrays(),.from_frame(), and.from_product(), allowing you to choose the method to create a multi-index based on the underlying data structure..get_level_values()helps to extract the values at a specific index level and.remove_unused_levels()removes the unused levels.The multi-index can be manipulated with
.set_index(), which sets columns as the index, and.reset_index(), which converts the index into columns..query(),.loc()and.iloc()provide methods for data selection..query()works based on string expressions,.loc()allows flexible access to data using index labels, and.iloc()gives access based on the index positions.