Computer scienceData scienceInstrumentsPandasStoring data with pandas

Combining Data in Pandas

17 minutes read

As you already know, there are two main data structures in pandas: one-dimensional Series and two-dimensional DataFrame. Earlier, you learned to create objects of both types. You also learned that you can transfer data from different files and tables to DataFrames. Now, imagine that you have to work with multiple similar or related datasets and process them in the same way. What's the first thing you could do to ease this task? Perhaps, you would like to combine them. Luckily for you, pandas suggests several ways to do it. In this topic, you will learn how to join DataFrame and Series objects using the concat() and merge() functions.

As usual, don't forget to import pandas!

import pandas as pd

Concatenating objects

The concat() function is used to concatenate or glue multiple objects together along a horizontal or vertical axis. To do so, we need to pass multiple Series or DataFrame objects as arguments to the function. But first, we will create two DataFrame instances — the tables that store students' names and their results after running distances of 100 meters and 2 kilometers respectively:

junior_class = pd.DataFrame({'Name': ['Ann', 'Kate', 'George', 'Eric'],
                             '100m (sec.)': ['16.3', '17.1', '14.8', '14.3'],
                             '2km (min., sec.)': ['9,24', '9,45', '9,17', '8,14']},
                            index=[1, 2, 3, 4])
senior_class = pd.DataFrame({'Name': ['Jack', 'Alicia', 'Ella', 'James'],
                             '100m (sec.)': ['15.9', '17.8', '17.0', '15.0'],
                             '2km (min., sec.)': ['8,18', '9,02', '8,58', '7,58']})

Mind that their indexes are different and independent of each other: the first DataFrame is indexed in the range 1-4 and the second one has the default 0, 1, 2, ... . It is done for illustration purposes: indexing is important in concatenation, so we will show how such differences affect the results and how you can reset the initial values.

Now, we should pass the DataFrames to concat() as a sequence or mapping:

pd.concat([junior_class, senior_class])

#      Name 100m (sec.) 2km (min., sec.)
# 1     Ann        16.3             9,24
# 2    Kate        17.1             9,45
# 3  George        14.8             9,17
# 4    Eric        14.3             8,14
# 0    Jack        15.9             8,18
# 1  Alicia        17.8             9,02
# 2    Ella        17.0             8,58
# 3   James        15.0             7,58

As you could notice, the second DataFrame was simply added after the first one. What if you want to add them alongside each other? Also, all the original indexes remained. While it makes sense when dealing with meaningful indexes (for instance, years), in our case we would probably like to recalculate them. These and some other issues can be solved by adjusting the following parameters:

  • axis — the axis along which to concatenate. Possible values are 0 and 1 (default '0'): axis=0 stands for combining along rows, and axis=1 is for combining along columns. Let's now set the value of axis to 1:

    pd.concat([junior_class, senior_class], axis=1)
    
    #      Name 100m (sec.) 2km (min., sec.)    Name 100m (sec.) 2km (min., sec.)
    # 0     NaN         NaN              NaN    Jack        15.9             8,18
    # 1     Ann        16.3             9,24  Alicia        17.8             9,02
    # 2    Kate        17.1             9,45    Ella        17.0             8,58
    # 3  George        14.8             9,17   James        15.0             7,58
    # 4    Eric        14.3             8,14     NaN         NaN              NaN

    Did you notice that some fields are filled with NaN values? NaN stands for 'Not a number'. That's the way of handling missing values in pandas: as we already noticed, junior_class and senior_class are indexed differently. So, every time there are such mismatches, unmatched fields of data will be filled with NaNs.

  • ignore_index — keeping or resetting the original indexes when concatenating. Possible values are 'True' and 'False' (default 'False'). If you want your resultant object to be reordered, specify ignore_index=True. Now, the axis will be labeled with numeric indexes starting with 0:

    pd.concat([junior_class, senior_class], ignore_index=True)
    
    #      Name 100m (sec.) 2km (min., sec.)
    # 0     Ann        16.3             9,24
    # 1    Kate        17.1             9,45
    # 2  George        14.8             9,17
    # 3    Eric        14.3             8,14
    # 4    Jack        15.9             8,18
    # 5  Alicia        17.8             9,02
    # 6    Ella        17.0             8,58
    # 7   James        15.0             7,58
  • join — combining with either the 'outer' or 'inner' type of joining (default 'outer'). The outer type of joining returns the union of all objects, that is, all their original rows will be preserved. On the contrary, the inner type includes only the rows that are labeled with indexes present in both datasets, excluding all other rows. Take a look at the following example: rows marked by numbers 1, 2, and 3 are present, but rows marked with 0 (from senior_class) and 4 (from junior_class) are eliminated:

    pd.concat([junior_class, senior_class], axis=1, join='inner')
    
    #      Name 100m (sec.) 2km (min., sec.)    Name 100m (sec.) 2km (min., sec.)
    # 1     Ann        16.3             9,24  Alicia        17.8             9,02
    # 2    Kate        17.1             9,45    Ella        17.0             8,58
    # 3  George        14.8             9,17   James        15.0             7,58
  • keys — adding a new level of labels to indicate, for example, from which Series or DataFrame the values come from or group them in some other way. The names should be passed as a sequence, for example, list:

    pd.concat([junior_class, senior_class], keys=['Jun. class', 'Sen. class'])
    
    #                 Name 100m (sec.) 2km (min., sec.)
    # Jun. class 1     Ann        16.3             9,24
    #            2    Kate        17.1             9,45
    #            3  George        14.8             9,17
    #            4    Eric        14.3             8,14
    # Sen. class 0    Jack        15.9             8,18
    #            1  Alicia        17.8             9,02
    #            2    Ella        17.0             8,58
    #            3   James        15.0             7,58

Merging objects

Compared to concat(), merge() is a more flexible tool for combining that provides possibilities to go a little deeper into the structure of objects. The function is rooted in the idea of the so-called database-style joining — joining on the basis of shared columns.

If you are familiar with SQL and SQL types of joins, the principal functionality of merge() will remind you of the following operations: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN.

The function needs two objects to operate on:

  • a DataFrame to join another object to;

  • a DataFrame or a named Series to merge the first one with (you can specify the name parameter when creating a Series).

We will use the DataFrame called junior_class from the previous section. Let's assume that we want to merge it with a new DataFrame containing information about the age of participants. For the sake of illustration, we will include only three members:

age_of_participants = pd.DataFrame({'Name': ['Ann', 'Eric', 'Ella'],
                                    'Age': ['16', '16', '18']})

Now, let's just pass both DataFrame objects to merge():

junior_class.merge(age_of_participants)

#    Name 100m (sec.) 2km (min., sec.) Age
# 0   Ann        16.3             9,24  16
# 1  Eric        14.3             8,14  16

As you can see, we have only two rows. Both junior_class and age_of_participants have the same column 'Name'. So, by default, they are joined on the basis of that one matching key. Only the rows that have overlapping values of 'Name' in both DataFrames are outputted: the information about the age of other participants is unavailable and they will have NaN against their names in the column 'Age', so their results are not included in the final table. This happens because merge() has a parameter similar to join of concat(), and it's by default set to the inner type of joining. Let's look through some other options and parameters to adjust the results:

  • how — defines the way of merging. Possible values are 'inner', 'outer', 'left', and 'right' (default 'inner'). Above, we saw an example of the inner combining; if we specified 'outer', we would get the union of all the data. Let's see what happens if we write how='left':

    junior_class.merge(age_of_participants, how='left')
    
    #      Name 100m (sec.) 2km (min., sec.)  Age
    # 0     Ann        16.3             9,24   16
    # 1    Kate        17.1             9,45  NaN
    # 2  George        14.8             9,17  NaN
    # 3    Eric        14.3             8,14   16

    We can see all entries from the first dataset, even though some values in the column 'Age' are absent. Similarly, how='right' will get us all rows from the second DataFrame. For example, the row, corresponding to the name 'Ella', has NaN values in '100m (sec.)' and '2km (min., sec.)', since junior_class contains no information about her scores:

    junior_class.merge(age_of_participants, how='right')
    
    #    Name 100m (sec.) 2km (min., sec.) Age
    # 0   Ann        16.3             9,24  16
    # 1  Eric        14.3             8,14  16
    # 2  Ella         NaN              NaN  18
  • on — specifies columns and index level names on which we want to join. By default, the columns with the same names are used as join keys; alternatively, we can pass the value(s) as a string or list, but the names should still be present in both objects we combine. In case we want to merge objects with different column names, we can turn to parameters left_on and right_on: the left DataFrame is the one we pass first (junior_class in the example above) and the right DataFrame is the one to join with (age_of_participants above). Accordingly, left_on accepts names from the left DataFrame, whereas right_on takes names we can find in the right one.

Differences

Now that we've looked through the use cases of both functions, let's clarify the key differences between them:

  1. Main use cases. Generally, concat() is used to simply place several objects next to each other or one on top of another; at the same time, merge() is mainly utilized for database-like joining — its set of parameters makes joining more flexible and more content-aware.

  2. The number of objects we can join. concat() can accept a sequence of multiple objects, while merge() allows us to join only two DataFrames or a DataFrame and a named Series.

  3. Axes. When using concat(), you can specify the axis along which you need to join objects; merge() supports only side-by-side joining.

  4. Database-like operations. concat() can perform only either inner or outer join, while merge() can do inner, outer, left, and right types of joining.

Conclusions

In this topic, you got acquainted with the concat() and merge() functions. Here are the main things to remember:

  • concat() is the function used to perform simple concatenation on two or more DataFrame or Series objects. Its main parameters are axis, ignore_index, keys, and join.

  • merge() is the function aimed mainly at SQL-style of combining on two DataFrames or a DataFrame and a named Series. It accepts such arguments as how, on, left_on and right_on.

  • Based on the differences in usage, the two functions have differences in their functionality, such as the number of objects to join, the axis, and types of join.

Read more on this topic in Exploring Pandas Library for Python on Hyperskill Blog.

84 learners liked this piece of theory. 0 didn't like it. What about you?
Report a typo