As you already know, there are two main data structures in pandas: one-dimensional Series and two-dimensional DataFrame. Earlier, you learned to create objects of both types. You also learned that you can transfer data from different files and tables to DataFrames. Now, imagine that you have to work with multiple similar or related datasets and process them in the same way. What's the first thing you could do to ease this task? Perhaps, you would like to combine them. Luckily for you, pandas suggests several ways to do it. In this topic, you will learn how to join DataFrame and Series objects using the concat() and merge() functions.
As usual, don't forget to import pandas!
import pandas as pdConcatenating objects
The concat() function is used to concatenate or glue multiple objects together along a horizontal or vertical axis. To do so, we need to pass multiple Series or DataFrame objects as arguments to the function. But first, we will create two DataFrame instances — the tables that store students' names and their results after running distances of 100 meters and 2 kilometers respectively:
junior_class = pd.DataFrame({'Name': ['Ann', 'Kate', 'George', 'Eric'],
'100m (sec.)': ['16.3', '17.1', '14.8', '14.3'],
'2km (min., sec.)': ['9,24', '9,45', '9,17', '8,14']},
index=[1, 2, 3, 4])
senior_class = pd.DataFrame({'Name': ['Jack', 'Alicia', 'Ella', 'James'],
'100m (sec.)': ['15.9', '17.8', '17.0', '15.0'],
'2km (min., sec.)': ['8,18', '9,02', '8,58', '7,58']})Mind that their indexes are different and independent of each other: the first DataFrame is indexed in the range 1-4 and the second one has the default 0, 1, 2, ... . It is done for illustration purposes: indexing is important in concatenation, so we will show how such differences affect the results and how you can reset the initial values.
Now, we should pass the DataFrames to concat() as a sequence or mapping:
pd.concat([junior_class, senior_class])
# Name 100m (sec.) 2km (min., sec.)
# 1 Ann 16.3 9,24
# 2 Kate 17.1 9,45
# 3 George 14.8 9,17
# 4 Eric 14.3 8,14
# 0 Jack 15.9 8,18
# 1 Alicia 17.8 9,02
# 2 Ella 17.0 8,58
# 3 James 15.0 7,58As you could notice, the second DataFrame was simply added after the first one. What if you want to add them alongside each other? Also, all the original indexes remained. While it makes sense when dealing with meaningful indexes (for instance, years), in our case we would probably like to recalculate them. These and some other issues can be solved by adjusting the following parameters:
axis— the axis along which to concatenate. Possible values are 0 and 1 (default '0'):axis=0stands for combining along rows, andaxis=1is for combining along columns. Let's now set the value ofaxisto 1:pd.concat([junior_class, senior_class], axis=1) # Name 100m (sec.) 2km (min., sec.) Name 100m (sec.) 2km (min., sec.) # 0 NaN NaN NaN Jack 15.9 8,18 # 1 Ann 16.3 9,24 Alicia 17.8 9,02 # 2 Kate 17.1 9,45 Ella 17.0 8,58 # 3 George 14.8 9,17 James 15.0 7,58 # 4 Eric 14.3 8,14 NaN NaN NaNDid you notice that some fields are filled with NaN values? NaN stands for 'Not a number'. That's the way of handling missing values in
pandas: as we already noticed,junior_classandsenior_classare indexed differently. So, every time there are such mismatches, unmatched fields of data will be filled with NaNs.ignore_index— keeping or resetting the original indexes when concatenating. Possible values are 'True' and 'False' (default 'False'). If you want your resultant object to be reordered, specifyignore_index=True. Now, the axis will be labeled with numeric indexes starting with 0:pd.concat([junior_class, senior_class], ignore_index=True) # Name 100m (sec.) 2km (min., sec.) # 0 Ann 16.3 9,24 # 1 Kate 17.1 9,45 # 2 George 14.8 9,17 # 3 Eric 14.3 8,14 # 4 Jack 15.9 8,18 # 5 Alicia 17.8 9,02 # 6 Ella 17.0 8,58 # 7 James 15.0 7,58join— combining with either the 'outer' or 'inner' type of joining (default 'outer'). The outer type of joining returns the union of all objects, that is, all their original rows will be preserved. On the contrary, the inner type includes only the rows that are labeled with indexes present in both datasets, excluding all other rows. Take a look at the following example: rows marked by numbers 1, 2, and 3 are present, but rows marked with 0 (fromsenior_class) and 4 (fromjunior_class) are eliminated:pd.concat([junior_class, senior_class], axis=1, join='inner') # Name 100m (sec.) 2km (min., sec.) Name 100m (sec.) 2km (min., sec.) # 1 Ann 16.3 9,24 Alicia 17.8 9,02 # 2 Kate 17.1 9,45 Ella 17.0 8,58 # 3 George 14.8 9,17 James 15.0 7,58
keys— adding a new level of labels to indicate, for example, from whichSeriesorDataFramethe values come from or group them in some other way. The names should be passed as a sequence, for example, list:pd.concat([junior_class, senior_class], keys=['Jun. class', 'Sen. class']) # Name 100m (sec.) 2km (min., sec.) # Jun. class 1 Ann 16.3 9,24 # 2 Kate 17.1 9,45 # 3 George 14.8 9,17 # 4 Eric 14.3 8,14 # Sen. class 0 Jack 15.9 8,18 # 1 Alicia 17.8 9,02 # 2 Ella 17.0 8,58 # 3 James 15.0 7,58
Merging objects
Compared to concat(), merge() is a more flexible tool for combining that provides possibilities to go a little deeper into the structure of objects. The function is rooted in the idea of the so-called database-style joining — joining on the basis of shared columns.
If you are familiar with SQL and SQL types of joins, the principal functionality of merge() will remind you of the following operations: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN.
The function needs two objects to operate on:
a
DataFrameto join another object to;a
DataFrameor a namedSeriesto merge the first one with (you can specify thenameparameter when creating aSeries).
We will use the DataFrame called junior_class from the previous section. Let's assume that we want to merge it with a new DataFrame containing information about the age of participants. For the sake of illustration, we will include only three members:
age_of_participants = pd.DataFrame({'Name': ['Ann', 'Eric', 'Ella'],
'Age': ['16', '16', '18']})Now, let's just pass both DataFrame objects to merge():
junior_class.merge(age_of_participants)
# Name 100m (sec.) 2km (min., sec.) Age
# 0 Ann 16.3 9,24 16
# 1 Eric 14.3 8,14 16As you can see, we have only two rows. Both junior_class and age_of_participants have the same column 'Name'. So, by default, they are joined on the basis of that one matching key. Only the rows that have overlapping values of 'Name' in both DataFrames are outputted: the information about the age of other participants is unavailable and they will have NaN against their names in the column 'Age', so their results are not included in the final table. This happens because merge() has a parameter similar to join of concat(), and it's by default set to the inner type of joining. Let's look through some other options and parameters to adjust the results:
how— defines the way of merging. Possible values are 'inner', 'outer', 'left', and 'right' (default 'inner'). Above, we saw an example of the inner combining; if we specified 'outer', we would get the union of all the data. Let's see what happens if we writehow='left':junior_class.merge(age_of_participants, how='left') # Name 100m (sec.) 2km (min., sec.) Age # 0 Ann 16.3 9,24 16 # 1 Kate 17.1 9,45 NaN # 2 George 14.8 9,17 NaN # 3 Eric 14.3 8,14 16We can see all entries from the first dataset, even though some values in the column 'Age' are absent. Similarly,
how='right'will get us all rows from the secondDataFrame. For example, the row, corresponding to the name 'Ella', has NaN values in '100m (sec.)' and '2km (min., sec.)', sincejunior_classcontains no information about her scores:junior_class.merge(age_of_participants, how='right') # Name 100m (sec.) 2km (min., sec.) Age # 0 Ann 16.3 9,24 16 # 1 Eric 14.3 8,14 16 # 2 Ella NaN NaN 18on— specifies columns and index level names on which we want to join. By default, the columns with the same names are used as join keys; alternatively, we can pass the value(s) as a string or list, but the names should still be present in both objects we combine. In case we want to merge objects with different column names, we can turn to parametersleft_onandright_on: the leftDataFrameis the one we pass first (junior_classin the example above) and the rightDataFrameis the one to join with (age_of_participantsabove). Accordingly,left_onaccepts names from the leftDataFrame, whereasright_ontakes names we can find in the right one.
Differences
Now that we've looked through the use cases of both functions, let's clarify the key differences between them:
Main use cases. Generally,
concat()is used to simply place several objects next to each other or one on top of another; at the same time,merge()is mainly utilized for database-like joining — its set of parameters makes joining more flexible and more content-aware.The number of objects we can join.
concat()can accept a sequence of multiple objects, whilemerge()allows us to join only twoDataFramesor aDataFrameand a namedSeries.Axes. When using
concat(), you can specify the axis along which you need to join objects;merge()supports only side-by-side joining.Database-like operations.
concat()can perform only either inner or outer join, whilemerge()can do inner, outer, left, and right types of joining.
Conclusions
In this topic, you got acquainted with the concat() and merge() functions. Here are the main things to remember:
concat()is the function used to perform simple concatenation on two or moreDataFrameorSeriesobjects. Its main parameters areaxis,ignore_index,keys, andjoin.merge()is the function aimed mainly at SQL-style of combining on twoDataFramesor aDataFrameand a namedSeries. It accepts such arguments ashow,on,left_onandright_on.Based on the differences in usage, the two functions have differences in their functionality, such as the number of objects to join, the axis, and types of join.
Read more on this topic in Exploring Pandas Library for Python on Hyperskill Blog.