Sometimes we want to access a piece of information stored in a particular row or column instead of working with the whole DataFrame. pandas has just the right tool for that: indexing, which lets us select a particular subset of a DataFrame or a Series and work with it.
.loc
Before we start, let's import pandas (abbreviated as pd) and create a DataFrame from a dictionary:
import pandas as pd
people = {
    "first_name": ["Michael", "Michael", "Jane", "John"],
    "last_name": ["Jackson", "Jordan", "Doe", "Doe"],
    "email": ["[email protected]", "[email protected]",
              "[email protected]", "[email protected]"],
    "birthday": ["29.09.1958", "17.02.1963", "15.03.1978", "12.05.1979"],
    "height": [1.75, 1.98, 1.64, 1.8]
}
df = pd.DataFrame(people)
df.head()
Here is the output:
first_name last_name email birthday height
0 Michael Jackson [email protected] 29.09.1958 1.75
1 Michael Jordan [email protected] 17.02.1963 1.98
2 Jane Doe [email protected] 15.03.1978 1.64
3 John Doe [email protected] 12.05.1979 1.80
pandas provides two indexers for selecting a subset of rows and columns: .loc and .iloc. The first one stands for locator and is label-based. .iloc stands for integer locator and is integer position-based. Note that neither of them is a method: both are properties that return an indexer object, which is why they are used with square brackets rather than parentheses. Their core syntax is similar:
.loc[<row selection>, <optional column selection>]
.iloc[<row selection>, <optional column selection>]
Let's start with .loc. It can handle integer-based indexes as labels, but for clarity, we will create and name a text index:
df.index = ['first', 'second', 'third', 'fourth']
df.index.name = 'index'
df.head()
Output:
first_name last_name email birthday height
index
first Michael Jackson [email protected] 29.09.1958 1.75
second Michael Jordan [email protected] 17.02.1963 1.98
third Jane Doe [email protected] 15.03.1978 1.64
fourth John Doe [email protected] 12.05.1979 1.80
.loc can take:
a single row label;
a list of row labels;
a slice of row labels;
a result of conditional statements (a boolean array)
We could also pass columns as the second argument in a similar manner: a single label, a list, or a slice.
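As a minimal sketch of the column argument, the snippet below uses a reduced copy of the example data (the email column is left out for brevity, which is an assumption of this sketch) and a bare colon to keep every row. The variable names are illustrative only.

```python
import pandas as pd

# Reduced copy of the example data; the email column is omitted for brevity.
df = pd.DataFrame(
    {
        "first_name": ["Michael", "Michael", "Jane", "John"],
        "last_name": ["Jackson", "Jordan", "Doe", "Doe"],
        "birthday": ["29.09.1958", "17.02.1963", "15.03.1978", "12.05.1979"],
        "height": [1.75, 1.98, 1.64, 1.8],
    },
    index=["first", "second", "third", "fourth"],
)

# A bare colon selects every row; the second argument picks the columns.
one_column = df.loc[:, "height"]                    # single column label
two_columns = df.loc[:, ["first_name", "height"]]   # list of column labels
column_range = df.loc[:, "last_name":"height"]      # slice of labels, end included

print(one_column)
print(two_columns)
print(column_range)
```

A single column label gives back a Series, while a list or a slice of labels gives back a DataFrame.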
If we pass a single row label, pandas will return that row as a Series:
df.loc['third']
Output:
first_name Jane
last_name Doe
email [email protected]
birthday 15.03.1978
height 1.64
Name: third, dtype: object
You can also select a single cell:
df.loc['third', 'last_name']
Output:
'Doe'
As you can see, we got a single cell value back. In this case, it is a Python string (str).
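To make the return types concrete, here is a small sketch that checks what .loc gives back for a single label, a one-element list, and a row and column pair. It assumes a reduced version of the example DataFrame; the variable names are illustrative.

```python
import pandas as pd

# Reduced copy of the example data with the same text index.
df = pd.DataFrame(
    {
        "first_name": ["Michael", "Michael", "Jane", "John"],
        "last_name": ["Jackson", "Jordan", "Doe", "Doe"],
        "height": [1.75, 1.98, 1.64, 1.8],
    },
    index=["first", "second", "third", "fourth"],
)

row = df.loc["third"]                 # single row label: a Series
frame = df.loc[["third"]]             # one-element list: a one-row DataFrame
cell = df.loc["third", "last_name"]   # row label and column label: a scalar

print(type(row).__name__)
print(type(frame).__name__)
print(type(cell).__name__)
```

Wrapping a label in a list is a handy way to keep the result two-dimensional when later code expects a DataFrame.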
To pass a list of labels, we need to do the following:
df.loc[['first','fourth']]
We get the rows labeled first and fourth:
first_name last_name email birthday height
index
first Michael Jackson [email protected] 29.09.1958 1.75
fourth John Doe [email protected] 12.05.1979 1.80
Let's add a list of column labels:
df.loc[['first','fourth'], ['last_name', 'birthday']]
Output:
last_name birthday
index
first Jackson 29.09.1958
fourth Doe 12.05.1979
Note that the first list inside the loc square brackets defines the row selection while the second list defines the column selection.
Here comes a slice of row labels:
df.loc['second':'fourth']
Output (notice how both the beginning ('second') and the end ('fourth') of the slice are included, unlike the usual Python slicing behavior that excludes the end itself):
first_name last_name email birthday height
index
second Michael Jordan [email protected] 17.02.1963 1.98
third Jane Doe [email protected] 15.03.1978 1.64
fourth John Doe [email protected] 12.05.1979 1.80
We can also select rows with a condition and combine it with a column slice:
df.loc[df.birthday == '12.05.1979', 'last_name':'birthday':2]
Output:
last_name birthday
index
fourth Doe 12.05.1979
The first argument selects the rows where the birthday column equals 12.05.1979. The second argument takes the columns from last_name to birthday with a step of 2. That is, it takes every second column, starting from last_name.
Feel free to choose any combination of single values, lists, and slices.
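Conditions can also be combined. The sketch below, which assumes a reduced copy of the example data, uses the element-wise operators & (and), | (or), and ~ (not); each comparison needs its own parentheses because these operators bind more tightly than == and >.

```python
import pandas as pd

# Reduced copy of the example data with the same text index.
df = pd.DataFrame(
    {
        "first_name": ["Michael", "Michael", "Jane", "John"],
        "last_name": ["Jackson", "Jordan", "Doe", "Doe"],
        "height": [1.75, 1.98, 1.64, 1.8],
    },
    index=["first", "second", "third", "fourth"],
)

# Rows where both conditions hold, limited to two columns.
tall_does = df.loc[
    (df["last_name"] == "Doe") & (df["height"] > 1.7),
    ["first_name", "height"],
]

# Rows where at least one condition holds.
michael_or_short = df.loc[(df["first_name"] == "Michael") | (df["height"] < 1.7)]

# Rows where the condition does not hold.
not_doe = df.loc[~(df["last_name"] == "Doe")]

print(tall_does)
print(michael_or_short)
print(not_doe)
```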
.iloc
Now, let's move on to .iloc. The core syntax is the same, but this indexer works with integer positions rather than labels, and it does not accept a boolean Series such as df.birthday == '12.05.1979' as a condition. So, let's switch back to the initial DataFrame by resetting and dropping the label index, since we don't need it anymore:
df.reset_index(drop=True, inplace=True)
df.head()
Output:
first_name last_name email birthday height
0 Michael Jackson [email protected] 29.09.1958 1.75
1 Michael Jordan [email protected] 17.02.1963 1.98
2 Jane Doe [email protected] 15.03.1978 1.64
3 John Doe [email protected] 12.05.1979 1.80
First, let's select the value in the first row and the first column:
df.iloc[0, 0]
We get the top-left cell:
'Michael'
We can also select four inner cells:
df.iloc[[1, 2], [1, 2]]
Output:
last_name email
1 Jordan [email protected]
2 Doe [email protected]
Don't forget about the step! To take every k-th row from position x up to (but not including) position y, use the following syntax: df.iloc[x:y:k, :]. For example, we can list every second row (starting from zero) with this line of code:
df.iloc[::2, :]
Output:
first_name last_name email birthday height
0 Michael Jackson [email protected] 29.09.1958 1.75
2 Jane Doe [email protected] 15.03.1978 1.64
Awesome, isn't it? This technique looks simple if you're already familiar with Python lists.
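The parallel with Python lists can be sketched as follows. The snippet assumes a reduced copy of the example data with the default integer index; the variable names are illustrative.

```python
import pandas as pd

# Reduced copy of the example data with the default 0..3 index.
df = pd.DataFrame(
    {
        "first_name": ["Michael", "Michael", "Jane", "John"],
        "last_name": ["Jackson", "Jordan", "Doe", "Doe"],
        "height": [1.75, 1.98, 1.64, 1.8],
    }
)

last_row = df.iloc[-1]        # negative positions count from the end
middle_rows = df.iloc[1:3]    # positions 1 and 2; the end is excluded
reversed_rows = df.iloc[::-1] # a negative step reverses the row order
last_column = df.iloc[:, -1]  # the same rules apply to columns

# The equivalent list operations behave the same way.
positions = [0, 1, 2, 3]
print(positions[-1], positions[1:3], positions[::-1])

print(last_row)
print(middle_rows)
print(reversed_rows)
print(last_column)
```

Unlike label slices with .loc, position slices with .iloc exclude the end, just like list slices.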
Note that .iloc takes integer positions. This means that if the index is not a sequential 0, 1, 2, ... numbering, .iloc still counts rows by their position and ignores the index values. So if we have a DataFrame with an index like this:
    a  b
10  1  4
0   2  5
20  3  6
df.iloc[0] will still select the first row (with an index of 10):
a 1
b 4
Name: 10, dtype: int64
And df.loc[0] will select the second row (with an index of 0):
a 2
b 5
Name: 0, dtype: int64
Use .loc and .iloc when you want to change a part of a DataFrame.
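The position versus label comparison above can be reproduced with a short sketch. The DataFrame is rebuilt here from the values shown in the small table; the variable names are illustrative.

```python
import pandas as pd

# Integer index that is deliberately not in 0, 1, 2 order.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=[10, 0, 20])

by_position = df.iloc[0]  # first row by position, whose label is 10
by_label = df.loc[0]      # row whose label is 0, which sits in second position

print(by_position)
print(by_label)

# A label that does not exist raises a KeyError with .loc,
# even though position 1 is perfectly valid for .iloc.
try:
    df.loc[1]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

print(missing_label_raised)
```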
To sum up, let's look at the main differences between .loc and .iloc in one table:
| | .loc | .iloc |
|---|---|---|
| Conditional row selection | Yes | No (a boolean Series is not accepted) |
| Takes rows as | Index names | Index integer position |
| Takes columns as | Column names | Column integer position |
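A short sketch of the first table row, assuming a reduced copy of the example data: .loc accepts a boolean Series directly, while .iloc rejects it. As far as the pandas documentation describes it, .iloc does accept a plain boolean array, so converting the mask with to_numpy() is one possible workaround rather than the only approach.

```python
import pandas as pd

# Reduced copy of the example data with the default integer index.
df = pd.DataFrame(
    {
        "first_name": ["Michael", "Michael", "Jane", "John"],
        "height": [1.75, 1.98, 1.64, 1.8],
    }
)

mask = df["height"] > 1.7  # boolean Series aligned with df's index

by_label = df.loc[mask]                 # .loc takes the Series as is
by_position = df.iloc[mask.to_numpy()]  # .iloc takes a plain boolean array

# Passing the Series itself to .iloc is rejected.
error_name = None
try:
    df.iloc[mask]
except (NotImplementedError, ValueError) as exc:
    error_name = type(exc).__name__

print(by_label)
print(by_position)
print(error_name)
```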
Modifying a DataFrame with loc & iloc
Both indexers are not only a convenient way to select a part of a DataFrame: they also let us modify a part of a DataFrame with a single line of code. Let's imagine a situation: to save personal data on a server, users must send you a Data Processing Agreement (DPA). Suppose you didn't get the DPA from Jane and John Doe. Let's update our data:
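df.iloc[2:, 2:5] = "no DPA"
Here is what we'll get (note that height is a float column, so depending on your pandas version, writing a string into it may convert the column to the object dtype, emit a warning, or be rejected):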
first_name last_name email birthday height
0 Michael Jackson [email protected] 29.09.1958 1.75
1 Michael Jordan [email protected] 17.02.1963 1.98
2 Jane Doe no DPA no DPA no DPA
3 John Doe no DPA no DPA no DPA
Conclusion
Now you know how to select subsets based on their integer position with .iloc and based on labels with .loc. Of course, the list of useful selection tools goes on, and you will learn about them in due time. In some cases, .loc with a condition will be the easiest option; in others, basic dot-syntax selection will do. Feel free to experiment!
Read more on this topic in Exploring Pandas Library for Python on Hyperskill Blog.