Computer scienceData scienceInstrumentsPandasData preprocessing with pandas

Conditional selection

6 minutes read

Sometimes, we may want to access a piece of information stored in a particular row or a column instead of working with a whole DataFrame. The good news is that Pandas allows us to do it. It is called indexing; we can select a particular subset of aDataFrame or a Series to work with it.

Selecting

Before we start, let's import pandas (abbreviated as pd) and create a DataFrame from a dictionary:

import pandas as pd

people = {
    "first_name": ["Michael", "Michael", 'Jane', 'John'], 
    "last_name": ["Jackson", "Jordan", 'Doe', 'Doe'], 
    "email": ["[email protected]", "[email protected]", '[email protected]', '[email protected]'],
    "birthday": ["29.09.1958", "17.02.1963", "15.03.1978", "12.05.1979"],
    "height": [1.75, 1.98, 1.64, 1.8]
}
df = pd.DataFrame(people)
df.head()

Output:

+----+--------------+-------------+--------------------+------------+----------+
|    | first_name   | last_name   | email              | birthday   |   height |
|----+--------------+-------------+--------------------+------------+----------|
|  0 | Michael      | Jackson     | [email protected] | 29.09.1958 |     1.75 |
|  1 | Michael      | Jordan      | [email protected]  | 17.02.1963 |     1.98 |
|  2 | Jane         | Doe         | [email protected]  | 15.03.1978 |     1.64 |
|  3 | John         | Doe         | [email protected]  | 12.05.1979 |     1.8  |
+----+--------------+-------------+--------------------+------------+----------+

We can select any subset of the DataFrame, for example, a single column:

df['email']

Output:

0    [email protected]
1     [email protected]
2     [email protected]
3     [email protected]
Name: email, dtype: object

Now we have a Pandas series with e-mails.

It's also possible to use df.email. It's called a dot syntax. We can use it for column names without spaces.

We can also select a list of columns. A Python list requires additional square brackets:

df[['first_name', 'last_name']]

Output:

+----+--------------+-------------+
|    | first_name   | last_name   |
|----+--------------+-------------|
|  0 | Michael      | Jackson     |
|  1 | Michael      | Jordan      |
|  2 | Jane         | Doe         |
|  3 | John         | Doe         |
+----+--------------+-------------+

Now we got a new DataFrame, consisting of these two columns.

This command always returns a DataFrame, so, even if you select a list that consists of one column, you will return a DataFrame:

df[['last_name']]

Output:

+----+-------------+
|    | last_name   |
|----+-------------|
|  0 | Jackson     |
|  1 | Jordan      |
|  2 | Doe         |
|  3 | Doe         |
+----+-------------+

Conditionals

If we need to introduce a condition to refine our choice, we can include it in square brackets:

df[df.height < 1.8]

Output:

+----+--------------+-------------+--------------------+------------+----------+
|    | first_name   | last_name   | email              | birthday   |   height |
|----+--------------+-------------+--------------------+------------+----------|
|  0 | Michael      | Jackson     | [email protected] | 29.09.1958 |     1.75 |
|  2 | Jane         | Doe         | [email protected]  | 15.03.1978 |     1.64 |
+----+--------------+-------------+--------------------+------------+----------+

As a result, we returned all columns and rows where the height value is less than 1.8. Note that it's possible to use square brackets inside this statement, for example, df[df['height'] < 1.8], but we don't need to do it here, as the column name doesn't contain any whitespaces.

If we need to combine several conditions, we use the following Boolean operators:

  • & for "and"
  • |(vertical line) for "or"
  • ~ for "not"
  • >, <, >=, <=, ==, != for statement comparing.

Please, don't forget about the parentheses:

df[(df.first_name == 'Michael') & (df.birthday == '17.02.1963')]

Output:

+----+--------------+-------------+-------------------+------------+----------+
|    | first_name   | last_name   | email             | birthday   |   height |
|----+--------------+-------------+-------------------+------------+----------|
|  1 | Michael      | Jordan      | [email protected] | 17.02.1963 |     1.98 |
+----+--------------+-------------+-------------------+------------+----------+

As a result, we returned the rows that match our pre-set condition: "the first name is Michael, birthday is on 17.02.1963".

One more example with a bit more complex condition. Let's define the first name as Michael or John, height equal to or more than 1.8, and the last name should not be Jordan:

df[((df.first_name == 'Michael') | (df.first_name == 'John'))
   & (df.height >= 1.8)
   & (df.last_name != 'Jordan')]

Output:

+----+--------------+-------------+-------------------+------------+----------+
|    | first_name   | last_name   | email             | birthday   |   height |
|----+--------------+-------------+-------------------+------------+----------|
|  3 | John         | Doe         | [email protected] | 12.05.1979 |      1.8 |
+----+--------------+-------------+-------------------+------------+----------+

Note, that we don't use the line break character \, since all new lines are put inside the square brackets.

If we want to make our filtering exclusive, in other words, to select everything except the indicated parameters, we can add a tilde character ~ and extra parenthesis:

df[~(((df.first_name == 'Michael') | (df.first_name == 'John'))
   & (df.height >= 1.8)
   & (df.last_name != 'Jordan'))]

Output:

+----+--------------+-------------+--------------------+------------+----------+
|    | first_name   | last_name   | email              | birthday   |   height |
|----+--------------+-------------+--------------------+------------+----------|
|  0 | Michael      | Jackson     | [email protected] | 29.09.1958 |     1.75 |
|  1 | Michael      | Jordan      | [email protected]  | 17.02.1963 |     1.98 |
|  2 | Jane         | Doe         | [email protected]  | 15.03.1978 |     1.64 |
+----+--------------+-------------+--------------------+------------+----------+

Voila!

Conclusion

Now you know how to select:

  • Separate columns or column lists with df[...]
  • Separate columns by using dot syntax
  • DataFrame subsets that fall under a set condition

Of course, the list of various methods goes on. Feel free to experiment.

43 learners liked this piece of theory. 1 didn't like it. What about you?
Report a typo