Computer scienceData scienceInstrumentsPandasStoring data with pandas

Series

11 minutes read

As you already know, pandas is a popular Python library for data manipulation. This topic will introduce you to Series, a basic one-dimensional data structure in pandas.

Series is a building block to the 2D data structure in pandas, DataFrame. The latter is the one you will use the most in practice. So, while it is important to have some idea of Series, in-depth knowledge of its functionality is not necessary.

Before we start, do not forget that you need to import pandas to be able to use all of its functionality, including Series. Note that traditionally, the name of the library is abbreviated as pd in the import statement:

import pandas as pd

What is Series

Series is a one-dimensional array. For example, here is a Series that stores names of the students in a machine learning class:

Pandas series with integer index

You can notice that each element stored in a Series is associated with a label called index. By default, this index is just the sequence 0, 1, 2, ... . However, any custom values can be used. For example, we can store the ages of the students in a Series, and set students' names as row identifiers so that we know to which student every age corresponds:

Pandas series with string index

That's pretty cool, but how do you create a Series object? There are several ways to do it.

Converting other data structures to Series

If the data is already stored in another data structure, you can easily convert it to Series as shown below:

ages_list = [21, 20, 25, 22]
names_list = ['Anna', 'Bob', 'Maria', 'Jack']

ages_series = pd.Series(ages_list, index=names_list, name='Age')
print(ages_series)

# Anna     21
# Bob      20
# Maria    25
# Jack     22
# Name: Age, dtype: int64

Here, we convert a list of students' ages ages_list into a Series object ages_series. We also assign a custom index, students' names stored in the list names_list, using the index keyword. If we fail to provide the values for the index, values 0, 1, 2, 3 would be assigned. Finally, we can also give our Series a name by specifying the optional name parameter.

Similarly, you can convert a Python dictionary into a Series object. Note that dictionary keys (students' names in the example below) will automatically become the indexes in the new Series. Cool, right?

student_ages_dict = {'Anna': 21, 'Bob': 20, 'Maria': 25, 'Jack': 22}

ages_series = pd.Series(student_ages_dict, name='Ages')
print(ages_series)

# Anna     21
# Bob      20
# Maria    25
# Jack     22
# Name: Ages, dtype: int64

You can always change the index later by modifying the index attribute of the Series:

ages_series.index = ['A', 'B', 'M', 'J']
print(ages_series)

# A    21
# B    20
# M    25
# J    22
# Name: Ages, dtype: int64

Of course, the length of the data and index you provide should coincide. Otherwise, an error will occur.

Modifying a Series object

Series is value-mutable; you can easily change the values stored in a Series, for example, by accessing them by index. To illustrate this, let's update Jack's age:

ages_series['Jack'] = 23
print(ages_series)

# Anna     21
# Bob      20
# Maria    25
# Jack     23
# Name: Ages, dtype: int64

At the same time, a Series object is size-immutable; once it's created, no elements can be added to or removed from it without creating a new object (thus, a copy of the object with the applied modification will be returned).

However, you can mutate the Series object by setting the inplace argument of the corresponding modification method to True (note that a copy of an object with the modifications will be created regardless, but it will be re-assigned to the existing Series object automatically).

For example, let's remove Maria's record from our Series. This can be done with the .drop() method.:

new_ages_series = ages_series.drop(index='Maria')
print(new_ages_series)

# Anna    21
# Bob     20
# Jack    23
# Name: Ages, dtype: int64

Note that the original Series remained unchanged:

print(ages_series)

# Anna     21
# Bob      20
# Maria    25
# Jack     23
# Name: Ages, dtype: int64

If you want the returned Series to be automatically assigned to the original one, you can specify the optional inplace parameter to be True :

ages_series.drop(index='Maria', inplace=True)
print(ages_series)

# Anna    21
# Bob     20
# Jack    23
# Name: Ages, dtype: int64

Note that, contrary to the name, inplace=True creates a copy of the object regardless, but just automatically assigns the copy back to the original object. It also tends to introduce unexpected results and doesn't allow to chain methods. There is an issue talking about a possible removal of inplace=True in the future versions of pandas due to these reasons. In general, it's better to avoid the usage of inplace=True.

To add new records to Series, one can explicitly specify the value for the new index. Let's add Maria's record back to the ages_series:

ages_series['Maria'] = 25 
print(ages_series)

# Anna     21
# Bob      20
# Jack     23
# Maria    25
# Name: Ages, dtype: int64

If you'd like to add more than one record in a Series, it'll be faster to do this with pd.concat():

new_recs = pd.Series({'Jon': 34, 'Peter': 23, 'Karo': 45, 'Abby': 25})
ages_series = pd.concat([ages_series, new_recs])
print(ages_series)

# Anna     21
# Bob      20
# Jack     23
# Maria    25
# Jon      34
# Peter    23
# Karo     45
# Abby     25
# Name: Ages, dtype: int64

Operations on Series

The key feature of pandas is that operations between several Series automatically align the data based on the index.

Let's imagine we have two Series, algebra and calculus, containing students' exam results for Algebra and Calculus courses respectively:

Pandas series containing scores in algebra and calculus courses

Suppose we want to compute the average score students got for the two exams.

average = 0.5 * (algebra + calculus)
print(average)

# Anna     75.0
# Bob      90.0
# Jack     80.0
# Maria    90.0
# dtype: float64

Note that the order of the students in these two Series is different, and yet the averages are computed correctly. Very convenient!

Alright, but what happens if some indexes from one Series don't exist in the other one? Let's imagine there was also the third exam on Probability, but only Anna and Bob took it. The results are stored in the probability Series:

Pandas series containing scores in the probability course

What will happen if we try to compute the average of the three Series now?

average = 0.33 * (algebra + calculus + probability)
print(average)

# Anna     82.5
# Bob      92.4
# Jack      NaN
# Maria     NaN
# dtype: float64

As you can see, the result of an operation between unaligned Series has the union of the indexes involved. If a label is not found in one of the operands, the result will be marked as a missing value NaN.

Writing code without doing any explicit data alignment gives you a lot of flexibility in data analysis. The integrated data alignment in pandas is what sets the library apart from the majority of related tools for working with labeled data.

Conclusion

Here is what you learned in this topic:

Series is a one-dimensional data structure in pandas;
Series store values along with their labels called indexes;
Series is value-mutable but size-immutable: one can modify values stored in it, but cannot add new values to or remove values from it without a new object being created under the hood (the object can be changed with inplace=True, but a copy will be created anyway);
when performing operations on several Series objects, they are automatically aligned on the index.

Read more on this topic in Exploring Pandas Library for Python on Hyperskill Blog.

167 learners liked this piece of theory. 2 didn't like it. What about you?

Report a typo