As you already know, pandas is a popular Python library for data manipulation. This topic will introduce you to Series, a basic one-dimensional data structure in pandas.
Series is a building block to the 2D data structure in pandas, DataFrame. The latter is the one you will use the most in practice. So, while it is important to have some idea of Series, in-depth knowledge of its functionality is not necessary.
Before we start, do not forget that you need to import pandas to be able to use all of its functionality, including Series. Note that traditionally, the name of the library is abbreviated as pd in the import statement:
import pandas as pdWhat is Series
Series is a one-dimensional array. For example, here is a Series that stores names of the students in a machine learning class:
You can notice that each element stored in a Series is associated with a label called index. By default, this index is just the sequence 0, 1, 2, ... . However, any custom values can be used. For example, we can store the ages of the students in a Series, and set students' names as row identifiers so that we know to which student every age corresponds:
That's pretty cool, but how do you create a Series object? There are several ways to do it.
Converting other data structures to Series
If the data is already stored in another data structure, you can easily convert it to Series as shown below:
ages_list = [21, 20, 25, 22]
names_list = ['Anna', 'Bob', 'Maria', 'Jack']
ages_series = pd.Series(ages_list, index=names_list, name='Age')
print(ages_series)
# Anna 21
# Bob 20
# Maria 25
# Jack 22
# Name: Age, dtype: int64
Here, we convert a list of students' ages ages_list into a Series object ages_series. We also assign a custom index, students' names stored in the list names_list, using the index keyword. If we fail to provide the values for the index, values 0, 1, 2, 3 would be assigned. Finally, we can also give our Series a name by specifying the optional name parameter.
Similarly, you can convert a Python dictionary into a Series object. Note that dictionary keys (students' names in the example below) will automatically become the indexes in the new Series. Cool, right?
student_ages_dict = {'Anna': 21, 'Bob': 20, 'Maria': 25, 'Jack': 22}
ages_series = pd.Series(student_ages_dict, name='Ages')
print(ages_series)
# Anna 21
# Bob 20
# Maria 25
# Jack 22
# Name: Ages, dtype: int64
You can always change the index later by modifying the index attribute of the Series:
ages_series.index = ['A', 'B', 'M', 'J']
print(ages_series)
# A 21
# B 20
# M 25
# J 22
# Name: Ages, dtype: int64
Of course, the length of the data and index you provide should coincide. Otherwise, an error will occur.
Modifying a Series object
Series is value-mutable; you can easily change the values stored in a Series, for example, by accessing them by index. To illustrate this, let's update Jack's age:
ages_series['Jack'] = 23
print(ages_series)
# Anna 21
# Bob 20
# Maria 25
# Jack 23
# Name: Ages, dtype: int64
At the same time, a Series object is size-immutable; once it's created, no elements can be added to or removed from it without creating a new object (thus, a copy of the object with the applied modification will be returned).
However, you can mutate the Series object by setting the inplace argument of the corresponding modification method to True (note that a copy of an object with the modifications will be created regardless, but it will be re-assigned to the existing Series object automatically).
For example, let's remove Maria's record from our Series. This can be done with the .drop() method.:
new_ages_series = ages_series.drop(index='Maria')
print(new_ages_series)
# Anna 21
# Bob 20
# Jack 23
# Name: Ages, dtype: int64
Note that the original Series remained unchanged:
print(ages_series)
# Anna 21
# Bob 20
# Maria 25
# Jack 23
# Name: Ages, dtype: int64
If you want the returned Series to be automatically assigned to the original one, you can specify the optional inplace parameter to be True :
ages_series.drop(index='Maria', inplace=True)
print(ages_series)
# Anna 21
# Bob 20
# Jack 23
# Name: Ages, dtype: int64
Note that, contrary to the name, inplace=True creates a copy of the object regardless, but just automatically assigns the copy back to the original object. It also tends to introduce unexpected results and doesn't allow to chain methods. There is an issue talking about a possible removal of inplace=True in the future versions of pandas due to these reasons. In general, it's better to avoid the usage of inplace=True.
To add new records to Series, one can explicitly specify the value for the new index. Let's add Maria's record back to the ages_series:
ages_series['Maria'] = 25
print(ages_series)
# Anna 21
# Bob 20
# Jack 23
# Maria 25
# Name: Ages, dtype: int64
If you'd like to add more than one record in a Series, it'll be faster to do this with pd.concat():
new_recs = pd.Series({'Jon': 34, 'Peter': 23, 'Karo': 45, 'Abby': 25})
ages_series = pd.concat([ages_series, new_recs])
print(ages_series)
# Anna 21
# Bob 20
# Jack 23
# Maria 25
# Jon 34
# Peter 23
# Karo 45
# Abby 25
# Name: Ages, dtype: int64Operations on Series
The key feature of pandas is that operations between several Series automatically align the data based on the index.
Let's imagine we have two Series, algebra and calculus, containing students' exam results for Algebra and Calculus courses respectively:
Suppose we want to compute the average score students got for the two exams.
average = 0.5 * (algebra + calculus)
print(average)
# Anna 75.0
# Bob 90.0
# Jack 80.0
# Maria 90.0
# dtype: float64
Note that the order of the students in these two Series is different, and yet the averages are computed correctly. Very convenient!
Alright, but what happens if some indexes from one Series don't exist in the other one? Let's imagine there was also the third exam on Probability, but only Anna and Bob took it. The results are stored in the probability Series:
What will happen if we try to compute the average of the three Series now?
average = 0.33 * (algebra + calculus + probability)
print(average)
# Anna 82.5
# Bob 92.4
# Jack NaN
# Maria NaN
# dtype: float64
As you can see, the result of an operation between unaligned Series has the union of the indexes involved. If a label is not found in one of the operands, the result will be marked as a missing value NaN.
Writing code without doing any explicit data alignment gives you a lot of flexibility in data analysis. The integrated data alignment in pandas is what sets the library apart from the majority of related tools for working with labeled data.
Conclusion
Here is what you learned in this topic:
-
Seriesis a one-dimensional data structure in pandas; -
Seriesstore values along with their labels called indexes; -
Seriesis value-mutable but size-immutable: one can modify values stored in it, but cannot add new values to or remove values from it without a new object being created under the hood (the object can be changed withinplace=True, but a copy will be created anyway); -
when performing operations on several
Seriesobjects, they are automatically aligned on the index.
Read more on this topic in Exploring Pandas Library for Python on Hyperskill Blog.