Introduction to pandas

In practice, data is often stored in the form of a table, for example, an Excel spreadsheet, a CSV file, or an SQL database. Imagine that you need to analyze this tabular data and get some useful insights from it. Let's think about the tasks you will need to perform.

First, you will need to load data from different formats preserving its tabular structure and probably join several tables together. Then, to perform the actual data analysis, you will definitely want to access different columns, rows, and cells of the tables, compute some overall statistics, create pivot tables and maybe even make basic plots. Is there a tool in Python that combines all these functionalities? The answer is yes!

This topic will introduce you to pandas, a powerful open-source library for data manipulation and analysis. You will learn how to install the library and get an idea of its main functionality.

Installing pandas

pandas is not part of the Python standard library, so you might need to install it separately, for example, using pip. Type the following in your command line:

pip install pandas

Note that pandas is built on top of NumPy, which will be installed as well. Besides, there are many optional dependencies. For instance, if you want to use the data visualization functionality of pandas, you need to install matplotlib, a plotting library for Python. You can install those libraries using pip as well.

Once the installation is complete, you will be able to import it in your code. Since pandas is quite a long name, it is commonly abbreviated and imported as pd:

import pandas as pd

Note that new versions of pandas are released once in a while, with new functionalities and fixed bugs. So it would be a good idea to keep an eye on the updates. You can easily upgrade the version of pandas on your machine with this command: pip install --upgrade pandas.
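To see which version is currently installed (for example, to confirm that an upgrade took effect), you can print the version string from Python. A minimal sketch:

```python
import pandas as pd

# __version__ holds the installed pandas version as a string.
print(pd.__version__)
```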

Inside pandas

As stated in its documentation, pandas aims to become "the most powerful and flexible open-source data analysis/manipulation tool available in any language".

The name pandas is derived from 'panel data', a term that is used in statistics and econometrics to refer to data sets containing observations over multiple time periods for the same individuals. This library will be helpful if you are working with tabular data, such as data stored in spreadsheets or databases. Apart from that, pandas offers great support for time series and provides extensive functionality for working with dates, times, and time-indexed data.

With the help of pandas, one can easily perform the most typical data processing steps. In particular, the package makes it convenient to load and save data, as it supports out-of-the-box integration with many commonly used tabular formats, such as .csv and .xlsx, as well as SQL databases.
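As a rough illustration of loading and saving, the sketch below reads CSV text into a table and writes it back out. The data is invented for this example, and an in-memory string stands in for a real file; in practice you would usually pass a file path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content, made up for illustration.
csv_text = "name,age\nAnn,23\nBob,31\n"

# read_csv accepts a path or a file-like object and returns a DataFrame.
students = pd.read_csv(io.StringIO(csv_text))
print(students)

# Without a path, to_csv returns the CSV as a string.
# index=False leaves the row labels out of the output.
csv_out = students.to_csv(index=False)
print(csv_out)
```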

Let's look at what else we can get from pandas!

  • Intuitive merging and joining of data sets allow for easily combining data from different sources, while flexible reshaping tools help construct statistical data summaries.

  • With pandas we can edit, sort, reshape and explore tabular data. For example, you could get all unique values from a column just with one command.

  • Missing values in the data are represented as NaN and can be easily handled with built-in functionality; for example, they can be replaced by some value.

  • If you install matplotlib, a Python plotting library, you can use the pandas built-in plotting functionality to make a basic plot from your data to better understand it.

  • You can get basic statistical information about your data with literally one line of code.

  • pandas can be integrated with machine learning libraries, such as scikit-learn.
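Several of the features listed above can be sketched in a few lines. The tables and column names here are hypothetical, chosen only to show one possible call for each feature:

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration.
students = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cid"]})
details = pd.DataFrame({
    "id": [1, 2, 3],
    "city": ["Oslo", "Rome", "Oslo"],
    "score": [90.0, np.nan, 70.0],  # one missing value
})

# Join the two tables on their shared "id" column.
merged = students.merge(details, on="id")

# Get all unique values of a column with one command.
cities = merged["city"].unique()

# Replace missing values (NaN) with a chosen value.
merged["score"] = merged["score"].fillna(0.0)

# Sort rows by a column, highest score first.
by_score = merged.sort_values("score", ascending=False)

# Basic statistics (count, mean, std, min, quartiles, max) in one line.
summary = merged["score"].describe()
print(summary)
```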

Finally, pandas is open-source software, which makes it very popular in both academic and commercial domains. To discover the full potential of pandas, take a look at the documentation.

Data structures in pandas

The two main data structures of pandas are Series (1D) and DataFrame (2D). You will get more familiar with them in the dedicated topics; here we'll just provide an overview.

Series is a one-dimensional array that stores elements of the same data type.

Each element stored in a Series is associated with a label called an index. By default, this index is just a sequence (0, 1, 2, ...). However, you can use any custom values. For example, when analyzing time series, timestamps are typically set as indexes.

Indexes, as well as automatic and explicit data alignment based on them, are the core of pandas.
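The following sketch shows a default index, a custom index, a timestamp index, and automatic alignment by label. All values are made up for illustration.

```python
import pandas as pd

# Without an explicit index, labels default to 0, 1, 2, ...
prices = pd.Series([10, 20, 30])
print(list(prices.index))

# Custom labels; note the two Series list them in a different order.
sales = pd.Series([5, 7], index=["apple", "pear"])
stock = pd.Series([1, 2], index=["pear", "apple"])

# Arithmetic aligns elements by label, not by position.
total = sales + stock
print(total)

# Timestamps used as an index for time series data.
temperature = pd.Series(
    [1.5, 2.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02"]),
)
second_day = temperature.loc[pd.Timestamp("2024-01-02")]
```

Because alignment uses labels, "apple" is added to "apple" and "pear" to "pear", even though the two Series store them in opposite order.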

DataFrame, in turn, is a two-dimensional data structure that is used to represent tabular data with columns of potentially different data types. You can see DataFrame as a table, each column of which is a Series object. In other words, DataFrame is a container for Series, while Series is a container for scalars.

This is illustrated with the example below. The three Series objects store names, surnames, and ages of students respectively, while the DataFrame combines this information in a single table.

Figure: three Series objects (First name, Family name, and Age) form a DataFrame with three corresponding columns.
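The same idea can be sketched in code. The student values below are hypothetical and are not taken from the illustration; pd.concat with axis=1 is just one of several ways to combine Series into a table.

```python
import pandas as pd

# Three Series; the name of each one becomes a column name.
first_name = pd.Series(["Ann", "Bob"], name="First name")
family_name = pd.Series(["Lee", "Ray"], name="Family name")
age = pd.Series([23, 31], name="Age")

# Place the Series side by side as the columns of one DataFrame.
students = pd.concat([first_name, family_name, age], axis=1)
print(students)

# Selecting a single column gives back a Series.
age_column = students["Age"]
print(type(age_column))
```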

Note that each row in the DataFrame is also associated with an index (which can be numeric or label-based). It is worth mentioning that even though DataFrame is a 2D data structure, one can still use it to represent higher-dimensional data by using the so-called multi-index, that is, assigning several indexes, or labels, to each row.
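As a small sketch of a multi-index, the example below labels each row with both a student and a year, so the table effectively has three dimensions (student, year, column). The names and scores are invented.

```python
import pandas as pd

# Each row gets two labels: a student and a year.
index = pd.MultiIndex.from_tuples(
    [("Ann", 2023), ("Ann", 2024), ("Bob", 2023), ("Bob", 2024)],
    names=["student", "year"],
)
scores = pd.DataFrame({"score": [80, 85, 70, 75]}, index=index)
print(scores)

# Selecting by the first level returns all rows for that student.
ann = scores.loc["Ann"]

# Selecting by both levels and a column returns a single value.
bob_2024 = scores.loc[("Bob", 2024), "score"]
```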

Tables represented as a DataFrame differ slightly from spreadsheets in order to optimize computations. Each Series (that is, each column of a DataFrame) stores data of a single type, which makes operations on it faster. Under the hood, a Series uses NumPy arrays (or their extensions), and the single-type restriction comes from that design.

A note on the usage of the 'same data type'

When we talk about storing data of the same data type, we are not referring to the built-in Python types (such as str or int). A Series can hold both str and int values, but the Series as a whole has a single data type (dtype). If a Series contains both str and int values, its dtype is object. To quote the official documentation:

When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and integers, the resulting array will be of float dtype.
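A quick check of this behavior, as a minimal sketch with made-up values:

```python
import pandas as pd

# A string and an integer together: the Series gets the object dtype.
mixed = pd.Series(["a", 1])
print(mixed.dtype)  # object

# Integers and floats together: everything is stored as float.
numbers = pd.Series([1, 2.5])
print(numbers.dtype)  # float64

# In an object Series, each element keeps its own Python type.
print(type(mixed.iloc[0]), type(mixed.iloc[1]))
```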

In pandas, most operations that add or remove values from a Series return a new Series object rather than changing the existing one. Specifically:

  1. Adding values: combining a Series with new values, for example with pd.concat, creates a new Series that holds the values of the original Series together with the added ones, and the original Series is left as it was. Assigning to a label that does not exist yet does enlarge the Series you assign to, but the underlying arrays have a fixed size, so new storage still has to be allocated for the data.

  2. Removing values: by default, removing values from a Series, for example with drop, also creates a new Series object that contains only the remaining values. pandas provides an option to modify the existing Series object in place when removing values (inplace=True), although this generally still involves creating a copy of the data internally.

Note that this does not make pandas objects immutable: the values stored in a Series or DataFrame can be changed. The point is that operations which change the size of an object typically produce a copy, so the original data remains unchanged unless you explicitly ask otherwise.

For this reason, it is generally not recommended to perform operations in place: they rarely bring a performance benefit, and modifying the original object instead of creating a new one might lead to unexpected results and make the code harder to debug.
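The sketch below shows one way to add and remove values while leaving the original Series untouched. The labels and values are hypothetical, and pd.concat and drop are only two of the available options.

```python
import pandas as pd

original = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Adding a value: concat builds a new Series from the pieces.
extended = pd.concat([original, pd.Series([4], index=["d"])])

# Removing a value: drop returns a new Series by default.
shorter = original.drop("a")

# The original Series still has its three elements.
print(len(original), len(extended), len(shorter))
```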

Conclusion

Here is what you should know about pandas:

  • pandas is a flexible tool for data analysis.

  • pandas is well suited to working with heterogeneous tabular data, where different columns hold different types of values.

  • There are two main data structures in pandas: Series (1D) and DataFrame (2D).

  • Series stores values of the same data type, while columns in a DataFrame can be of different types.

Read more on this topic in Exploring Pandas Library for Python on Hyperskill Blog.
