Computer scienceData scienceInstrumentsPandasStoring data with pandas

Reading data in pandas

8 minutes read

Reading data is one of the first steps in any machine learning pipeline. In this topic, we will learn how to properly read different types of data in Pandas.

Type of data to read: excel, csv, json, sql

In Pandas, you can read different types of data from various sources, such as Excel files, CSV files, JSON files, and SQL databases. To read files of different types, you need to add the file extension to the read function. For example, to read csv files, you should add csv to read_: pandas.read_csv() and so on. All of the files will be loaded as DataFrames.

import pandas as pd

# EXCEL 
df = pd.read_excel('file_path.xlsx')

# CSV 
df = pd.read_csv('file_path.csv')

# JSON
df = pd.read_json('file_path.json')

To read data from a SQL database, you'll need the pandas library and the sqlalchemy library installed. You can use the pd.read_sql() function with a SQL query to fetch data from the database.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('database_connection_string')
query = "SELECT * FROM table_name"
df = pd.read_sql(query, engine)

Pandas read exploring: .head(), .info(), .describe()

First, we create an example of a pandas frame with information about cars: their names, prices, and whether they were broken.

import pandas as pd

data = {
    'Car Name': ['Toyota Corolla', 'Honda Civic', 'Ford Mustang', 'Chevrolet Camaro', 'Tesla Model 3', 'BMW X5', 'Audi A4'],
    'Price': [25000, 22000, 35000, 40000, 50000, 60000, 30000],
    'Is Broken': [False, True, True, False, True, False, False]
}

df_cars = pd.DataFrame(data)
           Car Name  Price  Is Broken
0   Toyota Corolla  25000      False
1      Honda Civic  22000       True
2     Ford Mustang  35000       True
3  Chevrolet Camaro  40000      False
4    Tesla Model 3  50000       True
5           BMW X5  60000      False
6          Audi A4  30000      False

When you first get your DataFrame, you need to understand the information present in it. There are three main methods to do it: .head(), .info(), .describe().

  • The .head() method in Pandas is a useful method to preview the first few rows of a DataFrame. It is commonly used to get a quick overview of the data and understand its structure, column names, and the actual values in the DataFrame. You can pass the number of rows to display as a parameter n. By default, you will display only 5 rows.

    df_cars.head(n=3) # display the 3 rows 
             Car Name  Price  Is Broken
    0  Toyota Corolla  25000      False
    1     Honda Civic  22000       True
    2    Ford Mustang  35000       True
  • The .info() method in Pandas is used to display information about a DataFrame: data types of each column, the number of non-null values, memory usage, and more.

    df_cars.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 7 entries, 0 to 6
    Data columns (total 3 columns):
     #   Column     Non-Null Count  Dtype 
    ---  ------     --------------  ----- 
     0   Car Name   7 non-null      object
     1   Price      7 non-null      int64 
     2   Is Broken  7 non-null      bool  
    dtypes: bool(1), int64(1), object(1)
    memory usage: 247.0+ bytes

    When we called df.info(), it provided the following information:

    • <class 'pandas.core.frame.DataFrame'>: Indicates that df is a DataFrame.

    • RangeIndex: 7 entries, 0 to 6: Shows that the DataFrame has a range index from 0 to 6, indicating 7 rows.

    • Data columns (total 3 columns): Indicates that the DataFrame has 3 columns.

    • Column: Lists the column names ('Car Name', 'Price', and 'Is Broken').

    • Non-Null Count: Shows the number of non-null values for each column. In our example, there are no null values.

    • Dtype: Specifies the data type of each column. 'Car Name' is of type 'object' (usually representing strings), 'Price' is of type 'int64' (integer), and 'Is Broken' is of type bool. Note that sometimes columns can be inferred in the wrong way, so further type inspection and conversion might be required. You can set the types of the columns in dtype argument while reading the file. Alternatively, you can use df.as_type() after loading the DataFrame to specify the correct types.

    • memory usage: Displays the memory usage of the DataFrame.

  • The .describe() method in Pandas is used to generate a summary of statistics for numerical columns in a DataFrame. You can get the mean, count, std, first, second, and third quartiles, and other statistics of the data.

    df_cars.describe()
                  Price
    count      7.000000
    mean   37428.571429
    std    13709.572603
    min    22000.000000
    25%    27500.000000
    50%    35000.000000
    75%    45000.000000
    max    60000.000000

Data file options: sep (delimiter), header, index_col

Using all of the data types, you can specify the way to read them with such parameters as sep, header and index_col . Let's download a csv file with information about different cars. The link to the file.

df_cars = pd.read_csv('./cars.csv')
  • sep (equal to delimiter): The sep parameter is used when reading files with custom delimiters. For example, if you have a CSV file with a different delimiter like a semicolon ';', you can specify it using the sep parameter. In our example, columns are separated using ','. We can set it as a parameter.

    df_cars = pd.read_csv('./cars.csv', sep=',')

    Note that using a wrong sep can lead to the wrong reading of the file. For instance, if we use ';' as a delimiter, we will get only one column with all values merged.

    df_cars = pd.read_csv('./cars.csv', delimiter=';')
      Car Name,Price,Condition,Year,Fuel Type
    0  Toyota Corolla,25000,New,2023,Gasoline
    1    Honda Civic,22000,Used,2021,Gasoline
  • The header parameter allows you to specify which row in the data file should be considered as the header (column names). By default, it is set to 0, indicating the first row as the header. Let's use the third row as a header.

    df_cars = pd.read_csv('./cars.csv', header=3)
           Ford Mustang  35000   New  2023  Gasoline
    0  Chevrolet Camaro  40000  Used  2020  Gasoline
    1     Tesla Model 3  50000   New  2023  Electric
  • The index_col parameter specifies which column should be used as the index of the DataFrame. It can take an integer or column name as a value. In our example, we can set the Car Name as an index column.

    df_cars = pd.read_csv('./cars.csv', index_col='Car Name')
                    Price Condition  Year Fuel Type
    Car Name                                       
    Toyota Corolla  25000       New  2023  Gasoline
    Honda Civic     22000      Used  2021  Gasoline

Polars library to read data

In our previous example, we used the Pandas library, however, it can be inefficient to work with large files. One alternative is Polars. Polars is a fast and efficient DataFrame library for data manipulation, similar to Pandas. It can handle large datasets with excellent performance. All of the functions and methods are very similar to Pandas. First, you should install the Polars library.

pip install polars
df_cars = pl.read_csv('./cars.csv')

Note that you will see also the type of data of each column when displaying the DataFrame.

shape: (3, 5)
┌────────────────┬───────┬───────────┬──────┬───────────┐
│ Car Name       ┆ Price ┆ Condition ┆ Year ┆ Fuel Type │
│ ---            ┆ ---   ┆ ---       ┆ ---  ┆ ---       │
│ str            ┆ i64   ┆ str       ┆ i64  ┆ str       │
╞════════════════╪═══════╪═══════════╪══════╪═══════════╡
│ Toyota Corolla ┆ 25000 ┆ New       ┆ 2023 ┆ Gasoline  │
│ Honda Civic    ┆ 22000 ┆ Used      ┆ 2021 ┆ Gasoline  │
│ Ford Mustang   ┆ 35000 ┆ New       ┆ 2023 ┆ Gasoline  │
└────────────────┴───────┴───────────┴──────┴───────────┘

Conclusion

In this topic, we covered the basics of data reading with Pandas, exploring essential methods such as .head(), .info(), and .describe(). We also learned how to read data from various file formats, including Excel, CSV, JSON, and SQL databases. Additionally, we briefly introduced the Polars library for data reading as an alternative to Pandas. Now you have a solid foundation to start loading and exploring data in Python for further analysis and manipulation.

Read more on this topic in Exploring Pandas Library for Python on Hyperskill Blog.

20 learners liked this piece of theory. 0 didn't like it. What about you?
Report a typo