Reading data is one of the first steps in any machine learning pipeline. In this topic, we will learn how to properly read different types of data in Pandas.
Types of data to read: Excel, CSV, JSON, SQL
In Pandas, you can read data from various sources, such as Excel, CSV, and JSON files, as well as SQL databases. Each format has its own read function: you append the format name to read_, so to read CSV files you call pandas.read_csv(), and so on. In every case, the data is loaded as a DataFrame.
import pandas as pd
# EXCEL
df = pd.read_excel('file_path.xlsx')
# CSV
df = pd.read_csv('file_path.csv')
# JSON
df = pd.read_json('file_path.json')

To read data from a SQL database, you'll need both the pandas and sqlalchemy libraries installed. You can then pass a SQL query to the pd.read_sql() function to fetch data from the database.
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('database_connection_string')
query = "SELECT * FROM table_name"
df = pd.read_sql(query, engine)
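If you want to try this locally without a running database server, a minimal sketch with SQLite works well. The file name cars.db and the cars table below are assumptions for illustration, not part of the example above.

import pandas as pd
from sqlalchemy import create_engine

# SQLite keeps the whole database in a single local file,
# so no server setup is needed (the file name is hypothetical).
engine = create_engine('sqlite:///cars.db')

# Any valid SQL query can be used; here we assume a table named "cars".
query = "SELECT * FROM cars WHERE price > 30000"
df = pd.read_sql(query, engine)
print(df.head())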
Exploring the data: .head(), .info(), .describe()

First, let's create an example pandas DataFrame with information about cars: their names, prices, and whether they are broken.
import pandas as pd
data = {
'Car Name': ['Toyota Corolla', 'Honda Civic', 'Ford Mustang', 'Chevrolet Camaro', 'Tesla Model 3', 'BMW X5', 'Audi A4'],
'Price': [25000, 22000, 35000, 40000, 50000, 60000, 30000],
'Is Broken': [False, True, True, False, True, False, False]
}
df_cars = pd.DataFrame(data)

           Car Name  Price  Is Broken
0    Toyota Corolla  25000      False
1       Honda Civic  22000       True
2      Ford Mustang  35000       True
3  Chevrolet Camaro  40000      False
4     Tesla Model 3  50000       True
5            BMW X5  60000      False
6           Audi A4  30000      False

When you first get your DataFrame, you need to understand the information it contains. There are three main methods for this: .head(), .info(), and .describe().
The .head() method in Pandas is a useful way to preview the first few rows of a DataFrame. It is commonly used to get a quick overview of the data and understand its structure, column names, and actual values. You can pass the number of rows to display via the parameter n; by default, the first 5 rows are shown.

df_cars.head(n=3)  # display the first 3 rows

         Car Name  Price  Is Broken
0  Toyota Corolla  25000      False
1     Honda Civic  22000       True
2    Ford Mustang  35000       True
The .info() method in Pandas displays information about a DataFrame: the data type of each column, the number of non-null values, memory usage, and more.

df_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Car Name   7 non-null      object
 1   Price      7 non-null      int64
 2   Is Broken  7 non-null      bool
dtypes: bool(1), int64(1), object(1)
memory usage: 247.0+ bytes

When we called df.info(), it provided the following information:

- <class 'pandas.core.frame.DataFrame'>: indicates that df is a DataFrame.
- RangeIndex: 7 entries, 0 to 6: shows that the DataFrame has a range index from 0 to 6, i.e. 7 rows.
- Data columns (total 3 columns): indicates that the DataFrame has 3 columns.
- Column: lists the column names ('Car Name', 'Price', and 'Is Broken').
- Non-Null Count: shows the number of non-null values in each column. In our example, there are no null values.
- Dtype: specifies the data type of each column. 'Car Name' is of type 'object' (usually representing strings), 'Price' is of type 'int64' (integer), and 'Is Broken' is of type bool. Note that column types are sometimes inferred incorrectly, so further type inspection and conversion may be required. You can set the column types with the dtype argument while reading the file, or use df.astype() after loading the DataFrame, as sketched after this list.
- memory usage: displays the memory usage of the DataFrame.
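Here is a minimal sketch of both approaches. The file name 'file_path.csv' and the chosen dtypes are assumptions for illustration, not part of the example above.

import pandas as pd

# Option 1: declare a column's type up front while reading the file
# (keys are column names, values are the desired dtypes).
df = pd.read_csv('file_path.csv', dtype={'Price': 'float64'})

# Option 2: convert a column after the DataFrame is loaded.
df = df.astype({'Price': 'int64'})

df.info()  # check that the dtypes are what you expect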
The .describe() method in Pandas generates summary statistics for the numerical columns of a DataFrame: the count, mean, standard deviation, minimum, maximum, the first, second, and third quartiles, and so on.

df_cars.describe()

              Price
count      7.000000
mean   37428.571429
std    13709.572603
min    22000.000000
25%    27500.000000
50%    35000.000000
75%    45000.000000
max    60000.000000
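By default only numeric columns are summarized. If you also want statistics for the non-numeric columns, such as the number of unique values and the most frequent one, you can pass include='all'. A minimal sketch:

# Summarize every column, not just the numeric ones; non-numeric
# columns get count, unique, top (most frequent value), and freq.
df_cars.describe(include='all')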
Data file options: sep (delimiter), header, index_col
For any of these formats, you can control how the data is read with parameters such as sep, header, and index_col. Let's download a CSV file with information about different cars via the link to the file.
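If you cannot download the file, the following sketch recreates a similar cars.csv locally. The rows are reconstructed from the outputs shown below, so treat it as an assumption rather than the exact original file.

import pandas as pd

# Build a small cars.csv matching the columns used in the examples below
# (assumed content, not the exact downloaded file).
cars = pd.DataFrame({
    'Car Name': ['Toyota Corolla', 'Honda Civic', 'Ford Mustang',
                 'Chevrolet Camaro', 'Tesla Model 3'],
    'Price': [25000, 22000, 35000, 40000, 50000],
    'Condition': ['New', 'Used', 'New', 'Used', 'New'],
    'Year': [2023, 2021, 2023, 2020, 2023],
    'Fuel Type': ['Gasoline', 'Gasoline', 'Gasoline', 'Gasoline', 'Electric'],
})
cars.to_csv('./cars.csv', index=False)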
df_cars = pd.read_csv('./cars.csv')

sep (equal to delimiter): the sep parameter is used when reading files with a custom delimiter. For example, if you have a CSV file that uses a different delimiter such as a semicolon ';', you can specify it with the sep parameter. In our file the columns are separated by ',', and we can pass that explicitly.

df_cars = pd.read_csv('./cars.csv', sep=',')
Note that using the wrong sep can lead to the file being read incorrectly. For instance, if we use ';' as the delimiter here, we get a single column with all values merged together.

df_cars = pd.read_csv('./cars.csv', sep=';')

  Car Name,Price,Condition,Year,Fuel Type
0   Toyota Corolla,25000,New,2023,Gasoline
1    Honda Civic,22000,Used,2021,Gasoline
The header parameter allows you to specify which row of the file should be used as the header (column names). By default it is set to 0, so the first row becomes the header. Let's use the row at index 3 (the fourth line of the file) as the header instead.

df_cars = pd.read_csv('./cars.csv', header=3)

       Ford Mustang  35000   New  2023  Gasoline
0  Chevrolet Camaro  40000  Used  2020  Gasoline
1     Tesla Model 3  50000   New  2023  Electric
The index_col parameter specifies which column should be used as the index of the DataFrame. It accepts an integer position or a column name. In our example, we can set Car Name as the index column.

df_cars = pd.read_csv('./cars.csv', index_col='Car Name')

                Price Condition  Year Fuel Type
Car Name
Toyota Corolla  25000       New  2023  Gasoline
Honda Civic     22000      Used  2021  Gasoline
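These options can also be combined in a single call. A minimal sketch (the values simply restate the defaults and choices discussed above):

# Read the file with an explicit separator, take the first row as the
# header, and use the 'Car Name' column as the index, all at once.
df_cars = pd.read_csv(
    './cars.csv',
    sep=',',               # column separator (the default for CSV)
    header=0,              # row to use as column names (the default)
    index_col='Car Name',  # column to use as the DataFrame index
)
df_cars.head()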
The Polars library for reading data
In the previous examples we used the Pandas library; however, it can be inefficient when working with large files. One alternative is Polars, a fast and efficient DataFrame library for data manipulation, similar to Pandas. It handles large datasets with excellent performance, and most of its functions and methods closely mirror those of Pandas. First, you need to install the Polars library.
pip install polars

import polars as pl

df_cars = pl.read_csv('./cars.csv')

Note that when displaying the DataFrame you will also see the data type of each column.
shape: (3, 5)
┌────────────────┬───────┬───────────┬──────┬───────────┐
│ Car Name ┆ Price ┆ Condition ┆ Year ┆ Fuel Type │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 ┆ str │
╞════════════════╪═══════╪═══════════╪══════╪═══════════╡
│ Toyota Corolla ┆ 25000 ┆ New ┆ 2023 ┆ Gasoline │
│ Honda Civic ┆ 22000 ┆ Used ┆ 2021 ┆ Gasoline │
│ Ford Mustang ┆ 35000 ┆ New ┆ 2023 ┆ Gasoline │
└────────────────┴───────┴───────────┴──────┴───────────┘
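For genuinely large files, Polars also offers a lazy API that only reads the data it needs. A minimal sketch (the filter condition is an illustrative assumption):

import polars as pl

# scan_csv builds a lazy query plan instead of loading the whole file;
# nothing is read until .collect() is called.
lazy_cars = pl.scan_csv('./cars.csv')

expensive = (
    lazy_cars
    .filter(pl.col('Price') > 30000)   # hypothetical filter for illustration
    .select(['Car Name', 'Price'])     # keep only the columns we need
    .collect()                         # execute the plan, return a DataFrame
)
print(expensive)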
Conclusion

In this topic, we covered the basics of reading data with Pandas and explored essential methods such as .head(), .info(), and .describe(). We also learned how to read data from various file formats, including Excel, CSV, JSON, and SQL databases, and briefly introduced the Polars library as an alternative for reading data. You now have a solid foundation for loading and exploring data in Python for further analysis and manipulation.
Read more on this topic in Exploring Pandas Library for Python on Hyperskill Blog.