Table of contents
Text Link

Exploring Pandas Library for Python

The pandas library is a quick, robust, user-friendly tool for analyzing and manipulating data. It provides many methods and functions for efficiently working with tabular data, making it a crucial tool for data professionals and machine learning engineers with high performance.

The pandas library supports the import and export of data in various file formats, such as .csv, .xslx, .zip, and .sql. This flexibility enables pandas' integration into diverse data analysis pipelines, accommodating diverse data sources and destinations, which is essential for practical source data analysis..

This article will explore the pandas library; we will take the real-world data from recent-grads.csv file on College Majors. We will work with the two primary data structures of the Pandas library, Series and DataFrame, to provide answers to various questions. We will see how method chaining enables us to write concise, readable code. Recognizing the power of visualization, we will extensively employ visual aids to communicate our results.

Import the Libraries and Data

In addition to the pandas library, we will mainly use the numpy library alongside the matplotlib and seaborn visualization libraries. The matploblib library is the default visualization library for pandas. However, you can change this to other popular visualization libraries. You can use their backend extension for those not expressly supported, such as Holoviews.

Install the Holoviews library:

!pip install holoviews

Then import the libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Import holoviews

Several visualization libraries are interoperable with pandas version. The default is matplotlib; let’s show where to find this information and how to change it from the default.

Run the following code to view the default visualization library:

print(pd.get_option("plotting.backend"))

To change to your preferred backend:

# change to a Plotly backend
pd.set_option("plotting.backend", "plotly")

# change to a Holoviews backend
pd.set_option("plotting.backend", holoviews.extension("bokeh"))

# change back to Matplotlib backend
pd.set_option("plotting.backend", "matplotlib")

To use a backend, you must understand how to visualize in that library. In this article, we will use the matplotlib backend.

We can import the data as a pandas DataFrame with the following code example:

df = pd.read_csv("recent-grads.csv")

The data has many rows and columns. We can not view all the columns and rows because of their optional parameters being set on default. Let’s change that for the columns with the code block:

pd.set_option("display.max_columns", df.shape[1])

The pandas DataFrame is in two dimensions, unlike the one-dimensional pandas Series. The DataFrame, therefore, has rows and columns. We have set the maximum number of columns to be displayed to the number of columns in the DataFrame.

Let's establish a consistent visualization style and background for all plots.

sns.set_style("darkgrid")


plt.style.use("dark_background")

Women and their majors

We will start with a pandas Series. We get a pandas series when we remove a single row from a pandas DataFrame.

To answer the question of what majors are popular with women and which are not, let's take two columns in the df DataFrame: Major and ShareWomen. The Major column contains the list of majors, and the ShareWomen column includes the percentage of women who graduated from a particular major.

We said that series are one-dimensional. We will set the Major as the index to get our series, and ShareWomen will be the only column in the Series. We will then use the nsmallest and nlargest methods to get the rows with the lowest and highest percentages of women. Since an index is associated with a row, we get to know the majors that are least and most popular with women:

# set the subplots to be side-by-side
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20, 4), sharey=True)

# find the least popular majors with women and plot the bar chart
(
    df.set_index("Major")
    .ShareWomen
    .nsmallest(10)
    .plot(kind='bar', ax=axs[0], alpha=0.7, color="green", title="The majors with least Women", xlabel="")
)

# find the most popular majors with women and plot the bar chart
(
    df.set_index("Major")
    .ShareWomen
    .nlargest(10)
    .plot(kind='bar', ax=axs[1], alpha=0.7, color="olive", title="The majors with more Women", xlabel="")
)

plt.show()

Lucrative Majors and Gender Distribution

To find which majors are more lucrative, set the Major column name as the index of the DataFrame. Then, we will get a subset of the Series popular with men and women using .loc. After that, we will get only the Median column from the DataFrame. It creates a Series with Median income values, keeping the original index. We can visualize the results with a plot.

# set the subplots to be side-by-side
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20, 4), sharey=True)

# get the most popular majors with men and women
index_popular_women = df.set_index("Major").ShareWomen.nlargest(10).index
index_popular_men = (1 - df.set_index("Major").ShareWomen).nlargest(10).index

# get the median income for the majors popular with women
(
    df.set_index("Major")
    .loc[index_popular_women]
    .Median
    .sort_values()
    .plot(kind="bar", ax=axs[0], alpha=0.9, color="red", title="The median income of majors with more Women", xlabel="")
)

# get the median income for the majors popular with men
(
    df.set_index("Major")
    .loc[index_popular_men]
    .Median
    .sort_values()
    .plot(kind="bar", ax=axs[1], alpha=0.7, color="orange", title="The median income of majors with more Men", xlabel="")
)

plt.show()

We can analyze which majors have the highest and lowest unemployment rates or the most part-time jobs.

Salary bands

Imagine this messy dataset contains the most recent graduate job and salary information. You are in the process of picking your major and want to choose one with higher prospects of securing a well-paying job.

To perform this analysis, you will first divide the median salary for the major categories into bands. The Major_category column is like a school that contains several departments or majors. You will see the typical median starting salaries for people in these schools (pay attention to the square brackets):

(
    df
    .assign(salary_range=pd.cut(df.Median, bins=[0, 30_000, 50_000, 70_000, 100_000], labels=['<30k', '30-50k', '50-70k', '70k+']))
    .groupby(['salary_range', "Major_category"])["Median"].mean().unstack()
    .style.background_gradient(cmap='PuBu')
)

The assign method creates a new single column in addition to the original ones in df without altering it. In other words, the assign method helps us create a new DataFrame object. The pd.cut function divides the median salaries into brackets. The groupby method creates a multi-index object that groups the rows in the DataFrame into their salary_range and Major_category values. We get a multi-index Series to find the mean of the Median salaries. We use the unstack method to convert this multi-index series into a DataFrame.

We can see from the results the median salaries of major categories such as Engineering, Computers and Mathematics, and Law and Public Policy are above the 30k starting salary band.

We can see the variations in median salaries for selected major categories with the code block:

fig, ax = plt.subplots(nrows=1, ncols=5, figsize=(20, 6), sharey=True)
ax = ax.ravel()

for i, category in enumerate(["Engineering", "Computers & Mathematics", "Physical Sciences", "Law & Public Policy", "Business"]):
    for major_cat, group_code in df.groupby("Major_category")["Median"]:
        if category == major_cat:
            ax[i].violinplot(group_code, showmeans=True, showextrema=True)
            ax[i].set_title(major_cat)

plt.tight_layout()
plt.show()

High wages and low unemployment

Knowing that it pays well is not enough if you're considering a major. You also want to make sure that job opportunities are abundant and long-lasting. Therefore, it's crucial to identify the majors that offer high salaries and low unemployment rates.

(
    df
    .assign(
        earning_quantile=pd.qcut(df.groupby("Major")["Median"].transform(func=np.mean), q=[0.0, 0.25, 0.50, 0.90, 1.0], labels=[25, 50, 90, 100]),
        unemployment_quantile=pd.qcut(df.groupby("Major")["Unemployment_rate"].transform(func=np.mean), q=[0.0, 0.1, 0.50, 0.75, 1.0], labels=[10, 50, 75, 100])
    )
    .where(lambda x: (x.earning_quantile == 90) & (x.unemployment_quantile == 10))
    [["Major", "Median", "Unemployment_rate"]]
    .dropna()
    .assign(
        Unemployment_rate=lambda x: x.Unemployment_rate * 100
    )
    .set_index("Major")
    .style.background_gradient(cmap='PuBu')
)

In the above code, we created new single columns for earning_quantile and unemployment_quantile. These columns contain the median income and Unemployment rate numerical values in quantiles. The where method can filter the resulting DataFrame for rows with the median income in the 90th quantile and the unemployment rate in the 10th quantile.

We can see that majors like Mathematics & Computer Science and Botany have zero unemployment rates. We can also notice Petroleum Engineering, the major with the highest median salary, does not appear here. This is because this major has an unemployment rate above the 10th quantile. Why demand a high wage for a job that is uncertain or unstable?

The .str accessor and Correlation Plot

You can use the .str accessor with pandas to perform operations on strings. You can access all the methods and functions you use with Python strings. You can answer the following questions with the .str accessor:

# Which major contains the word "Science" with the code:
df.Major[df['Major'].str.contains('Science', case=False)]

# Which major has the longest name:
df.loc[df['Major'].str.len().idxmax(), 'Major']

# Which major ends with "studies"
df[df['Major'].str.endswith('STUDIES')].Major

You can also use pandas package to perform descriptive statistics like skewness, mean, median, and correlation. In the following code, we performed a correlation plot for a subset of the DataFrame:

# get the Pearson correlation matrix
corr_matrix = df.iloc[:, 6:-7].corr()

# get the upper triangle of the correlation matrix
upper_triangle = np.triu(corr_matrix)

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, mask=upper_triangle, annot=True, cmap='YlGnBu', fmt=".2f")
plt.title('Upper Triangular Correlation Heatmap')
plt.show()

Conclusion

This article delved into the capabilities of the pandas library for Python, demonstrating its effectiveness in handling and analyzing data. Through practical examples using the recent-grads csv file, we explored various aspects of data manipulation operations, including filtering and aggregation. Visualization libraries enhanced our understanding of the data, providing insights into salary trends, unemployment rates, and the popularity of college majors.

The article emphasized the importance of considering earnings potential and job security when choosing a major. The provided code snippets enable the techniques to extract meaningful insight from data.

For further practice with the pandas package for data preprocessing, data manipulation methods, data analysis, and data visualization, consider exploring the topics and projects available in our Python tracks: Pandas for Data Analysis and Introduction to Data Science learning tracks.

Related Hyperskill topics

Share this article
Get more articles
like this
Thank you! Your submission has been received!
Oops! Something went wrong.

Create a free account to access the full topic

Wide range of learning tracks for beginners and experienced developers
Study at your own pace with your personal study plan
Focus on practice and real-world experience
Andrei Maftei
It has all the necessary theory, lots of practice, and projects of different levels. I haven't skipped any of the 3000+ coding exercises.