# Exploring Pandas Library for Python

The pandas library is a quick, robust, user-friendly tool for analyzing and manipulating data. It provides many methods and functions for efficiently working with tabular data, making it a crucial tool for data professionals and machine learning engineers with high performance.

The pandas library supports the import and export of data in various file formats, such as ** .csv**,

**,**

*.xslx***, and**

*.zip***. This flexibility enables pandas' integration into diverse data analysis pipelines, accommodating diverse data sources and destinations, which is essential for practical source data analysis..**

*.sql*This article will explore the pandas library; we will take the real-world data from *recent-grads.csv* file on College Majors. We will work with the two primary data structures of the Pandas library, Series and DataFrame, to provide answers to various questions. We will see how method chaining enables us to write concise, readable code. Recognizing the power of visualization, we will extensively employ visual aids to communicate our results.

## Import the Libraries and Data

In addition to the pandas library, we will mainly use the numpy library alongside the matplotlib and seaborn visualization libraries. The matploblib library is the default visualization library for pandas. However, you can change this to other popular visualization libraries. You can use their backend extension for those not expressly supported, such as Holoviews.

Install the Holoviews library:

Then import the libraries:

Several visualization libraries are interoperable with pandas version. The default is matplotlib; let’s show where to find this information and how to change it from the default.

Run the following code to view the default visualization library:

To change to your preferred backend:

To use a backend, you must understand how to visualize in that library. In this article, we will use the matplotlib backend.

We can import the data as a pandas DataFrame with the following code example:

The data has many rows and columns. We can not view all the columns and rows because of their optional parameters being set on default. Let’s change that for the columns with the code block:

The pandas DataFrame is in two dimensions, unlike the one-dimensional pandas Series. The DataFrame, therefore, has rows and columns. We have set the maximum number of columns to be displayed to the number of columns in the DataFrame.

Let's establish a consistent visualization style and background for all plots.

## Women and their majors

We will start with a pandas Series. We get a pandas series when we remove a single row from a pandas DataFrame.

To answer the question of what majors are popular with women and which are not, let's take two columns in the **df **DataFrame: **Major** and **ShareWomen**. The **Major** column contains the list of majors, and the **ShareWomen** column includes the percentage of women who graduated from a particular major.

We said that series are one-dimensional. We will set the **Major** as the index to get our series, and **ShareWomen** will be the only column in the Series. We will then use the **nsmallest **and **nlargest **methods to get the rows with the lowest and highest percentages of women. Since an index is associated with a row, we get to know the majors that are least and most popular with women:

like this

**Level up Pandas skills and advance your career**

• Wide range of learning tracks for beginners and experienced developers

• Study at your own pace with your personal study plan

• Focus on practice and real-world experience

## Lucrative Majors and Gender Distribution

To find which majors are more lucrative, set the **Major** column name as the index of the DataFrame. Then, we will get a subset of the Series popular with men and women using .loc. After that, we will get only the **Median** column from the DataFrame. It creates a Series with Median income values, keeping the original index. We can visualize the results with a plot.

We can analyze which majors have the highest and lowest unemployment rates or the most part-time jobs.

## Salary bands

Imagine this messy dataset contains the most recent graduate job and salary information. You are in the process of picking your major and want to choose one with higher prospects of securing a well-paying job.

To perform this analysis, you will first divide the median salary for the major categories into bands. The **Major_category** column is like a school that contains several departments or majors. You will see the typical median starting salaries for people in these schools (pay attention to the square brackets):

The **assign** method creates a new single column in addition to the original ones in **df** without altering it. In other words, the **assign** method helps us create a new DataFrame object. The **pd.cut** function divides the median salaries into brackets. The **groupby** method creates a multi-index object that groups the rows in the DataFrame into their **salary_range** and **Major_category** values. We get a multi-index Series to find the mean of the **Median** salaries. We use the **unstack** method to convert this multi-index series into a DataFrame.

We can see from the results the median salaries of major categories such as Engineering, Computers and Mathematics, and Law and Public Policy are above the 30k starting salary band.

We can see the variations in median salaries for selected major categories with the code block:

## High wages and low unemployment

Knowing that it pays well is not enough if you're considering a major. You also want to make sure that job opportunities are abundant and long-lasting. Therefore, it's crucial to identify the majors that offer high salaries and low unemployment rates.

In the above code, we created new single columns for **earning_quantile** and **unemployment_quantile**. These columns contain the median income and Unemployment rate numerical values in quantiles. The **where** method can filter the resulting DataFrame for rows with the median income in the 90th quantile and the unemployment rate in the 10th quantile.

We can see that majors like Mathematics & Computer Science and Botany have zero unemployment rates. We can also notice Petroleum Engineering, the major with the highest median salary, does not appear here. This is because this major has an unemployment rate above the 10th quantile. Why demand a high wage for a job that is uncertain or unstable?

## The .str accessor and Correlation Plot

You can use the **.str** accessor with pandas to perform operations on strings. You can access all the methods and functions you use with Python strings. You can answer the following questions with the **.str** accessor:

You can also use pandas package to perform descriptive statistics like skewness, mean, median, and correlation. In the following code, we performed a correlation plot for a subset of the DataFrame:

## Conclusion

This article delved into the capabilities of the pandas library for Python, demonstrating its effectiveness in handling and analyzing data. Through practical examples using the recent-grads csv file, we explored various aspects of data manipulation operations, including filtering and aggregation. Visualization libraries enhanced our understanding of the data, providing insights into salary trends, unemployment rates, and the popularity of college majors.

The article emphasized the importance of considering earnings potential and job security when choosing a major. The provided code snippets enable the techniques to extract meaningful insight from data.

For further practice with the pandas package for data preprocessing, data manipulation methods, data analysis, and data visualization, consider exploring the topics and projects available in our Python tracks: Pandas for Data Analysis and Introduction to Data Science learning tracks.