Stage implement · Load the data and modify the indexes

HR Data Analysis. Stage 1/5

Load the data and modify the indexes

Report a typo

Theory

In this project, you'll be working with datasets represented in the XML format. XML is an eXtensible Markup Language, one of the most popular formats for storing and exchanging data. Let's look at an example:

<data>
  <row>
    <number_project>5</number_project>
    <average_monthly_hours>262</average_monthly_hours>
    <time_spend_company>6</time_spend_company>
    <Work_accident>0</Work_accident>
    <promotion_last_5years>0</promotion_last_5years>
    <Department>sales</Department>
    <salary>medium</salary>
    <employee_office_id>2</employee_office_id>
  </row>

Yeah... Not user-friendly, right? The good news is that pandas has a function to read and process the XML format in one code line.

Description

Your HR boss gave you three datasets. The first two are from different offices: A and B (A_office_data.xml and B_office_data.xml, respectively); the third is the HR dataset (hr_data.xml). The guy wants you to investigate the data. The first thing you need to do is to check and reindex it for further stages.

The HR boss has also supplied you with some descriptions:

A_office_data.xml and B_office_data.xml:
- number_project — number of projects an employee has worked on;
- average_monthly_hours — typical workload per month in hours;
- time_spend_company — how many years an employee has worked in the company;
- Work_accident — whether an employee has had an injury at work;
- promotion_last_5years — whether an employee has had any promotions during the last five years;
- Department — employee's department;
- salary — employee's salary rate;
- employee_office_id — employee's ID (1, 2, 3, etc.).
hr_data.xml:
- satisfaction_level — how well an employee performs their job;
- last_evaluation — the last evaluation score of an employee;
- left — whether an employee has left the company;
- employee_id — employee's ID in the company (A125 — from the A office; 125 in this case, is employee_office_id).

We have automated the data download process in the .py file provided to you. The files will be downloaded in the Data folder, so make sure you use the correct file path to access the files. However, if that is inconvenient, please download the datasets: A_office_data.xml, B_office_data.xml, and hr_data.xml.

Objectives

In this stage, load the datasets, study and reindex them.

Load all three datasets. Remember, data is stored in the XML format. By default, the files are loaded in the Data folder, so you can use the read_xml('../Data/' + file_name) function from pandas;
Explore the data. Check what the axes are, get the shapes of the DataFrames, use df.info() to check whether there are null values, and overview the data types;
Reindex all three datasets. It is required because some of the employee_office_id column values for offices A and B are same, which results in data conflict while merging in future stages. For HR data, use the employee_id column as the index. For offices A and B, use the name of the office and the employee_office_id column to create indexes. For example, for office A, index of employee #125 will be A125. The offices' data index should resemble HR's data index.
Print three Python lists containing office A, B, and HR data indexes.

Example

Before reindexing (most of the columns are omitted to avoid cluttering):

A_office_data.xml

      number_project  employee_office_id
0                  5                   2
1                  5                   4
2                  2                   6
3                  6                   7
4                  5                   9

B_office_data.xml

      number_project  employee_office_id
0                  6                   1
1                  4                   3
2                  5                   5
3                  2                   6
4                  3                  11

hr_data.xml

      satisfaction_level employee_id
0                   0.38          A1
1                   0.11          A3
2                   0.72          A4
3                   0.37          A5
4                   0.10          A7

After reindexing (most of the columns are omitted to avoid cluttering):

A_office_data.xml

       number_project  employee_office_id
A2                  5                   2
A4                  5                   4
A6                  2                   6
A7                  6                   7
A9                  5                   9

B_office_data.xml

       number_project  employee_office_id
B1                  6                   1
B3                  4                   3
B5                  5                   5
B6                  2                   6
B11                 3                  11

hr_data.xml

             satisfaction_level employee_id
employee_id
A1                         0.38          A1
A3                         0.11          A3
A4                         0.72          A4
A5                         0.37          A5
A7                         0.10          A7

Output:

['A2', 'A4', 'A6', 'A7', 'A9']
['B1', 'B3', 'B5', 'B6', 'B11']
['A1', 'A3', 'A4', 'A5', 'A7']

Write a program

IDE100

___

Create a free account to access the full topic

Topics in stage