Theory
In this project, you'll be working with datasets represented in the XML format. XML is an eXtensible Markup Language, one of the most popular formats for storing and exchanging data. Let's look at an example:
<data>
<row>
<number_project>5</number_project>
<average_monthly_hours>262</average_monthly_hours>
<time_spend_company>6</time_spend_company>
<Work_accident>0</Work_accident>
<promotion_last_5years>0</promotion_last_5years>
<Department>sales</Department>
<salary>medium</salary>
<employee_office_id>2</employee_office_id>
</row>Yeah... Not user-friendly, right? The good news is that pandas has a function to read and process the XML format in one code line.
Description
Your HR boss gave you three datasets. The first two are from different offices: A and B (A_office_data.xml and B_office_data.xml, respectively); the third is the HR dataset (hr_data.xml). The guy wants you to investigate the data. The first thing you need to do is to check and reindex it for further stages.
The HR boss has also supplied you with some descriptions:
A_office_data.xml and B_office_data.xml:
number_project— number of projects an employee has worked on;average_monthly_hours— typical workload per month in hours;time_spend_company— how many years an employee has worked in the company;Work_accident— whether an employee has had an injury at work;promotion_last_5years— whether an employee has had any promotions during the last five years;Department— employee's department;salary— employee's salary rate;employee_office_id— employee's ID (1, 2, 3, etc.).
hr_data.xml:
satisfaction_level— how well an employee performs their job;last_evaluation— the last evaluation score of an employee;left— whether an employee has left the company;employee_id— employee's ID in the company (A125— from the A office;125in this case, isemployee_office_id).
We have automated the data download process in the .py file provided to you. The files will be downloaded in the Data folder, so make sure you use the correct file path to access the files. However, if that is inconvenient, please download the datasets: A_office_data.xml, B_office_data.xml, and hr_data.xml.
Objectives
In this stage, load the datasets, study and reindex them.
Load all three datasets. Remember, data is stored in the XML format. By default, the files are loaded in the
Datafolder, so you can use theread_xml('../Data/' + file_name)function frompandas;Explore the data. Check what the axes are, get the shapes of the DataFrames, use
df.info()to check whether there are null values, and overview the data types;Reindex all three datasets. It is required because some of the
employee_office_idcolumn values for offices A and B are same, which results in data conflict while merging in future stages. For HR data, use theemployee_idcolumn as the index. For offices A and B, use the name of the office and theemployee_office_idcolumn to create indexes. For example, for office A, index of employee #125 will beA125. The offices' data index should resemble HR's data index.Print three Python lists containing office A, B, and HR data indexes.
Example
Before reindexing (most of the columns are omitted to avoid cluttering):
A_office_data.xml
number_project employee_office_id 0 5 2 1 5 4 2 2 6 3 6 7 4 5 9B_office_data.xml
number_project employee_office_id 0 6 1 1 4 3 2 5 5 3 2 6 4 3 11hr_data.xml
satisfaction_level employee_id 0 0.38 A1 1 0.11 A3 2 0.72 A4 3 0.37 A5 4 0.10 A7
After reindexing (most of the columns are omitted to avoid cluttering):
A_office_data.xml
number_project employee_office_id A2 5 2 A4 5 4 A6 2 6 A7 6 7 A9 5 9B_office_data.xml
number_project employee_office_id B1 6 1 B3 4 3 B5 5 5 B6 2 6 B11 3 11hr_data.xml
satisfaction_level employee_id employee_id A1 0.38 A1 A3 0.11 A3 A4 0.72 A4 A5 0.37 A5 A7 0.10 A7
Output:
['A2', 'A4', 'A6', 'A7', 'A9']
['B1', 'B3', 'B5', 'B6', 'B11']
['A1', 'A3', 'A4', 'A5', 'A7']