Description
Your datasets are ready for merging! It will make data analysis much more straightforward.
Use concatenation to generate a dataset with information from both offices. Use the left merging by index to merge the previously created dataset with the HR's dataset. When joining, generate a column containing information about each row's origin. Keep only those employees' records that are present in both datasets.
The data structure of tables are the same as described in the previous stage:
Objectives
It's time to create one comprehensive dataset! After successful reindexing in previous stage, continue expanding your program by adding the following functionality :
Use the
concat()function frompandasto generate a unified dataset for offices A and B;Use
df.merge()to carry out the left merging of the unified office dataset with HR's dataset (the order of merging matters). Merge both datasets by index — use bothleft_indexandright_indexparameters. Setindicator=Trueindf.merge().For the final table, leave only those employees whose data is contained in both datasets. You can do this by evaluating for either null values of certain columns or look at the new column generated by the merge function;
Drop the
employee_office_id,employee_idcolumns and the new column that contains a row source (generated by the merge function);Sort the final dataset by index. Keep in mind, by sorting the data, a row with index
A100will come beforeA4because of string comparison;Print two Python lists: the final DataFrame index and the column names. Output each list on a separate line.
Example
Office A data (most of the columns are omitted)
number_project employee_office_id A2 5 2 A4 5 4 A6 2 6Office B data (most of the columns are omitted)
number_project employee_office_id B1 6 1 B3 4 3 B5 5 5HR data (most of the columns are omitted)
satisfaction_level employee_id employee_id A1 0.38 A1 A4 0.72 A4 B1 0.35 B1 B5 0.91 B5The resulting dataset (most of the columns are omitted)
number_project satisfaction_level A4 5 0.72 B1 6 0.35 B5 5 0.91
Output:
['A4', 'B1', 'B5']
['number_project', 'satisfaction_level']