HR Data Analysis. Stage 2/5

Merge everything

Report a typo

Description

Your datasets are ready for merging! It will make data analysis much more straightforward.

Use concatenation to generate a dataset with information from both offices. Use the left merging by index to merge the previously created dataset with the HR's dataset. When joining, generate a column containing information about each row's origin. Keep only those employees' records that are present in both datasets.

The data structure of tables are the same as described in the previous stage:

Objectives

It's time to create one comprehensive dataset! After successful reindexing in previous stage, continue expanding your program by adding the following functionality :

  1. Use the concat() function from pandas to generate a unified dataset for offices A and B;

  2. Use df.merge() to carry out the left merging of the unified office dataset with HR's dataset (the order of merging matters). Merge both datasets by index — use both left_index and right_index parameters. Set indicator=True in df.merge().

  3. For the final table, leave only those employees whose data is contained in both datasets. You can do this by evaluating for either null values of certain columns or look at the new column generated by the merge function;

  4. Drop the employee_office_id, employee_id columns and the new column that contains a row source (generated by the merge function);

  5. Sort the final dataset by index. Keep in mind, by sorting the data, a row with index A100 will come before A4 because of string comparison;

  6. Print two Python lists: the final DataFrame index and the column names. Output each list on a separate line.

Example

  • Office A data (most of the columns are omitted)

           number_project  employee_office_id
    A2                  5                   2
    A4                  5                   4
    A6                  2                   6
  • Office B data (most of the columns are omitted)

           number_project  employee_office_id
    B1                  6                   1
    B3                  4                   3
    B5                  5                   5
  • HR data (most of the columns are omitted)

                 satisfaction_level employee_id
    employee_id
    A1                         0.38          A1
    A4                         0.72          A4
    B1                         0.35          B1
    B5                         0.91          B5
  • The resulting dataset (most of the columns are omitted)

        number_project  satisfaction_level
    A4               5                0.72
    B1               6                0.35
    B5               5                0.91

Output:

['A4', 'B1', 'B5']
['number_project', 'satisfaction_level']
Write a program
IDE integration
Checking the IDE status
___

Create a free account to access the full topic