Salary Prediction. Stage 4/5

Test for multicollinearity and variable selection

Description

If your linear regression has many variables, some of them may be correlated with each other, which can degrade the model's performance. A crucial step is therefore to check the model for multicollinearity and exclude the variables that are strongly correlated with other variables. Carry out this check, find the best model by removing the highly correlated variables, and return its MAPE score.
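For reference, MAPE (mean absolute percentage error) averages the ratio |y_true - y_pred| / |y_true| over the test set. A minimal sketch with scikit-learn, using made-up numbers rather than the project data, is shown below; note that scikit-learn returns the value as a fraction, not a percentage.

import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Toy values for illustration only; they are not taken from the project dataset.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# (0.1 + 0.05 + 0.1) / 3
print(mean_absolute_percentage_error(y_true, y_pred))  # 0.08333...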

Objectives

  1. Read the data. To download the dataset, refer to Stage 1.
  2. Load the data with pandas.read_csv.
  3. Calculate the correlation matrix for the numeric variables.
  4. Find the variables where the correlation coefficient is greater than 0.2. Hint: there should be three of them.
  5. Make X, a DataFrame with all the predictor variables, and y, a series with the target.
  6. Split the predictors and the target into training and test sets. Use test_size=0.3 and random_state=100; these values guarantee that the results will be as expected.
  7. Fit linear models for salary prediction based on subsets of the other variables. The subsets are as follows:
    • First, remove each of the three variables you found in step 4, one at a time.
    • Second, remove each possible pair of these three variables.
    For example, if you have found that the highly correlated variables are a, b, and c, first fit a model with a removed, then a model without b, and then a model without c. After that, estimate the model without both a and b, then without both b and c, and finally without both a and c. As a result, you will have six models to choose the best one from.
  8. Make predictions and print the lowest MAPE. The MAPE is a floating-point number rounded to five decimal places. A sketch of the whole workflow follows this list.
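Below is a minimal end-to-end sketch of steps 1-8. The file name data.csv and the target column name salary are placeholders, not confirmed names from the dataset, and the "greater than 0.2" check is interpreted here as a predictor being correlated with at least one other predictor; adjust these details to match your data and your reading of step 4.

from itertools import combinations

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Steps 1-2: read the data (file name is an assumption; see Stage 1).
df = pd.read_csv("data.csv")

# Step 3: correlation matrix for the numeric variables.
corr_matrix = df.select_dtypes("number").corr()

# Step 4: predictors whose absolute correlation with another predictor exceeds 0.2.
pred_corr = corr_matrix.drop(index="salary", columns="salary").abs()
high_corr = [col for col in pred_corr.columns
             if (pred_corr[col].drop(col) > 0.2).any()]

# Step 5: predictors and target. If non-numeric columns remain from earlier
# stages, handle them as you did there.
X = df.drop(columns="salary")
y = df["salary"]

# Step 6: train/test split with the required parameters.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

# Step 7: drop each highly correlated variable on its own, then every pair
# (3 single drops + 3 pair drops = 6 candidate models).
subsets = [list(c) for r in (1, 2) for c in combinations(high_corr, r)]

# Step 8: fit each candidate model, collect the MAPE scores, print the best one.
mape_scores = []
for drop_cols in subsets:
    model = LinearRegression().fit(X_train.drop(columns=drop_cols), y_train)
    preds = model.predict(X_test.drop(columns=drop_cols))
    mape_scores.append(mean_absolute_percentage_error(y_test, preds))

print(round(min(mape_scores), 5))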

Example

Example 1: program output

1.11223