Description
If your linear regression includes many variables, some of them may be correlated with one another, which can degrade the model's performance. A crucial step is to check the model for multicollinearity and exclude the variables that correlate strongly with other predictors. Carry out this check, find the best model by removing the highly correlated variables, and return its MAPE score.
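As a minimal illustration of the check described above, the correlation matrix reveals pairs of predictors that move together. The data and column names below are hypothetical, not the course dataset:

```python
import pandas as pd

# Hypothetical data: "age" rises almost linearly with "experience",
# while "rating" is unrelated to both.
df = pd.DataFrame({
    "experience": [1, 3, 5, 7, 9, 11],
    "age":        [23, 26, 28, 31, 34, 36],
    "rating":     [4, 2, 5, 1, 3, 4],
})

# Off-diagonal values close to 1 (or -1) signal multicollinearity.
print(df.corr().round(2))
```

Here `experience` and `age` show a near-perfect correlation, so one of them is a candidate for removal.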
Objectives
- Read the data. For downloading the dataset, refer to Stage 1;
- Load the data with `pandas.read_csv`;
- Calculate the correlation matrix for the numeric variables;
- Find the variables where the correlation coefficient is greater than 0.2. Hint: there should be three of them;
- Make `X`, a DataFrame with all the predictor variables, and `y`, a series with the target;
- Split the predictors and the target into training and test sets. Use `test_size=0.3` and `random_state=100`; they guarantee that the results will be as expected;
- Fit the linear models for salary prediction based on subsets of the other variables. The subsets are as follows:
    - First, try to remove each of the three variables you found in step 4.
    - Second, remove each possible pair of these three variables. For example, if your three variables are `a`, `b`, and `c`, first fit a model with `a` removed, then a model without `b`, and then a model without `c`. After that, estimate the model without both `a` and `b`, then without both `b` and `c`, and finally without both `a` and `c`. As a result, you will have six models to choose the best from.
- Make predictions and print the lowest MAPE. The MAPE is a floating-point number rounded to five decimal places.
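The steps above can be sketched end to end. Since the dataset comes from Stage 1, this sketch substitutes synthetic data with hypothetical column names (`experience`, `age`, `years_company`, `rating`, `salary`); it shows the shape of the six-model search, not the exact course solution or its output:

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the salary dataset; all column names are illustrative.
rng = np.random.default_rng(100)
n = 300
experience = rng.uniform(0, 20, n)
age = experience + rng.normal(25, 2, n)                 # correlated with experience
years_company = 0.8 * experience + rng.normal(0, 2, n)  # also correlated
rating = rng.uniform(1, 5, n)                           # roughly independent
salary = 20000 + 1000 * experience + 500 * rating + rng.normal(0, 1000, n)
df = pd.DataFrame({
    "experience": experience, "age": age,
    "years_company": years_company, "rating": rating,
    "salary": salary,
})

X, y = df.drop(columns="salary"), df["salary"]

# Predictors whose absolute correlation with another predictor exceeds 0.2
corr = X.corr()
high = [c for c in corr.columns if (corr[c].drop(c).abs() > 0.2).any()]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100)

# One model per single high-correlation variable removed, one per pair removed
scores = {}
for r in (1, 2):
    for dropped in combinations(high, r):
        cols = [c for c in X.columns if c not in dropped]
        model = LinearRegression().fit(X_train[cols], y_train)
        pred = model.predict(X_test[cols])
        scores[dropped] = mean_absolute_percentage_error(y_test, pred)

print(round(min(scores.values()), 5))
```

With three correlated variables found, the loop fits exactly six models (three singles plus three pairs) and reports the best MAPE.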
Example
Example 1: program output
1.11223