Description
In the final stage, you need to improve model performance by tuning the hyperparameters. There is no need to do it manually, as sklearn provides convenient tools for this task. We advise using `GridSearchCV`: it searches through the specified parameter values for an estimator. Essentially, it takes `estimator`, `param_grid`, and `scoring` as arguments; you can read about them in the documentation. As a starting point, you will be given a list of parameter values so that you can find a better set than the default one.
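To make these three arguments concrete, here is a minimal sketch of the interface. The estimator and grid below are placeholders for illustration only; the grids you actually need are listed in the Objectives.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# A toy grid just to show the three main arguments; the grids
# required by this stage are listed in the Objectives below.
search = GridSearchCV(
    estimator=KNeighborsClassifier(),    # the model to tune
    param_grid={'n_neighbors': [3, 5]},  # parameter values to try
    scoring='accuracy',                  # metric used to compare them
)
```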
We urge you to try more parameter values to improve the result. The test system has minimum requirements on the algorithms and their accuracies: it accepts only the two algorithms (K-nearest Neighbors and Random Forest) that performed best in the previous stage, and it requires scores that can be achieved by finding the best set of parameters from the lists below.
Are you the one with the highest accuracy? Share your scores and the most efficient sets of parameters in the comments!
Objectives
- Choose the data representation that performed best: decide between the initial dataset and the one with normalized features. Take only the two models with the highest accuracy;
- Initialize `GridSearchCV(estimator=..., param_grid=..., scoring='accuracy', n_jobs=-1)` to search over the following parameters:
  - For the K-nearest Neighbors classifier: `{'n_neighbors': [3, 4], 'weights': ['uniform', 'distance'], 'algorithm': ['auto', 'brute']}`
  - For the Random Forest classifier: `{'n_estimators': [300, 500], 'max_features': ['sqrt', 'log2'], 'class_weight': ['balanced', 'balanced_subsample']}`. Don't forget to set `random_state=40` in the Random Forest classifier!

  Note that the `n_jobs` parameter is responsible for the number of jobs run in parallel. Set `n_jobs=-1` to use all processors;
- Run the `fit` method of `GridSearchCV`. Use the train set only. Since a number of models must be trained to compare their performance (one algorithm with one set of parameter values counts as one model), this step will take about 30 minutes;
- Print the best set of parameters for both algorithms. You can get this information from the `best_estimator_` attribute of each algorithm's `GridSearchCV` instance. Evaluate the two best estimators on the test set and print their accuracies, as shown in the sketch after this list.
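Putting the objectives together, a possible implementation might look like the sketch below. It uses `load_digits` with a simple `train_test_split` only as a stand-in so that the example runs on its own; in the project, use the train and test sets you prepared in the previous stages instead.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data so the sketch is runnable; replace with the
# train/test sets from the previous stages of the project.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=40)

# Parameter grids from the objectives.
knn_params = {
    'n_neighbors': [3, 4],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'brute'],
}
rf_params = {
    'n_estimators': [300, 500],
    'max_features': ['sqrt', 'log2'],
    'class_weight': ['balanced', 'balanced_subsample'],
}

# One GridSearchCV instance per algorithm; n_jobs=-1 uses all processors.
knn_search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid=knn_params,
    scoring='accuracy',
    n_jobs=-1,
)
rf_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=40),
    param_grid=rf_params,
    scoring='accuracy',
    n_jobs=-1,
)

# Fit on the train set only. Every parameter combination is trained
# and cross-validated, which is why this step is slow.
knn_search.fit(X_train, y_train)
rf_search.fit(X_train, y_train)

# Report each best estimator and its accuracy on the test set.
print('K-nearest neighbours algorithm')
print('best estimator:', knn_search.best_estimator_)
print('accuracy:', knn_search.best_estimator_.score(X_test, y_test))
print()
print('Random forest algorithm')
print('best estimator:', rf_search.best_estimator_)
print('accuracy:', rf_search.best_estimator_.score(X_test, y_test))
```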
The input includes the train and test sets. The output consists of two classifiers that perform better than the ones from the previous stage. Provide the answer in the format shown below. The values are given for reference only; the actual numbers may differ.
Example
Example 1: an example of your output
```
K-nearest neighbours algorithm
best estimator: KNeighborsClassifier(n_neighbors=5, weights='uniform')
accuracy: 1

Random forest algorithm
best estimator: RandomForestClassifier(class_weight=None, max_features='sqrt',
                                       n_estimators=100, random_state=40)
accuracy: 1
```