Classification of Handwritten Digits. Stage 5/5

Hyperparameter tuning


Description

In the final stage, you need to improve model performance by tuning the hyperparameters. There is no need to do it manually: sklearn provides convenient tools for this task. We advise using GridSearchCV. It exhaustively searches through the specified parameter values for an estimator. Basically, it takes estimator, param_grid, and scoring as arguments. You can read about them in the documentation. As a starting point, you will be provided with a list of parameters to help you find a better set than the default one.
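The basic GridSearchCV call pattern can be sketched as follows. This is only an illustration of the three main arguments; the estimator and parameter grid here are placeholders, not the ones required by this stage, and load_digits stands in for your dataset:

```python
# Minimal GridSearchCV sketch: estimator, param_grid, and scoring.
# The grid below is illustrative, not the one required by this stage.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={'n_neighbors': [3, 5]},  # parameter values to try
    scoring='accuracy',
)
search.fit(X, y)  # trains one model per parameter combination (with cross-validation)
print(search.best_params_)
```

After fitting, the instance exposes the winning combination in `best_params_` and the refitted model in `best_estimator_`.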

We urge you to try more parameter values to improve the result. The test system has minimum requirements for the algorithms and their accuracies: it accepts only the two algorithms (K-nearest Neighbors and Random Forest) that performed best in the previous stage. As for the scores, the test system requires values that can be achieved by finding the best set of parameters from the lists below.

Are you the one with the highest accuracy? Share your scores and the most efficient sets of parameters in the comments!

Objectives

  1. Choose the data representation that performed the best. You need to choose between the initial dataset and the one with normalized features. Also, take only the two models with the highest accuracy;
  2. Initialize GridSearchCV(estimator=..., param_grid=..., scoring='accuracy', n_jobs=-1) to search over the following parameters:
    • For the K-nearest Neighbors classifier: {'n_neighbors': [3, 4], 'weights': ['uniform', 'distance'], 'algorithm': ['auto', 'brute']}
    • For the Random Forest classifier: {'n_estimators': [300, 500], 'max_features': ['sqrt', 'log2'], 'class_weight': ['balanced', 'balanced_subsample']}. Don't forget to set random_state=40 in the Random Forest classifier!

    Note that the n_jobs parameter is responsible for the number of jobs that will be run in parallel. Set n_jobs=-1 to use all processors;

  3. Run the fit method of GridSearchCV. Use the train set only. Since a number of models (each combination of an algorithm and a set of parameter values is one model) must be trained to compare the performances, this step will take about 30 minutes;
  4. Print the best set of parameters for both algorithms. You can get this information from the best_estimator_ attribute of each algorithm's GridSearchCV instance. Evaluate the two best estimators on the test set and print their accuracies.
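The steps above can be sketched as one loop over the two grids. This is a sketch under assumptions: x_train, x_test, y_train, y_test are produced here by a train_test_split on load_digits purely for illustration, whereas in the project they come from your chosen data representation, and the split parameters are placeholders:

```python
# Sketch of the tuning workflow for this stage. The train/test split
# below is illustrative; use your own data representation from step 1.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=40)

grids = {
    'K-nearest neighbours algorithm': (
        KNeighborsClassifier(),
        {'n_neighbors': [3, 4],
         'weights': ['uniform', 'distance'],
         'algorithm': ['auto', 'brute']}),
    'Random forest algorithm': (
        RandomForestClassifier(random_state=40),  # don't forget random_state=40
        {'n_estimators': [300, 500],
         'max_features': ['sqrt', 'log2'],
         'class_weight': ['balanced', 'balanced_subsample']}),
}

for name, (estimator, param_grid) in grids.items():
    search = GridSearchCV(estimator=estimator, param_grid=param_grid,
                          scoring='accuracy', n_jobs=-1)
    search.fit(x_train, y_train)       # fit on the train set only
    best = search.best_estimator_      # refitted on the whole train set
    accuracy = accuracy_score(y_test, best.predict(x_test))
    print(name)
    print(f'best estimator: {best}')
    print(f'accuracy: {accuracy}\n')
```

Note that GridSearchCV refits the best parameter combination on the full train set by default (refit=True), so best_estimator_ can be used for prediction directly.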

The input includes the train and test sets. The output consists of two classifiers that perform better than the ones from the previous stage. Provide the answer in the format shown below. The values are given for reference only, the actual numbers may be different.

Example

Example 1: an example of your output

K-nearest neighbours algorithm
best estimator: KNeighborsClassifier(n_neighbors=5, weights='uniform')
accuracy: 1

Random forest algorithm
best estimator: RandomForestClassifier(class_weight=None, max_features='sqrt',
                       n_estimators=100, random_state=40)
accuracy: 1