Classification of Handwritten Digits. Stage 2/5

Split into sets

Description

At this stage, you need to use sklearn to split your data into train and test sets. We will use only a portion of the dataset so that model training runs faster in the next stage.

It is crucial to keep the training set balanced after splitting: if some digits have too few training samples, the model will learn to recognize them poorly, which lowers its overall quality. We will check this point as well.

Objectives

  1. Import the necessary tool from sklearn;
  2. Use the first 6000 rows of the dataset. Set the test set size to 0.3 and the random seed to 40 to make your output reproducible;
  3. Print the shapes of the new datasets;
  4. Make sure that the dataset is balanced after splitting. Print the proportions of samples per class in the training set as in the example below. We recommend using pd.Series.value_counts(normalize=True).

The train-test splitting function takes two arrays as input: features and target. It returns four arrays, which you should store in the following variables: x_train (the features' train set), x_test (the features' test set), y_train (the target variable for the train set), and y_test (the target variable for the test set).
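
As a rough illustration of the split described above, here is a minimal sketch that assumes the tool in question is train_test_split from sklearn.model_selection. The arrays features and target are synthetic placeholders generated with NumPy, not the project's dataset, and their names and sizes are chosen only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for this sketch, not the real dataset):
# 7000 flattened 28x28 "images" and one digit label (0-9) per image.
rng = np.random.default_rng(0)
features = rng.integers(0, 256, size=(7000, 784))
target = rng.integers(0, 10, size=7000)

# Keep only the first 6000 rows, then hold out 30% of them for testing.
# random_state fixes the shuffle so the split is reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    features[:6000], target[:6000], test_size=0.3, random_state=40
)

print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
```

With 6000 rows and a test size of 0.3, the test arrays hold 1800 rows and the train arrays hold the remaining 4200.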

Provide the answers in the format shown below.

Example

Example 1: an example of the output

x_train shape: (100, 5)
x_test shape: (20, 5)
y_train shape: (100,)
y_test shape: (20,)
Proportion of samples per class in train set:
1    0.05
7    0.30
4    0.25
6    0.04
0    0.08
9    0.01
3    0.13
2    0.10
8    0.01
5    0.03
dtype: float64
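
The proportions in an output like the one above can be produced with pd.Series.value_counts(normalize=True). The sketch below uses a small hand-made label series as a stand-in for y_train; the values are made up for illustration only.

```python
import pandas as pd

# Hypothetical labels standing in for y_train (ten samples, three classes).
y_train = pd.Series([0, 1, 1, 2, 2, 2, 2, 1, 0, 2])

# normalize=True returns each class's share of the samples instead of raw counts,
# sorted from the most frequent class to the least frequent one.
proportions = y_train.value_counts(normalize=True)

print("Proportion of samples per class in train set:")
print(proportions)
```

Here class 2 accounts for half of the samples, class 1 for 0.3, and class 0 for 0.2. The exact footer line that pandas prints under the values may differ between pandas versions.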