Description
At this stage, you need to use sklearn to split your data into train and test sets. We will use only a portion of the dataset to speed up model training in the next stage.
It is crucial to keep the training set balanced after splitting: if some digits have too few training samples, the model may recognize them poorly. We will check this as well.
Objectives
- Import a necessary tool from sklearn;
- Use the first 6000 rows of the datasets. Set the test set size as 0.3 and the random seed of 40 to make your output reproducible;
- Print new datasets' shapes;
- Let's make sure that our dataset is balanced after splitting. Print the proportions of samples per class in the training set as in the example below. We recommend using pd.Series.value_counts(normalize=True).
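The class-proportion check recommended in the last objective can be sketched on a toy label series. This is only an illustration: the `labels` values are made up and are not the real digit targets.

```python
import pandas as pd

# Hypothetical toy labels standing in for the training targets
labels = pd.Series([0, 1, 1, 2, 2, 2, 2, 1])

# normalize=True returns each class's share of the total instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions)
```

The resulting Series is indexed by class label and its values sum to 1, so a class with too few samples is easy to spot.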
The train-test splitting function takes two arrays as input: features and target. It returns four arrays, which should be stored in the following variables: x_train (the features' train set), x_test (the features' test set), y_train (the target's train set), y_test (the target's test set).
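A minimal sketch of such a split, assuming the tool in question is `train_test_split` from `sklearn.model_selection`. The `features` and `target` arrays here are synthetic stand-ins, not the task's dataset. In the actual task, both arrays would first be limited to their first 6000 rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration):
# 100 samples with 5 features each, and labels in the range 0..9
rng = np.random.default_rng(0)
features = rng.random((100, 5))
target = rng.integers(0, 10, size=100)

# test_size is the fraction held out for testing;
# random_state fixes the shuffle so the split is reproducible
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=40
)

print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
```

The function returns the pieces in the order features-train, features-test, target-train, target-test, so the unpacking order above matters.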
Provide your answers in the format shown below.
Example
Example 1: an example of the output
x_train shape: (100, 5)
x_test shape: (20, 5)
y_train shape: (100,)
y_test shape: (20,)
Proportion of samples per class in train set:
1 0.05
7 0.30
4 0.25
6 0.04
0 0.08
9 0.01
3 0.13
2 0.10
8 0.01
5 0.03
dtype: float64