The aim of machine learning is to make predictions about new data based on the data we already have. When building a model, we want to be sure that the final solution will perform well on new data samples: only then can we trust its predictions and make decisions based on them. Train-test splitting is a vital step toward satisfying this requirement.
Definition of sets
The names train and test sets speak for themselves. The first one is used to train the model, the second one to test it. In more detail, an algorithm learns the coefficients and rules for predicting the outcome from the training dataset. When the fitting process is completed, we want to evaluate how good our model is. It doesn't make sense to evaluate the algorithm on the same dataset: a model can simply memorize all the outcomes and always be 100% accurate. If we want a realistic assessment, we need new data. Here comes the test dataset, which the model didn't "see" during training, making the evaluation relatively objective.
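To make this concrete, here is a minimal sketch (the decision tree is an illustrative choice, and the split itself uses the sklearn tool covered later in this topic). The exact scores will vary, but the gap between them is the point:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(model.score(X_train, y_train))  # typically 1.0: the tree has memorized the training data
print(model.score(X_test, y_test))    # lower: a more honest estimate of performance on new data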
Place in ML pipeline
In the picture, you see that the train-test split is a part of the model selection stage. We also have a validation set, which is held out to evaluate model fitting with different hyperparameters (the validation set could be thought of as a further split of the train set into two). Hyperparameters refer to the external settings that are specified before training to control the model's behavior (for example, in random forest, the number of trees to include is a hyperparameter that has to be chosen).
Note that the back-and-forth arrows between "fit" and "evaluate" show the iterative nature of the training process: you set the model's hyperparameters, fit the model, and evaluate the predictions on the validation set. Then you select other hyperparameters, train the model again, and evaluate on the validation set to check whether the new setting works better. The final evaluation is done only once, on the test set.
We can't use the test set to choose between hyperparameters because it will result in data leakage: the model will indirectly learn information from the test set, so its measured performance will be overly optimistic while its prediction ability on new, previously unseen data suffers. Thus, it's always a good idea to have at least three sets: train, validation, and test, unless you are dealing with extremely low data availability. Also, one could use a technique such as cross-validation to mitigate such leakage; it will be discussed in the upcoming topics.
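As a rough sketch of this workflow (the random forest and the candidate values for n_estimators are arbitrary illustrative choices), hyperparameter tuning with a validation set could look like this:

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Hold out the test set once; it is not touched until the very end.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Split the remaining data again to get a validation set.
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

best_score, best_n = 0, None
for n_trees in (10, 50, 100):                      # candidate hyperparameter values
    model = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    model.fit(X_fit, y_fit)                        # fit
    score = model.score(X_val, y_val)              # evaluate on the validation set
    if score > best_score:
        best_score, best_n = score, n_trees

# Final, one-time evaluation of the chosen setting on the test set.
final_model = RandomForestClassifier(n_estimators=best_n, random_state=42).fit(X_fit, y_fit)
print(final_model.score(X_test, y_test))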
In case you are not performing any hyperparameter tuning, a simple train-test split will be enough.
sklearn tool
Beloved sklearn has a tool to make such a split. Firstly, we will load a toy dataset.
from sklearn.datasets import load_wine
data = load_wine(as_frame=True)["frame"]
X, y = data.iloc[:, :-1], data["target"]
Now X is a DataFrame of features and y is the target variable. Secondly, we make a split using train_test_split(), imported from the sklearn.model_selection module:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
The function divides each of the given arrays/DataFrames/lists into two parts, so it returns twice as many objects as were passed in the arguments.
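A quick sanity check of the result (the row counts in the comments correspond to the 178-row wine dataset loaded above):

print(X_train.shape, X_test.shape)   # (142, 13) (36, 13)
print(y_train.shape, y_test.shape)   # (142,) (36,)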
Let's discuss available parameters:
- *arrays are the arrays to be split. One can pass multiple array-like structures; the accepted types are lists, numpy arrays, scipy.sparse matrices, and pandas DataFrames. In our case, we pass only X and y. The code below will also work: train_test_split(X, y, data, train_size=0.8, random_state=42)
- train_size is the proportion of each array to mark as the train set.
- test_size is the other way around: it sets the proportion of each array to mark as the test set. You choose one of these two parameters; their sum always equals 1. In our case, the dataset has 178 rows and train_size=0.8. Since 178 × 0.8 = 142.4, the train set consists of 142 rows and the test one consists of 36 rows.
- random_state controls the random shuffling of the rows before the split. Pass any integer if you need a reproducible output (see the sketch after this list).
- shuffle is True by default and controls whether or not to shuffle the data before splitting.
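Here is the sketch of random_state mentioned above: the same seed reproduces exactly the same split, while a different seed almost certainly produces a different one.

first = train_test_split(X, y, train_size=0.8, random_state=42)
second = train_test_split(X, y, train_size=0.8, random_state=42)
print(first[0].equals(second[0]))      # True: the same seed gives exactly the same train set

different = train_test_split(X, y, train_size=0.8, random_state=0)
print(first[0].equals(different[0]))   # almost certainly False: another seed shuffles differently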
Lastly, there is another quite important parameter, stratify, which ensures that the train and test sets are representative of the class distribution in the original dataset; this becomes crucial when dealing with an unequal class distribution. Without setting stratify (None by default) to the array of labels, you might end up in a scenario where certain classes are present only in the train set or only in the test set. The following line ensures that both the train and the test sets have the same ratio of classes as the original set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
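One quick way to see what stratify=y does is to compare the class proportions of the full target with those of the two splits; with stratification they match up to rounding:

print(y.value_counts(normalize=True).round(2))        # class proportions in the whole dataset
print(y_train.value_counts(normalize=True).round(2))  # roughly the same proportions in the train set
print(y_test.value_counts(normalize=True).round(2))   # and in the test set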
A note on fit_transform()
.fit(), .fit_transform(), and .transform() are methods of the estimator API in sklearn. An estimator is fitted with .fit() only on the train set, while .transform() can be applied to both the train and the test sets. .fit_transform() is an optimized combination of the two methods, equivalent to fit(X_train).transform(X_train).
.transform() is used to transform the input data. Once the .fit() method has carried out the fitting procedure, .transform() applies the learned transformation to a given dataset. This could involve feature scaling operations, such as standardization or normalization, or dimensionality reduction.
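For instance, a minimal sketch with a standard scaler: the mean and standard deviation are learned from the train set only and then reused on the test set.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from the train set and scale it
X_test_scaled = scaler.transform(X_test)         # reuse the train statistics to scale the test set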
The proportion of train / test data
You may wonder if there is a rule for setting the ratio of the train set to the test set. The truth is that there is no strict rule, only the intuition that you need more data to learn from than to test on. The train set usually makes up 60-80% of the data. In fact, the smaller the dataset, the larger the proportion of data the model should be trained on.
Conclusion
This topic covers the basics of dividing the data into a set the model will be trained on and a set it will be tested on. Recall that you make the split only once to preserve equal conditions for all the algorithms you're going to try. Use the train_test_split() function to make the split in a single line of code.
Let's practice!