Previously, we looked through the common ML pipeline. To recall, in an ML project we typically go through the following stages: data collection and preprocessing, exploratory data analysis (EDA), modeling, and deployment. In this topic, we will focus on modeling. You will learn the basic principles of training and using ML models with the scikit-learn library.
Basic setup
scikit-learn has lots of ML models to offer. The process of setting up and training them is the same for all:
- Import the model class, create an instance of it, and adjust the model parameters,
- Load the dataset and fit the model to it, in other words, train the model on the dataset,
- Use the model to predict values.
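These three steps can be sketched end-to-end. The snippet below uses DummyRegressor (introduced just below) on a tiny made-up dataset, purely for illustration:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Step 1: import the model class and create an instance with chosen parameters
model = DummyRegressor(strategy='mean')

# Step 2: fit the model to the training data (X: features, y: targets)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])
model.fit(X, y)

# Step 3: use the trained model to predict values
print(model.predict([[5.0]]))  # the mean of y: [25.]
```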
Let's look at the DummyRegressor model. The DummyRegressor is the simplest prediction model that can predict the mean, the median, the quantile of the target values of the training dataset, or the constant value that has been passed by a user. In real-world tasks, it serves merely as a basis for comparison with other models.
The process of working with a certain model in scikit-learn starts with importing the class:
from sklearn.dummy import DummyRegressor
Then, we create an instance of the class and specify its parameters; each model in scikit-learn has its own set of parameters. For DummyRegressor, we can specify the strategy used to predict: the mean, the median, a quantile, or a constant value. If nothing is specified, the model falls back to its default parameters (here, the default strategy is 'mean').
dummy_regressor = DummyRegressor(strategy='median')
Fitting the model
Now, we need to load the data and train the model on it. There are several formats in which we can load the data: NumPy arrays, SciPy matrices, and other data types that can be converted to a numeric format, such as Pandas DataFrames. There are also a number of built-in datasets in the scikit-learn library. We will use one of them, the California house prices dataset. It consists of 20640 rows and 9 columns describing California housing districts and their median house prices.
We will use the data stored in these 9 columns for the training. We need to transform it into the training data (usually designated as X) and its training labels (designated as y). During the training, the data stored in y becomes our target that we want to predict based on the information stored in X.
In this dataset, X contains 8 features, such as the house age, the number of rooms, the median income in the block, etc. The target array y represents one value per observation: the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000). To load the dataset, we use the fetch_california_housing() method and set the return_X_y parameter to True, so that the dataset is returned divided into the feature matrix X and the training labels y.
from sklearn import datasets
X, y = datasets.fetch_california_housing(return_X_y=True)
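The return_X_y parameter works the same way for all built-in dataset loaders. If you want to try it without the download that fetch_california_housing() triggers on first use, the bundled diabetes dataset (an assumption here — it is not used elsewhere in this topic) behaves identically:

```python
from sklearn import datasets

# return_X_y=True returns a (features, targets) tuple instead of a Bunch object
X_d, y_d = datasets.load_diabetes(return_X_y=True)
print(X_d.shape)  # (442, 10): 442 samples, 10 features
print(y_d.shape)  # (442,): one target value per sample
```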
Now, we need to fit the model to our dataset. To fit a model is to apply it to data so that the model learns patterns from it. The trained model is then typically applied to new observations similar to those in the training data. As a result of the training, the model should successfully describe relationships and patterns in the data.
scikit-learn provides the fit() method. For a supervised model, the method requires two arguments: the feature matrix and the target array. For an unsupervised model, the training data alone is enough: in unsupervised learning, the data needs no labels, since the model doesn't use them during training. Here, we pass the variables X and y as arguments to the method:
dummy_regressor.fit(X, y)
# DummyRegressor(strategy='median')
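For contrast, an unsupervised estimator's fit() takes only the feature matrix. A minimal sketch with KMeans (a clustering model not otherwise used in this topic, with made-up points):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of one-dimensional points
X_clusters = np.array([[0.0], [0.1], [10.0], [10.1]])

# Unsupervised: fit() needs only the features, no target array
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_clusters)
print(kmeans.labels_)  # each point assigned to one of the two clusters
```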
Mind that fit() returns the DummyRegressor object itself; its string representation shows the specified parameters. A scikit-learn object trained with fit() gains a number of model-specific attributes whose names end with an underscore. For example, our model has 2 such attributes: .constant_ (the mean, median, or quantile of the training targets, or the user-supplied constant) and .n_outputs_ (the number of outputs). Let's call one of them:
dummy_regressor.n_outputs_
# 1
Prediction
Once we have trained the model, we can use it to predict values for future observations. Usually, to check how well the model performs, the initial data is divided into two parts: the training set and the test set. We use the test set to compare the predicted values with the actual target values. We will discuss this more thoroughly in the following topics and hence won't follow this approach here.
To apply the trained model to a dataset, we need the predict() method. By calling the method, we get the predicted target values based on the training data. For our training data, stored in X, we get the following results:
dummy_regressor.predict(X)
# array([1.797, 1.797, 1.797, ..., 1.797, 1.797, 1.797])
So, the predicted median house value is 1.797, that is, $179,700 (recall that the target is expressed in hundreds of thousands of dollars).
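You can verify this behavior without the full dataset: the predictions are simply the median of the training targets, repeated for every input row. A small check with made-up numbers:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one extreme value

median_model = DummyRegressor(strategy='median').fit(X_toy, y_toy)

# The features are ignored: every prediction is the median of y_toy
print(median_model.predict(X_toy))  # [3. 3. 3. 3. 3.]
print(np.median(y_toy))             # 3.0
```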
We could also apply the trained model to some new data, though in the case of the DummyRegressor it doesn't make sense: this particular model will always return the predicted values for the dataset it has been trained on. Even though the DummyRegressor is never used in real tasks for any other purpose than comparison, it is a valid example of the use of the fit() and predict() methods.
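To recap the strategies mentioned earlier, here is how they differ on one small made-up target array (the features are ignored by DummyRegressor either way):

```python
import numpy as np
from sklearn.dummy import DummyRegressor

X_demo = np.zeros((5, 1))                      # dummy features, ignored by the model
y_demo = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

for strategy_kwargs in [
    {'strategy': 'mean'},                       # predicts the mean: 4.0
    {'strategy': 'median'},                     # predicts the median: 3.0
    {'strategy': 'quantile', 'quantile': 0.25}, # predicts the 25% quantile
    {'strategy': 'constant', 'constant': 7.0},  # predicts the given constant: 7.0
]:
    reg = DummyRegressor(**strategy_kwargs).fit(X_demo, y_demo)
    print(strategy_kwargs, reg.predict([[0.0]])[0])
```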
Conclusion
In this topic, we went through the main stages of adjusting, training, and using an ML model in scikit-learn. Let's revise what we learned:
- The process of working with a model starts with importing the class with the corresponding name, creating an instance of that class, and specifying the required parameters. The parameters are specific to each model.
- Next, we load the dataset on which we want to train our model. Most often, it is passed as NumPy arrays, SciPy matrices, or Pandas DataFrames.
- The training process is done with the aid of the fit() method.
- After training, we can use the model to predict values in future observations: in scikit-learn, there is the predict() method for this purpose.