
Polynomial Regression with scikit-learn


You're already familiar with the linear regression method, but often there is no linear dependency between variables. This is where the polynomial regression method comes in handy. It is also useful when there are interactions between variables: for example, two risk factors for a disease can affect each other and worsen the prognosis. That's why polynomial regression is applied to make predictions in complex systems in medicine, biology, or economics.

Polynomial models can be formulated like this:

y = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \dots + \alpha_n x^n

While fitting the model, you find the optimal coefficients \alpha_0, \dots, \alpha_n that minimize a function called MSE:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \rightarrow \underset{\alpha_0, \dots, \alpha_n}{\min}

MSE is the average squared distance between the model's predictions and the real values. Minimizing it tends to give more accurate predictions, but we must be cautious about overfitting. Because the errors are squared, MSE is measured in squared units (squared dollars, for instance), so we often use RMSE instead:

RMSE = \sqrt{MSE}
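To make the formulas concrete, here is a minimal NumPy sketch with made-up numbers (not the article's data):

import numpy as np

y_true = np.array([10.0, 20.0, 30.0])    # real values y_i
y_pred = np.array([12.0, 18.0, 33.0])    # model predictions
mse = np.mean((y_true - y_pred) ** 2)    # average squared error
rmse = np.sqrt(mse)                      # back to the original units
print(mse, rmse)                         # about 5.67 and 2.38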

Growing bacteria: an example

For this topic, we'll consider an example from biology. Imagine that you're a microbiologist growing some bacteria colonies in fancy Petri dishes. You want to know how many cells you'll have in future experiments. You have historical data about the concentration of sugar, time of growth, and the number of cells.

Scatter plot of the original dataset illustrating dependencies between cell number (Y axis) and time or sugar respectively (X axis).

So, you want to predict the number of cells based on the parameters given above. As you can see from the plots, the dependence between each feature and the target doesn't look like a straight line, so a linear regression model won't give you good results.

Let's split our task into two steps: first, we'll make predictions based on the growing time, and then we'll consider both features.

Parameters of the model

Before setting up a polynomial regression model, we need to determine its hyperparameters:

  • include_bias (True or False) — if this parameter is True, the transformer adds a column of ones to the output, which corresponds to a non-zero intercept in the model.
  • order ('C' or 'F') — controls the memory layout of the output array and only comes into play when we're trying to make computations faster.
  • interaction_only (True or False) — if it is set to True, only interaction features built from distinct variables are created. In our example, that would be sugar*time, but not sugar*sugar or time*time. You can see the effect of this parameter in the short sketch after this list.
  • degree — you can pass a single integer (acting as the maximum degree) or a pair of integers (the minimum and maximum degrees of the polynomial). Usually, we choose the degree from prior knowledge (for example, when modeling a physical process with a known formula), or we test several degrees and pick the one with the best score. In our bacteria example, we can estimate the degree by looking at the plot.
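Here is a small sketch of how these parameters affect the generated columns. It assumes a scikit-learn version that provides get_feature_names_out; the numbers are arbitrary:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[1.0, 2.0]])  # one sample with features [time, sugar]

full = PolynomialFeatures(degree=2, include_bias=False)
full.fit(X_demo)
print(full.get_feature_names_out(['time', 'sugar']))
# ['time' 'sugar' 'time^2' 'time sugar' 'sugar^2']

inter = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
inter.fit(X_demo)
print(inter.get_feature_names_out(['time', 'sugar']))
# ['time' 'sugar' 'time sugar']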

If you set the degree too high, you'll fall into the overfitting trap, and your model's ability to predict target values will decrease. Remember that you can perfectly describe n points of data with a polynomial of degree n - 1.
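For instance, here is a quick sketch (with arbitrary numbers) showing that a degree-3 polynomial passes through 4 points almost exactly; this is the kind of "perfect" fit that generalizes poorly:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x_demo = np.array([[0.0], [1.0], [2.0], [3.0]])   # n = 4 points
y_demo = np.array([1.0, 0.0, 5.0, 2.0])           # arbitrary targets

demo_features = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x_demo)
demo_model = LinearRegression().fit(demo_features, y_demo)
print(demo_model.predict(demo_features))          # reproduces y_demo up to floating-point error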

Here you can see two models with predictions made on the same data. The first model was set with degree = 10. It illustrates overfitting in a polynomial regression model. The second one has degree = 2.

Well-fitted and overfitted Polynomial Regression models.

Considering a single feature

For the first task, let's make another dataset:

one_feature = data.drop('sugar', axis=1)

In this dataset, we have the feature X (time) and the target y (number of cells):

X = one_feature['time']
y = one_feature['cells']

Let's create our model. Firstly, we need to import PolynomialFeatures from the preprocessing module:

from sklearn.preprocessing import PolynomialFeatures

As we saw in the plots, our data points form a parabola, so let's set the degree equal to 2. We also assume that at the start there were no cells in the dish, so we don't need a bias term and set include_bias to False. We initialize the transformer with these parameters:

poly = PolynomialFeatures(degree=2, include_bias=False)

Now that we've initialized the transformer, let's transform the features. After the .fit_transform() method, we'll get the columns x (the growth time) and x^2:

transformed_features = poly.fit_transform(X.values.reshape(-1, 1))

x          x^2
1.788937   3.200295
2.803923   7.861982
5.415256   29.325000
5.269010   27.762469

The .fit_transform() method requires a 2D array, so we convert X to a NumPy array and reshape it with the .reshape() method.
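An equivalent option, if you prefer to skip the reshaping, is to select the column with double brackets so that it stays a one-column (and therefore 2D) DataFrame:

transformed_features = poly.fit_transform(one_feature[['time']])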

Let's split our dataset into train and test groups.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_features, y, 
                                                   train_size=0.75, random_state=42)

Making predictions

Now we are ready to fit our model with the transformed data and predict target values. Note that polynomial regression is essentially a linear model: if you treat x as x_1, x^2 as x_2, and so on, the model is linear in these new features.

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
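As a side note, the transformer and the regressor can also be chained into a single estimator with a Pipeline, so the transformation and fitting happen in one call. Here is a small sketch, assuming we start again from the raw (untransformed) time column:

from sklearn.pipeline import make_pipeline

raw = one_feature[['time']]
raw_train, raw_test, y_train_raw, y_test_raw = train_test_split(
    raw, y, train_size=0.75, random_state=42)

pipe = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                     LinearRegression())
pipe.fit(raw_train, y_train_raw)        # transforms and fits in one call
y_pred_pipe = pipe.predict(raw_test)

In the rest of this topic, though, we'll stick with the explicitly transformed features from above.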

We can visualize the results with the following code:

import matplotlib.pyplot as plt

plt.figure(figsize=(9, 6))
plt.title("Polynomial regression")
plt.scatter(X_test[:, 0], y_test)
# sort the points by time so the curve is drawn from left to right
xs, ys = zip(*sorted(zip(X_test[:, 0], y_pred)))
plt.plot(xs, ys, c="green")
plt.xlabel('time')
plt.ylabel('cells')
plt.show()

Plot of the original data distribution and Polynomial Regression model.

Let's calculate RMSE for our model:

from sklearn.metrics import mean_squared_error

print(mean_squared_error(y_test, y_pred, squared=False))

It is equal to 18.42. With a linear regression model, we'd have RMSE equal to 31.37.
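If you want to reproduce that comparison, one possible sketch is to fit a plain LinearRegression on the time column alone (the first column of the transformed features), without the squared term; the exact value depends on the dataset:

linear_model = LinearRegression()
linear_model.fit(X_train[:, :1], y_train)
linear_pred = linear_model.predict(X_test[:, :1])
print(mean_squared_error(y_test, linear_pred, squared=False))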

Considering all features

Now let's move on to the second step and look at the case with two features. The algorithm is the same.

If you're working with two or more features, you may consider data standardization; otherwise, regression models will pay more attention to the features with larger values. In our case, the values are comparable, so their impact on the result will be similar.
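For illustration, here is a minimal sketch of what standardization looks like; the array is made up, and our data doesn't need this step:

import numpy as np
from sklearn.preprocessing import StandardScaler

demo = np.array([[1.0, 1000.0],
                 [2.0, 3000.0],
                 [3.0, 2000.0]])                 # features on very different scales
scaled = StandardScaler().fit_transform(demo)
print(scaled.mean(axis=0), scaled.std(axis=0))   # roughly 0 and 1 for each column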

Let's set the features and the target dataframes to work with.

X = data.drop('cells', axis=1)
y = data['cells']

First, we need to initialize the transformer. Since we have multiple variables, we need to set the interaction_only parameter. For our model, it should be False because we also want to capture the dependence on time^2 and sugar^2.

poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
transformed_features = poly.fit_transform(X)

Here we can take a look at the first row and understand how transformation works.

Before the transformation:

time       sugar
1.788937   1.490142

After the transformation:

column    0          1          2          3            4
feature   time       sugar      time^2     time*sugar   sugar^2
value     1.788937   1.490142   3.200295   2.665771     2.220525

We have 3 new columns corresponding to the new features.
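You can also double-check the generated columns programmatically (assuming a scikit-learn version that provides get_feature_names_out):

print(poly.get_feature_names_out())   # e.g. ['time' 'sugar' 'time^2' 'time sugar' 'sugar^2']
print(transformed_features[0])        # the transformed first row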

So, let's split the new features into training and test sets again, train our model, and make predictions as we did in the previous steps.

X_train, X_test, y_train, y_test = train_test_split(transformed_features, y, 
                                                    train_size=0.75, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(mean_squared_error(y_test, y_pred, squared=False))

The RMSE for our model is equal to 3.16. If we tried to make predictions with a plain linear regression model instead, we'd get an RMSE of 27.98.

Conclusion

  • With a polynomial regression model, you can make predictions when there are non-linear dependencies between variables.
  • The polynomial regression pipeline consists of two parts. First, you transform the dataset and create new features with PolynomialFeatures(). Then, you fit a linear regression model on them.
  • To initialize PolynomialFeatures, you need to set the maximum polynomial degree, whether to include the bias column, and whether to keep only interaction terms between different variables.
  • Polynomial regression models tend to overfit, so be cautious when setting the maximum degree.