You're already familiar with the linear regression method, but often there is no linear dependency between variables. This is where the polynomial regression method comes in handy. It is also useful when you have interactions between variables: for example, two risk factors for a disease can affect each other and worsen the prognosis. That's why the polynomial regression method is applied to make predictions in complex systems in medicine, biology, or economics.
Polynomial models can be formulated like this:
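$$\hat{y} = w_0 + w_1 x + w_2 x^2 + \dots + w_d x^d$$

Here $x$ is a feature, $d$ is the degree of the polynomial, and $w_0, \dots, w_d$ are the coefficients the model learns from the data.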
While fitting the model, you find optimal coefficients to minimize the function called MSE:
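$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $n$ is the number of observations, $y_i$ are the real values, and $\hat{y}_i$ are the model's predictions.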
MSE is the average squared distance between the model's predictions and the real values. If we minimize it, we tend to get more accurate predictions, but we must be cautious about overfitting. Because of the squaring in the formula, MSE is expressed in squared units, like squared dollars, so we often use RMSE instead:
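$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$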
Growing bacteria: an example
For this topic, we'll consider an example from biology. Imagine that you're a microbiologist growing some bacteria colonies in fancy Petri dishes. You want to know how many cells you'll have in future experiments. You have historical data about the concentration of sugar, time of growth, and the number of cells.
So, you want to predict the number of cells based on the parameters given above. As you can see from the plots, the dependence between each feature and the target doesn't look like a straight line, so a linear regression model won't give you good results.
Let's split our task into two steps: first, we'll make predictions based on the growing time, and then we'll consider both features.
Parameters of the model
Before setting up a polynomial regression model, we need to determine its hyperparameters:
- include_bias (True or False): if True, the transformer adds a bias column, a feature that always equals 1, which plays the role of the intercept.
- order ('C' or 'F'): the memory layout of the output array; it comes into play when we're trying to make computations faster.
- interaction_only (True or False): if True, only features built from combinations of different variables are created. In our example, that means sugar*time, but not sugar*sugar or time*time.
- degree: you can pass a single integer (it will act as the maximum degree) or a pair of integers (the minimum and maximum degree of the polynomial).

Usually, we choose the parameters from prior knowledge (for example, when modeling a physical process for which we have a formula), test different degrees, and choose the one with the best score. In our bacteria example, we can estimate the degree by looking at the plot. The short sketch below shows how these parameters change the generated features.
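To make the parameters more tangible, here is a minimal sketch (not part of the bacteria example; the numbers are made up) that transforms a single toy row with two features:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

row = np.array([[2.0, 3.0]])  # one sample with two features: x1 = 2, x2 = 3

# degree=2 with the bias column gives: 1, x1, x2, x1^2, x1*x2, x2^2
full = PolynomialFeatures(degree=2, include_bias=True)
print(full.fit_transform(row))  # [[1. 2. 3. 4. 6. 9.]]

# interaction_only=True keeps only products of different features: x1, x2, x1*x2
inter = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
print(inter.fit_transform(row))  # [[2. 3. 6.]]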
Here you can see two models with predictions made on the same data. The first model was set with degree = 10. It illustrates overfitting in a polynomial regression model. The second one has degree = 2.
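If you prefer numbers to plots, here is a small sketch on synthetic data (the quadratic growth rule and the noise level are made up) showing how one could compare validation RMSE for several degrees and pick the best one:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
t = rng.uniform(0, 6, size=200).reshape(-1, 1)          # growth time
cells = 3 * t[:, 0] ** 2 + rng.normal(0, 4, size=200)   # noisy quadratic target

t_train, t_test, c_train, c_test = train_test_split(t, cells, random_state=42)

for degree in (1, 2, 5, 10):
    features = PolynomialFeatures(degree=degree, include_bias=False)
    model = LinearRegression().fit(features.fit_transform(t_train), c_train)
    preds = model.predict(features.transform(t_test))
    print(degree, mean_squared_error(c_test, preds, squared=False))
# The RMSE usually stops improving after degree 2 and may get worse for
# large degrees as the model starts to overfit.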
Considering a single feature
For the first task, let's make another dataset:
one_feature = data.drop('sugar', axis=1)
In this dataset, we have the feature X (time) and the target y (the number of cells):
X = one_feature['time']
y = one_feature['cells']
Let's create our model. Firstly, we need to import PolynomialFeatures from the preprocessing module:
from sklearn.preprocessing import PolynomialFeatures
As we saw in the plots, our data points form a parabola, so let's set the degree equal to 2. We also assume that at the start we had no cells in the dish, so the bias equals zero. We initialize the model with these parameters:
poly = PolynomialFeatures(degree=2, include_bias=False)
Now that we've initialized the transformer, let's transform the feature. After the .fit_transform() method, we'll get two columns: x and x².
transformed_features = poly.fit_transform(X.values.reshape(-1, 1))
| time | time² |
|---|---|
| 1.788937 | 3.200295 |
| 2.803923 | 7.861982 |
| 5.415256 | 29.325000 |
| 5.269010 | 27.762469 |
The .fit_transform() method requires a 2D array, so we convert X to one with .values.reshape(-1, 1).

Let's split our dataset into train and test groups.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(transformed_features, y,
                                                    train_size=0.75, random_state=42)

Making predictions
Now we are ready to fit our model with the transformed data and predict target values. Here you can see that polynomial regression is still a linear model if you consider x as x₁, x² as x₂, and so on.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
We can visualize the results with the following code:
import matplotlib.pyplot as plt

plt.figure(figsize=(9, 6))
plt.title("Polynomial regression")
plt.scatter(X_test[:, 0], y_test)
# sort the test points by time so the curve is drawn from left to right
xs, ys = zip(*sorted(zip(X_test[:, 0], y_pred)))
plt.plot(xs, ys, c="green")
plt.xlabel('time')
plt.ylabel('cells')
plt.show()
Let's calculate RMSE for our model:
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_pred, squared=False))
It is equal to 18.42. With a linear regression model, we'd have RMSE equal to 31.37.
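For comparison, here is roughly how that linear baseline could be computed; it reuses X, y, and the imports from the steps above and fits a plain linear regression on the raw time values (the exact score depends on the data):

# Hypothetical baseline: plain linear regression on the untransformed time values.
raw_time = X.values.reshape(-1, 1)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(raw_time, y,
                                                        train_size=0.75, random_state=42)
baseline = LinearRegression().fit(Xr_train, yr_train)
print(mean_squared_error(yr_test, baseline.predict(Xr_test), squared=False))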
Considering all features
Now let's move on to the second step and look at the case with two features. The algorithm stays the same.
Let's set the features and the target dataframes to work with.
X = data.drop('cells', axis=1)
y = data['cells']
Firstly, we need to initialize the transformer. Since we have multiple variables, we need to decide on the interaction_only parameter. For our model, it should be False, because we also want to capture the dependence on sugar² and time².
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
transformed_features = poly.fit_transform(X)
Here we can take a look at the first row to understand how the transformation works.
Before the transformation:
| time | sugar |
|---|---|
| 1.788937 | 1.490142 |
After the transformation:
| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1.788937 | 1.490142 | 3.200295 | 2.665771 | 2.220525 |
We have 3 new columns corresponding to the new features: time² (column 2), sugar*time (column 3), and sugar² (column 4).
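If you want to double-check which column is which, recent scikit-learn versions (1.0 and later) let you ask the transformer for the generated feature names; the exact labels depend on the column names and their order in X:

# Print the names of the generated features (requires scikit-learn >= 1.0).
print(poly.get_feature_names_out())
# e.g. ['time' 'sugar' 'time^2' 'time sugar' 'sugar^2']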
So, let's train our model and make predictions as we did in previous steps.
X_train, X_test, y_train, y_test = train_test_split(transformed_features, y,
                                                    train_size=0.75, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(mean_squared_error(y_test, y_pred, squared=False))
RMSE for our model is equal to 3.16, but if we tried to make predictions with a plain linear regression model, we'd get an RMSE of 27.98.
Conclusion
- With a polynomial regression model, you can make predictions when there are non-linear dependencies between variables.
- The polynomial regression pipeline consists of two parts: first, you transform the dataset and create new features with PolynomialFeatures(); after that, you fit a linear regression model (a combined sketch is shown after this list).
- To initialize PolynomialFeatures, you need to set the maximum polynomial degree, the presence of the bias column, and whether to generate only interactions between different variables.
- Polynomial regression models tend to overfit, so you must be cautious when setting the maximum degree.
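As a final note, the two steps can be wrapped into a single estimator with scikit-learn's Pipeline; here is a minimal sketch (the step names and the degree are illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Feature transformation and regression combined into one estimator.
poly_model = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("linreg", LinearRegression()),
])
# poly_model.fit(X_train, y_train) and poly_model.predict(X_test) then apply
# both steps in one call.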