Description
In this project, we will work on a classification algorithm that makes predictions when the dependent variable takes discrete values. Logistic regression is arguably the simplest solution. In the case of binary classification (class 0 or class 1), it uses a sigmoid function to estimate how likely it is that an observation belongs to class 1. It looks as follows:
$$S(t) = \frac{1}{1 + e^{-t}}$$
where
$$t = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
$b_0$ is the bias; $b_1, b_2, \dots, b_n$ are the coefficients. You can fit logistic models without a bias term. In this case, $t$ would look like this:
$$t = b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
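The sigmoid and the linear term t can be tried out on a few numbers before writing the class. The sketch below uses only the standard library; the helper name `linear_term` and the input values are made up for illustration and are not part of the project.

```python
import math


def sigmoid(t):
    """Map a real number t to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))


def linear_term(x, b, fit_intercept=True):
    """Compute t from feature values x and parameters b (hypothetical helper).

    With a bias, b = [bias, coef_1, ..., coef_n];
    without one, b = [coef_1, ..., coef_n].
    """
    if fit_intercept:
        return b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    return sum(bi * xi for bi, xi in zip(b, x))


# Made-up numbers, chosen only to illustrate the formula.
x = [2.0, -1.0]
t_with_bias = linear_term(x, [0.5, 1.0, 2.0])                 # 0.5 + 2.0 - 2.0
t_no_bias = linear_term(x, [1.0, 2.0], fit_intercept=False)   # 2.0 - 2.0

print(sigmoid(t_with_bias))
print(sigmoid(t_no_bias))
```

When t is 0, the sigmoid returns exactly 0.5. Positive t pushes the estimate towards class 1, negative t towards class 0.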
To complete this stage, the CustomLogisticRegression class should contain the following attributes and methods:
class CustomLogisticRegression:

    def __init__(self, fit_intercept=True, l_rate=0.01, n_epoch=100):
        self.fit_intercept = ...
        self.l_rate = ...
        self.n_epoch = ...

    def sigmoid(self, t):
        return ...

    def predict_proba(self, row, coef_):
        t = ...
        return self.sigmoid(t)

The fit_intercept attribute is set to True when the logistic model is fitted with a bias (t contains the bias term). Otherwise, it is set to False, and the logistic model is fitted without a bias (t doesn't contain the bias term). We will discuss the l_rate and n_epoch attributes in later stages.
The sigmoid method, based on the formula above, takes a value of t and returns a float between 0 and 1. The predict_proba method:
- Takes a row from a dataset and coefficients including a bias (designated as coef_);
- Calculates t;
- Calls the sigmoid method and then returns its value.
The coef_ array is provided in this stage. The values for the bias and coefficients are obtained by minimizing the cost function with gradient descent. You will be able to retrieve these values in later stages. row is a single row of the dataframe. In t, row and coef_ correspond to the feature values $x_i$ and the parameters $b_i$, respectively. Remember that the bias is present only when fit_intercept is True. You can get t from row and coef_ with numpy.dot.
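One way to combine row and coef_ with numpy.dot is sketched below. It assumes that, when a bias is used, it is stored as the first element of coef_. The standalone function `predict_proba_sketch` is a hypothetical stand-in for the class method, and the input values are illustrative only.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def predict_proba_sketch(row, coef_, fit_intercept=True):
    """Hypothetical standalone version of predict_proba for a single row."""
    row = np.asarray(row, dtype=float)
    coef_ = np.asarray(coef_, dtype=float)
    if fit_intercept:
        # Assumption: coef_[0] is the bias; the rest pair up with the features.
        t = coef_[0] + np.dot(row, coef_[1:])
    else:
        # No bias: every entry of coef_ pairs up with a feature.
        t = np.dot(row, coef_)
    return float(sigmoid(t))


# Illustrative values only, not taken from the dataset. Both calls give t = 0.
p_with_bias = predict_proba_sketch([1.0, -1.0], [0.0, 2.0, 2.0])
p_no_bias = predict_proba_sketch([1.0, -1.0], [2.0, 2.0], fit_intercept=False)
print(p_with_bias, p_no_bias)
```

Slicing the bias off before the dot product keeps row unchanged. Prepending a constant 1 to row would be an equivalent alternative.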
In this project, we will work with the Breast Cancer Wisconsin dataset from the sklearn library. Learn carefully how to load and use the dataset from the given link. The target variable is denoted by y. The matrix of selected independent variables will be referred to as X.
You may want to standardize the features if they are measured in different units. Suppose $x$ is a feature. Z-standardization can be calculated as follows:
$$z_i = \frac{x_i - \mu}{\sigma}$$
$z_i$ is the standard score for the $i$-th sample of the feature $x$; $x_i$ is the value of the $i$-th sample in $x$; $\mu$ and $\sigma$ are the mean and standard deviation of $x$, respectively. See an example below where the standard score is calculated with $\mu = 2$ and $\sigma \approx 3.2071$:
x | 1 | 3 | -3 | 4 | 5 | 6 | -2 |
z | -0.3118 | 0.3118 | -1.559 | 0.6236 | 0.9354 | 1.2472 | -1.2472 |
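The standard scores for the sample values 1, 3, -3, 4, 5, 6, -2 can be recomputed with the standard library. This sketch assumes the population standard deviation (dividing by the number of samples), which is what the tabulated magnitudes correspond to.

```python
from statistics import fmean, pstdev

x = [1, 3, -3, 4, 5, 6, -2]

mu = fmean(x)       # mean of the feature
sigma = pstdev(x)   # population standard deviation (assumption, see lead-in)

z = [round((xi - mu) / sigma, 4) for xi in x]

print(mu, round(sigma, 4))
print(z)
```

Values above the mean, such as x = 3, get a positive score, and values below it get a negative one.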
Objectives
- Create the CustomLogisticRegression class;
- Create the __init__ method;
- Create the sigmoid method;
- Create the predict_proba method;
- Load the Breast Cancer Wisconsin dataset. Select worst concave points and worst perimeter as features and target as the target variable;
- Standardize X;
- Split the dataset including the target variable into training and test sets. Set train_size=0.8 and random_state=43;
- Given the coefficients below, calculate the probabilities of the first 10 rows in the test set. You don't need the training set in this stage;
- Print these probabilities as a Python list.
Take the following values as bias and coefficients: [0.77001597, -2.12842434, -2.39305793]. It means that the model is fitted with an intercept, so t contains the bias term of 0.77001597; the coefficient of worst concave points is -2.12842434, and the coefficient of worst perimeter is -2.39305793.
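The data preparation steps (load, select two features, standardize, split) might look roughly like the sketch below. It assumes scikit-learn and NumPy are installed, standardizes each column with its population standard deviation, and standardizes before splitting. Whether the project checker expects exactly these choices is an assumption.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()

# Locate the two requested features by name in the full feature matrix.
names = list(data.feature_names)
cols = [names.index("worst concave points"), names.index("worst perimeter")]

X = data.data[:, cols]
y = data.target

# Z-standardize each column: subtract its mean, divide by its standard deviation.
X = (X - X.mean(axis=0)) / X.std(axis=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=43
)

print(X_train.shape, X_test.shape)
print(X_test[:10])  # the rows whose probabilities are needed in this stage
```

The column order matters. The first column must be worst concave points and the second worst perimeter, so that they line up with the coefficients given above.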
Examples
Example 1: an example test set; features have been standardized
Standardized X_test and y_test data
worst concave points | worst perimeter | y |
0.320904 | 0.230304 | 1.0 |
-1.743529 | -0.954428 | 1.0 |
1.014627 | 0.780857 | 0.0 |
1.432990 | -0.132764 | 0.0 |
Output:
[0.38601, 0.99885, 0.03703, 0.12322]
Example 2: an example test set; features have been standardized
Standardized X_test and y_test data
worst concave points | worst perimeter | y |
0.106398 | 0.646939 | 0.0 |
0.320904 | 0.075553 | 1.0 |
-0.244875 | -0.465477 | 1.0 |
0.646467 | -1.077931 | 1.0 |
Output:
[0.26804, 0.47657, 0.91722, 0.878]
Example 3: an example test set; features have been standardized
Standardized X_test and y_test data
worst concave points | worst perimeter | y |
-0.030521 | -0.231566 | 1.0 |
-0.899652 | -0.595824 | 1.0 |
0.918783 | -1.242799 | 1.0 |
0.792514 | 0.420765 | 0.0 |
Output:
[0.80045, 0.98387, 0.85675, 0.12745]
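As a sanity check, the rows of Example 1 can be passed through the sigmoid using the bias and coefficients listed in the Objectives. This is a minimal sketch that assumes NumPy, and assumes that the first data column is worst concave points and the second is worst perimeter. The printed values should agree with the Output of Example 1 up to rounding.

```python
import numpy as np

# Bias first, then the coefficients for worst concave points and worst perimeter.
coef_ = np.array([0.77001597, -2.12842434, -2.39305793])

# Standardized feature rows copied from Example 1 (target column omitted).
X_example = np.array([
    [0.320904, 0.230304],
    [-1.743529, -0.954428],
    [1.014627, 0.780857],
    [1.432990, -0.132764],
])

t = coef_[0] + X_example @ coef_[1:]   # linear term for every row at once
probs = 1.0 / (1.0 + np.exp(-t))       # sigmoid applied element-wise

print([round(float(p), 5) for p in probs])
```

Because the table values are themselves rounded to six decimals, the last printed digit may differ slightly from the listed output.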