The Support Vector Machine (SVM) is a robust machine learning algorithm that can be used for classification and regression analysis. In this topic, we will focus specifically on the Linear Support Vector Machine, a type of SVM used for binary classification. We will define and explain what Linear SVM is, provide a mathematical formulation of the algorithm, discuss Hard Margin SVM and the Maximum Margin Hyperplane, and explain the optimization problem and how to find the Maximum Margin Hyperplane. Finally, we will provide a toy example to illustrate the application of the algorithm.
Introduction to SVM
Imagine you have a bunch of apples and oranges and want to separate them into two groups based on their color, shape, and size. You can use SVM to help you find the best way to draw a line between the two groups so that the apples are on one side and the oranges are on the other side.
SVM is particularly useful when working with datasets that have many features: it remains effective in high-dimensional spaces, and its decision boundary depends only on a subset of the training points (the support vectors) rather than on every example in the data set.
SVM has various applications in real-world problems, including image and text classification, bioinformatics, and finance, making it an essential and widely-used tool in both research and industry.
There are two main types of SVM: hard-margin SVM and soft-margin SVM. Hard-margin SVM aims to separate the two classes perfectly, while soft-margin SVM allows for some misclassifications. Hard-margin SVM is suitable only for linearly separable data, while soft-margin SVM can handle data that is not perfectly separable by tolerating some margin violations (and, like any SVM, it can additionally be combined with the kernel trick to produce non-linear decision boundaries). The choice between hard and soft margin SVM depends on the data and the desired tolerance for misclassification.
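To make the distinction concrete, here is a minimal sketch, assuming scikit-learn is available. Its SVC class implements the soft-margin formulation, so a very large regularization parameter C approximates hard-margin behavior, while a small C tolerates more violations; the specific values below are illustrative choices, not prescriptions.

```python
# Minimal sketch: approximating hard-margin vs. soft-margin behavior with
# scikit-learn's SVC (which implements the soft-margin formulation).
from sklearn.svm import SVC

# A very large C leaves almost no tolerance for margin violations,
# so the fitted model behaves approximately like a hard-margin SVM.
hard_like = SVC(kernel="linear", C=1e9)

# A small C tolerates more misclassifications in exchange for a wider margin.
soft = SVC(kernel="linear", C=0.1)
```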
Linear support vector machines
The Linear Support Vector Machine is a type of SVM used for binary classification. Binary classification means separating data into two groups or classes, such as spam and legitimate emails.
The key idea behind Linear SVM is to find the hyperplane that maximizes the margin between the two classes. There can be infinitely many lines separating the two classes; which one do we choose? The answer is to select the optimal hyperplane that correctly separates the two classes while staying as far as possible from both. The distance from the optimal hyperplane to the closest points from each class is known as the margin (and those closest points are the support vectors). The margin is important because it provides a measure of how confident we are in the classification of new data points. The larger the margin, the more likely new data points will be classified correctly.
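As a quick illustration, here is a minimal sketch (assuming scikit-learn and NumPy) that fits a linear SVM on a small, made-up two-class data set and reads off the learned hyperplane parameters. The data and the large C value are illustrative assumptions, not taken from the text.

```python
# Minimal sketch: fit a linear SVM on made-up data and inspect the hyperplane.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.0],        # class +1
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # large C: approximately hard-margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]   # hyperplane parameters w and b
print("w =", w, "b =", b)
print("margin width =", 2.0 / np.linalg.norm(w))
```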
The mathematical formulation of the linear SVM algorithm
Let's revisit a common scenario where we need to create a linear boundary to distinguish between two groups, one marked in red and the other in blue.
The challenge is to draw the optimal straight line or the optimal hyperplane that maximizes the distance between red and blue samples to create a wide margin.
Can we approach this problem using mathematics? Consider the graph shown below:
First goal: we want an optimal hyperplane that classifies our training examples correctly based on their input features $\vec{x}_i$ and their corresponding class labels $y_i$.
To make it more convenient, we introduce a label variable $y_i$ so that:
$y_i = +1$ for positive samples and
$y_i = -1$ for negative samples.
To determine the ideal decision rule for the optimal hyperplane, we examine two vectors: a vector $\vec{u}$ pointing to a randomly chosen point on the graph and a vector $\vec{w}$ that is perpendicular to the optimal hyperplane. Additionally, we use $c$ as a variable to denote the distance between the origin and the optimal hyperplane. The following relationships among these quantities are crucial for establishing the decision rule:
If the dot product of vector $\vec{w}$ and vector $\vec{u}$ is larger than the distance $c$, $\vec{w} \cdot \vec{u} > c$, the point is a positive sample.
If the dot product of vector $\vec{w}$ and vector $\vec{u}$ is smaller than the distance $c$, $\vec{w} \cdot \vec{u} < c$, the point is a negative sample.
If the dot product of vector $\vec{w}$ and vector $\vec{u}$ equals the distance $c$, $\vec{w} \cdot \vec{u} = c$, the point lies on the optimal hyperplane.
Once we have familiarized ourselves with these relationships, we can proceed to establish the decision rule. Our objective is to determine whether the dot product of vector $\vec{w}$ and vector $\vec{u}$ is greater than or equal to $c$:
$$\vec{w} \cdot \vec{u} \ge c$$
It is possible to substitute $c$ with $-b$ without compromising the general applicability of the statement; here, $b$ is the bias used to shift the decision hyperplane away from the origin. As a result, the rule can be reformulated as follows:
$$\vec{w} \cdot \vec{u} + b \ge 0 \implies \text{positive sample}$$
So, we have two rules for positive and negative samples, respectively (the constant 1 on the right-hand side is a scaling convention: we can always rescale $\vec{w}$ and $b$ so that the closest samples satisfy these inequalities with equality):
$$\vec{w} \cdot \vec{x}_+ + b \ge 1 \quad \text{for positive samples}$$
$$\vec{w} \cdot \vec{x}_- + b \le -1 \quad \text{for negative samples}$$
As a result, using the labels $y_i$ introduced earlier, we can rewrite and combine the equations as follows:
$$y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1$$
For the samples that lie exactly on the margin boundaries (the positive and negative hyperplanes), this constraint holds with equality, so its value minus one is equal to 0:
$$y_i(\vec{w} \cdot \vec{x}_i + b) - 1 = 0$$
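The decision rule can be written directly in code. Below is a minimal sketch with placeholder values for $\vec{w}$ and $b$ (they are not derived from any data set in this text); it simply classifies a point by the sign of $\vec{w} \cdot \vec{u} + b$.

```python
# Minimal sketch of the decision rule: classify a point by the sign of w . u + b.
import numpy as np

w = np.array([1.0, -1.0])   # hypothetical normal vector of the hyperplane
b = -0.5                    # hypothetical bias

def classify(u):
    score = np.dot(w, u) + b
    if score > 0:
        return +1           # positive sample
    elif score < 0:
        return -1           # negative sample
    return 0                # the point lies exactly on the hyperplane

print(classify(np.array([2.0, 0.0])))   # +1
print(classify(np.array([0.0, 2.0])))   # -1
```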
Deriving the margin equation
Let $\vec{x}_+$ be a positive example so that $\vec{w} \cdot \vec{x}_+ + b = 1$, and $\vec{x}_-$ be a negative example so that $\vec{w} \cdot \vec{x}_- + b = -1$.
The width of the margin is the projection of $\vec{x}_+ - \vec{x}_-$ onto the unit normal vector $\frac{\vec{w}}{\lVert \vec{w} \rVert}$, that is, the dot product of $(\vec{x}_+ - \vec{x}_-)$ and $\frac{\vec{w}}{\lVert \vec{w} \rVert}$.
To calculate the margin width, take the dot product of the difference between the closest points from each group and the unit vector perpendicular to the optimal hyperplane. This projects the separation between the closest points onto the direction perpendicular to the hyperplane. Substituting the equalities that define $\vec{x}_+$ and $\vec{x}_-$ gives:
$$\text{width} = (\vec{x}_+ - \vec{x}_-) \cdot \frac{\vec{w}}{\lVert \vec{w} \rVert} = \frac{(1 - b) - (-1 - b)}{\lVert \vec{w} \rVert} = \frac{2}{\lVert \vec{w} \rVert}$$
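Here is a minimal sketch of that projection in NumPy. The vectors below are placeholders chosen so that the two points lie on the margin boundaries for $\vec{w} = (1, 1)$ and $b = -2$; they are not data from the text.

```python
# Minimal sketch of the margin-width computation: project the difference between
# one positive and one negative support vector onto the unit normal w / ||w||.
import numpy as np

w = np.array([1.0, 1.0])       # hypothetical normal vector (with bias b = -2)
x_pos = np.array([2.0, 1.0])   # hypothetical positive support vector (w.x + b = +1)
x_neg = np.array([1.0, 0.0])   # hypothetical negative support vector (w.x + b = -1)

width = np.dot(x_pos - x_neg, w / np.linalg.norm(w))
print(width)                   # equals 2 / ||w||, about 1.414 here
```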
Second goal: maximize the margin of this optimal hyperplane. We want to maximize the margin because it leads to better generalization performance of the model. When the margin is maximized, the optimal hyperplane is positioned as far away as possible from the data points, which reduces the risk of overfitting the model to the training data. Moreover, maximizing the margin improves the robustness of the model to noise and outliers in the data. When the margin is wider, data points can be farther away from the optimal hyperplane and still be correctly classified. This means that small fluctuations in the data or random errors are less likely to change the classification of a data point.
Maximize margin:
$$\max_{\vec{w},\, b} \; \frac{2}{\lVert \vec{w} \rVert}$$
We want to find the optimal $\vec{w}$ that maximizes $\text{width}(\vec{w}) = \frac{2}{\lVert \vec{w} \rVert}$.
We can take the inverse to turn it into a minimization problem:
$$\min_{\vec{w},\, b} \; \frac{\lVert \vec{w} \rVert}{2}$$
To simplify the computation of the gradient, we take the squared norm of $\vec{w}$. This is the SVM loss function we want to minimize:
$$\min_{\vec{w},\, b} \; \frac{1}{2}\lVert \vec{w} \rVert^2$$
We turn the margin maximization into a minimization problem because it lets us apply well-known convex optimization techniques to find the optimal hyperplane that maximizes the margin.
Optimization problem
The margin has two parts: the distance from the hyperplane to the closest red point and the distance from the hyperplane to the closest blue point; we want to maximize their sum. The SVM's goal is to find the line with the largest margin, which is called the maximum margin hyperplane. This line is the best boundary between the two groups because it has the most space between them, so there's less chance of making a mistake when sorting new data points in the future.
Combining goal 1 and goal 2:
$$\min_{\vec{w},\, b} \; \frac{1}{2}\lVert \vec{w} \rVert^2 \quad \text{subject to} \quad y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1 \;\; \text{for all } i$$
Here, $y_i$ is the label of the $i$-th data point, and $\vec{x}_i$ is its corresponding feature vector. This optimization problem is a quadratic programming problem and can be solved using techniques such as the Lagrange multiplier method.
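To make the quadratic program concrete, here is a minimal sketch that solves the hard-margin problem numerically with SciPy's general-purpose SLSQP solver rather than a dedicated SVM routine. The tiny linearly separable data set is made up for illustration, and in practice a dedicated QP or SVM solver would be preferable.

```python
# Minimal sketch: solve the hard-margin SVM quadratic program with scipy.optimize.
import numpy as np
from scipy.optimize import minimize

# Hypothetical, linearly separable toy data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0],    # positive class
              [0.0, 0.0], [1.0, 0.0]])   # negative class
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    # Minimize (1/2) ||w||^2; params = [w1, w2, b].
    w = params[:2]
    return 0.5 * np.dot(w, w)

def constraints(params):
    # Enforce y_i (w . x_i + b) - 1 >= 0 for every training point.
    w, b = params[:2], params[2]
    return y * (X @ w + b) - 1.0

result = minimize(objective, x0=np.zeros(3), method="SLSQP",
                  constraints=[{"type": "ineq", "fun": constraints}])

w_opt, b_opt = result.x[:2], result.x[2]
print("w =", w_opt, "b =", b_opt)
print("margin width =", 2.0 / np.linalg.norm(w_opt))
```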
We aim to determine the optimal values of and that maximize the margin while accurately classifying all training examples, using a technique called hard-margin support vector machine. This approach is called "hard" because it strictly enforces the margin requirement, which means no points are allowed to be within the margin.
Hard margin SVM
Hard margin SVM is a type of Support Vector Machine that aims to find the hyperplane that perfectly separates two classes without misclassification. This is done by maximizing the margin, which is the distance between the hyperplane and the closest points from each category, called support vectors. The margin width is twice the distance between the hyperplane and a single support vector. In a hard-margin SVM, any data point that falls on the wrong side of the hyperplane is considered an outlier. The goal is to find the hyperplane with the largest margin possible while ensuring that all data points are correctly classified.
Since hard margin SVM doesn't allow classification errors, the only errors that can occur are distance errors, meaning that some data points may be classified correctly but fall within the margin or on the wrong side of the margin. Therefore, hard margin SVM is sensitive to outliers and noise in the data, and it may not perform well if the data is not linearly separable or if there are too many outliers.
Hard-margin SVM is defined using an error function that penalizes only a narrow margin, i.e., it tries to maximize the distance between the margin boundaries. That is, the error in hard-margin SVM consists solely of the distance error.
Toy example
Let's analyze an example to better understand how the optimal hyperplane is obtained from solving the optimization problem. In this case, we will focus on a two-dimensional example with a hyperplane defined by the parameters $\vec{w} = (w_1, w_2)$ and $b$. Let's consider the toy data set. The data matrix, target values, and separability constraints are summarized below. The inequality on a particular row is the separability constraint for the corresponding data point in that row.
Once the data has been mapped, SVM tries to find the line that maximizes the margin between the classes. The margin is the distance between the line and the closest data points from each class. The line that maximizes the margin is known as the "maximum margin hyperplane".
From the graph, we can easily identify the points that lie closest to the boundary between the two classes. Those points are the support vectors.
Combining constraint and gives:
Combining constraint and gives:
This means that with equality when and .
We can verify that
satisfies all five constraints, minimizes $\frac{1}{2}\lVert \vec{w} \rVert^2$, and gives the optimal hyperplane as a result. The optimal hyperplane is shown in the following chart:
Optimal Hyperplane:
The error function in the following example is the squared norm of $\vec{w}$. To find the margin, we can divide 1 by the square root of the error function to take the magnitude: the distance from the optimal hyperplane to either margin boundary is $\frac{1}{\lVert \vec{w} \rVert}$.
The total width of the margin, from the negative hyperplane through the optimal hyperplane to the positive hyperplane, is $\frac{2}{\lVert \vec{w} \rVert}$.
The boxed data points are those whose separability constraints are exactly met:
For the data points that meet their constraints exactly, $y_i(\vec{w} \cdot \vec{x}_i + b) = 1$. These data points prevent the positive and negative hyperplanes from spreading further; that's why they are called support vectors.
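As a quick check in code, support vectors can be identified by testing which separability constraints hold with equality. The values of $\vec{w}$, $b$, and the points below are placeholders, not the data from this example.

```python
# Minimal sketch: given a fitted w and b, the support vectors are the training
# points whose constraint is met with equality, i.e. y_i (w . x_i + b) = 1.
import numpy as np

w, b = np.array([1.0, 0.0]), -2.0                      # hypothetical hyperplane
X = np.array([[1.0, 1.0], [3.0, 2.0], [4.0, 0.0]])     # hypothetical points
y = np.array([-1.0, 1.0, 1.0])

margins = y * (X @ w + b)
support_vectors = X[np.isclose(margins, 1.0)]
print(support_vectors)   # the points that lie exactly on the margin boundaries
```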
When the error function is large, the margin is small. This is a bad classifier, as the distance from the optimal hyperplane to the positive/negative hyperplane is small, which can result in data points falling inside the margin.
Conclusion
Linear SVM is a widely used algorithm for binary classification tasks.
Hard-margin SVM is a variant of linear SVM that aims to find a hyperplane that perfectly separates two classes without any classification errors.
Hard-margin SVM is sensitive to outliers and noise in the data and may not perform well if the data is not linearly separable or if there are too many outliers.
The optimization problem for hard-margin SVM involves minimizing the norm of the weight vector subject to the constraint that all data points are correctly classified.