
Support Vector Machines (SVMs) are robust machine learning algorithms that can be used for classification and regression analysis. In this topic, we will focus specifically on the Linear Support Vector Machine, a type of SVM used for binary classification. We will define and explain what Linear SVM is, provide a mathematical formulation of the algorithm, discuss Hard Margin SVM and the Maximum Margin Hyperplane, and explain the optimization problem and how to find the Maximum Margin Hyperplane. Finally, we will provide a toy example to illustrate the application of the algorithm.

Introduction to SVM

Imagine you have a bunch of apples and oranges and want to separate them into two groups based on their color, shape, and size. You can use SVM to help you find the best way to draw a line between the two groups so that the apples are on one side and the oranges are on the other side.

SVM is particularly useful when working with datasets that have many features: because the decision boundary depends only on a small number of support vectors and the margin is maximized, it tends to generalize well even in high-dimensional feature spaces.

SVM has various applications in real-world problems, including image and text classification, bioinformatics, and finance, making it an essential and widely-used tool in both research and industry.

There are two main types of SVM: hard-margin SVM and soft-margin SVM. Hard-margin SVM aims to separate the two classes perfectly, while soft-margin SVM allows for some misclassifications. Hard-margin SVM requires linearly separable data, while soft-margin SVM can handle data that is not perfectly separable by tolerating some margin violations; for genuinely non-linear boundaries, SVM can additionally use the kernel trick. The choice between hard and soft margin depends on the data and the desired tolerance for misclassification.
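As a rough illustration of this choice, the sketch below uses scikit-learn's SVC with a linear kernel on a made-up, linearly separable dataset (scikit-learn and NumPy assumed installed). A very large regularization parameter C approximates hard-margin behavior, while a small C gives a softer margin; this is a sketch, not the only way to do it.

```python
# Minimal sketch: hard-like vs. soft margin via the C parameter.
import numpy as np
from sklearn.svm import SVC

# A small, linearly separable 2D dataset (illustrative only)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.0], [3.0, 1.0], [4.0, 1.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C penalizes margin violations heavily, approximating a hard margin;
# a small C tolerates violations, giving a soft margin.
hard_like = SVC(kernel="linear", C=1e6).fit(X, y)
soft = SVC(kernel="linear", C=0.1).fit(X, y)

print("hard-like w, b:", hard_like.coef_[0], hard_like.intercept_[0])
print("soft      w, b:", soft.coef_[0], soft.intercept_[0])
```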

Linear support vector machines

Linear Support Vector Machines (SVM) is a type of SVM used for binary classification. Binary classification means separating data into two groups or classes, such as spam and legitimate emails.

The key idea behind Linear SVM is to find the hyperplane that maximizes the margin between the two classes. There can be infinitely many lines separating the two classes; which one do we choose? The answer is to select an optimal hyperplane that correctly separates the two classes while staying as far as possible from both. The distance from the optimal hyperplane to the closest points of each class is known as the margin (those closest points are the support vectors). The margin is important because it provides a measure of how confident we are in the classification of new data points. The larger the margin, the more likely new data points will be classified correctly.

The mathematical formulation of the linear SVM algorithm

Let's revisit a common scenario where we need to create a linear boundary to distinguish between two groups, one marked in red and the other in blue.

Example dataset

The challenge is to draw the optimal straight line or the optimal hyperplane that maximizes the distance between red and blue samples to create a wide margin.

Can we approach this problem using mathematics? Consider the graph shown below:

Optimal hyperplane

First goal: we want an optimal hyperplane that classifies our training examples correctly based on their input features ($x$) and their corresponding class labels ($y$).

To make it more convenient, we introduce a variable $y_i$ so that:

$y_i = +1$ for positive samples and

$y_i = -1$ for negative samples.
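For illustration only, here is a tiny NumPy sketch of this $\pm 1$ encoding; the raw label names used are hypothetical:

```python
import numpy as np

# Hypothetical raw labels for a binary problem
labels = np.array(["spam", "legit", "spam", "legit", "legit"])

# Encode the positive class as +1 and the negative class as -1
y = np.where(labels == "spam", 1, -1)
print(y)  # [ 1 -1  1 -1 -1]
```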

To determine the ideal decision rule for the optimal hyperplane, we examine two vectors: a randomly chosen point on the graph ($\vec{u}$) and a vector that is perpendicular to the optimal hyperplane ($\vec{w}$). Additionally, we use $c$ as a variable to denote the distance between the origin and the optimal hyperplane. The following relationships among these quantities are crucial for establishing the decision rule:

  • If the dot product of vector $\vec{u}$ and vector $\vec{w}$ is larger than the distance $c$, $\vec{w} \cdot \vec{u} > c$, the point is a positive sample.

  • If the dot product of vector $\vec{u}$ and vector $\vec{w}$ is smaller than the distance $c$, $\vec{w} \cdot \vec{u} < c$, the point is a negative sample.

  • If the dot product of vector $\vec{u}$ and vector $\vec{w}$ equals the distance $c$, $\vec{w} \cdot \vec{u} = c$, the point lies on the optimal hyperplane.

Once we have familiarized ourselves with these relationships, we can proceed to establish the decision rule. Our objective is to determine whether the dot product of vector $\vec{u}$ and vector $\vec{w}$ is greater than or equal to $c$: $\vec{w} \cdot \vec{u} \geq c$

It is possible to substitute $c$ with $-b$ without compromising the general applicability of the statement. Here $b$ is the bias used to shift the decision hyperplane to improve the classification accuracy, and as a result, the rule can be reformulated as follows: $\vec{w} \cdot \vec{u} + b \geq 0$
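A minimal NumPy sketch of this decision rule, assuming the parameters $\vec{w}$ and $b$ have already been determined (the values below are purely illustrative):

```python
import numpy as np

def decision_rule(u, w, b):
    """Return +1 if w . u + b >= 0 (positive side), otherwise -1."""
    return 1 if np.dot(w, u) + b >= 0 else -1

# Illustrative hyperplane parameters (not learned from data here)
w = np.array([1.0, -1.0])
b = -1.0

print(decision_rule(np.array([3.0, 0.0]), w, b))  # 1  (positive side)
print(decision_rule(np.array([0.0, 2.0]), w, b))  # -1 (negative side)
```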

So, we have two rules for positive and negative samples, respectively.

$\vec{w} \cdot \vec{x}_+ + b \geq 1$
$\vec{w} \cdot \vec{x}_- + b \leq -1$

As a result, we can rewrite and combine the equations: multiplying each inequality by its label $y_i$ flips the sign for the negative class, so both conditions collapse into a single constraint:

$y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1$

For the samples that lie exactly on the margin boundaries, the constraint is satisfied with equality:

$y_i(\vec{w} \cdot \vec{x}_i + b) - 1 = 0$

Deriving the margin equation

Derivation of margin equation

Let $x_+$ be a positive example so that $\vec{w} \cdot x_+ + b = 1$, and $x_-$ be a negative example so that $\vec{w} \cdot x_- + b = -1$.

The width of the margin is the projection of $x_+ - x_-$ onto the unit normal vector, that is, the dot product of $x_+ - x_-$ and $\frac{\vec{w}}{\lVert \vec{w} \rVert}$.

$\begin{aligned} \text{width} &= (x_+ - x_-) \cdot \frac{\vec{w}}{\lVert \vec{w} \rVert} \\ &= \frac{(x_+ - x_-) \cdot \vec{w}}{\lVert \vec{w} \rVert} \\ &= \frac{x_+ \cdot \vec{w} - x_- \cdot \vec{w}}{\lVert \vec{w} \rVert} \\ &= \frac{(1 - b) - (-1 - b)}{\lVert \vec{w} \rVert} \\ &= \frac{2}{\lVert \vec{w} \rVert} \end{aligned}$

To calculate the margin width, take the difference between the closest points from each group and project it onto the unit vector perpendicular to the optimal hyperplane. This measures the distance between the two margin boundaries along the direction normal to the hyperplane.
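A quick NumPy check of the width formula, using an illustrative weight vector (any $\vec{w}$ would do):

```python
import numpy as np

w = np.array([1.0, -1.0])        # illustrative weight vector
width = 2 / np.linalg.norm(w)    # width = 2 / ||w||
print(width)                     # ~1.414
```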

Second goal: maximize the margin of this optimal hyperplane. We want to maximize the margin because it leads to better generalization performance of the model. When the margin is maximized, the optimal hyperplane is positioned as far away as possible from the data points, which reduces the risk of overfitting the model to the training data. Moreover, maximizing the margin improves the robustness of the model to noise and outliers in the data. When the margin is wider, data points can be farther away from the optimal hyperplane and still be correctly classified. This means that small fluctuations in the data or random errors are less likely to change the classification of a data point.

Maximize margin:

$\underset{b,\, \vec{w}}{\max} \ \text{width}(\vec{w}) = \underset{b,\, \vec{w}}{\max} \frac{2}{\lVert \vec{w} \rVert}$

We want to find the optimal $\vec{w}$ that maximizes $\text{width}(\vec{w})$.

We can take the reciprocal to turn it into an equivalent minimization problem:

$\underset{b,\, \vec{w}}{\max} \frac{2}{\lVert \vec{w} \rVert} \;\Longleftrightarrow\; \underset{b,\, \vec{w}}{\min} \frac{1}{2} \lVert \vec{w} \rVert$

To simplify the computation of the gradient, we take the squared norm of $\vec{w}$. This is the SVM loss function we want to minimize:

$\underset{b,\, \vec{w}}{\min} \frac{1}{2} \vec{w}^{\intercal}\vec{w} = \underset{b,\, \vec{w}}{\min} \frac{1}{2} \lVert \vec{w} \rVert^{2}$
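The squared norm is convenient because its gradient is linear in $\vec{w}$: $\nabla_{\vec{w}}\left(\frac{1}{2}\lVert \vec{w} \rVert^{2}\right) = \vec{w}$, whereas the gradient of the plain norm, $\nabla_{\vec{w}} \lVert \vec{w} \rVert = \frac{\vec{w}}{\lVert \vec{w} \rVert}$, is undefined at $\vec{w} = 0$. Squaring also does not change the location of the minimum, since the norm is non-negative.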

We recast margin maximization as a minimization problem because the resulting form can be solved with well-known convex optimization techniques, while still yielding the hyperplane that maximizes the margin.

Optimization problem

Optimization problem

Let's say $w_1$ and $w_2$ are the margins from the red and blue classes, respectively. We want to maximize this margin. The SVM's goal is to find the line with the largest margin, which is called the maximum margin hyperplane. This line is the best boundary between the two groups because it has the most space between them, so there's less chance of making a mistake when sorting new data points in the future.

Combining goal 1 and goal 2:

$\underset{\vec{w},\, b}{\min} \frac{1}{2} \lVert \vec{w} \rVert^{2} \quad \text{subject to} \quad y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1 \quad \text{for all } i = 1, \dots, N$

Here, $y_i$ is the label of the $i$-th data point, and $\vec{x}_i$ is its corresponding feature vector. This optimization problem is a quadratic programming problem and can be solved using techniques such as the Lagrange multiplier method.
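As one possible way to solve this problem in code (a sketch, not the standard SVM implementation), the snippet below passes the objective and the per-point constraints to SciPy's general-purpose SLSQP solver on a made-up linearly separable dataset; real SVM libraries use specialized quadratic programming or dual solvers instead.

```python
# Sketch of the hard-margin optimization with a general-purpose solver
# (assumes NumPy and SciPy are installed); dedicated QP/SVM solvers are
# preferable in practice.
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable 2D data
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.0], [3.0, 1.0], [4.0, 1.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

def objective(params):
    w = params[:-1]                      # last entry is the bias b
    return 0.5 * np.dot(w, w)            # (1/2) ||w||^2

# One inequality constraint per data point: y_i (w . x_i + b) - 1 >= 0
constraints = [
    {"type": "ineq",
     "fun": lambda p, i=i: y[i] * (np.dot(p[:-1], X[i]) + p[-1]) - 1.0}
    for i in range(len(y))
]

result = minimize(objective, x0=np.zeros(X.shape[1] + 1),
                  method="SLSQP", constraints=constraints)
w_opt, b_opt = result.x[:-1], result.x[-1]
print("w:", w_opt, "b:", b_opt)
```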

We aim to determine the optimal values of ww and bb that maximize the margin while accurately classifying all training examples, using a technique called hard-margin support vector machine. This approach is called "hard" because it strictly enforces the margin requirement, which means no points are allowed to be within the margin.

Hard margin SVM

Hard margin SVM is a type of Support Vector Machine that aims to find the hyperplane that perfectly separates two classes without misclassification. This is done by maximizing the margin, which is the distance between the hyperplane and the closest points from each category, called support vectors. The margin width is twice the distance between the hyperplane and a single support vector. In a hard-margin SVM, any data point that falls on the wrong side of the hyperplane is considered an outlier. The goal is to find the hyperplane with the largest margin possible while ensuring that all data points are correctly classified.

Since hard margin SVM doesn't allow classification errors, the only quantity its error function accounts for is the distance error, i.e. how wide the margin is: every training point must be correctly classified and must lie on or outside the margin. Therefore, hard margin SVM is sensitive to outliers and noise in the data, and it may not perform well if the data is not linearly separable or if there are too many outliers.

Hard Margin SVM is defined using an error function that accounts only for the distance between the margin boundaries, which it tries to maximize. That is, the error in hard margin SVM consists of the distance error alone.

$\text{Error} = \text{Distance Error}$

Toy example

Let's analyze an example to understand better how the optimal hyperplane $(b^*, \vec{w}^*)$ is obtained from solving the optimization problem. We will focus on a two-dimensional hyperplane defined by the parameters $(b, w_1, w_2)$ in this case. Let's consider the toy data set. The data matrix, target values, and separability constraints are summarized below. The inequality on a particular row is the separability constraint for the corresponding data point in that row.

$\begin{array}{ccc|l} x_1 & x_2 & y & \text{constraints} \\ \hline 0 & 0 & -1 & -b \ge 1 \\ 0 & 2 & -1 & -(2w_2 + b) \ge 1 \\ 2 & 2 & -1 & -(2w_1 + 2w_2 + b) \ge 1 \\ 2 & 0 & +1 & 2w_1 + b \ge 1 \\ 3 & 0 & +1 & 3w_1 + b \ge 1 \end{array}$

Once the data has been plotted, SVM tries to find the line that maximizes the margin between the classes. The margin is the distance between the line and the closest data points from each class. The line that maximizes the margin is known as the "maximum margin hyperplane".

Example dataset

From the graph, we can easily identify the points that lie closest to the boundary between the two classes. Those points are the support vectors.

Optimal hyperplane

Combining constraints $(i)$ and $(iv)$ gives:

$w_1 \ge 1$

Combining constraints $(iii)$ and $(iv)$ gives:

$w_2 \le -1$

This means that $\frac{1}{2}(w_1^2 + w_2^2) \ge 1$, with equality when $w_1 = 1$ and $w_2 = -1$.

We can verify that

$b^* = -1$, $w_1^* = 1$, $w_2^* = -1$ satisfies all five constraints, minimizes $\frac{1}{2}(w_1^2 + w_2^2)$, and gives the optimal hyperplane as a result. The optimal hyperplane is shown in the following chart:

Optimization problem

Optimal Hyperplane:

$g(x) = \text{sign}(x_1 - x_2 - 1)$

The error function in this example is $w_1^2 + w_2^2 = (1)^2 + (-1)^2 = 2$. To find the margin, we divide 1 by the square root of the error function, which gives $\frac{1}{\lVert \vec{w}^* \rVert}$.

$\text{margin: } \frac{1}{\lVert \vec{w}^* \rVert} = \frac{1}{\sqrt{2}} \approx 0.707$

The total width of the margin, from the negative hyperplane to the positive hyperplane, is $\frac{2}{\lVert \vec{w}^* \rVert} \approx 1.414$.

Data points $(i)$, $(iii)$, and $(iv)$ are boxed because their separation constraints are exactly met: $y_n(\vec{w}^{*\intercal}\vec{x}_n + b^*) = 1$

For a data point that meets its constraint exactly, $\text{dist}(x_n, g) = \frac{1}{\lVert \vec{w}^* \rVert}$. These data points prevent the positive and negative hyperplanes from spreading further, which is why they are called support vectors.
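As a quick sanity check (a sketch assuming NumPy is installed), the snippet below re-evaluates the five constraints and the point-to-hyperplane distances for the toy data with $b^* = -1$, $w_1^* = 1$, $w_2^* = -1$:

```python
import numpy as np

# Toy data and the candidate solution from the text
X = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0])
w = np.array([1.0, -1.0])
b = -1.0

margins = y * (X @ w + b)                # each value must be >= 1
print(margins)                           # [1. 3. 1. 1. 2.]; rows (i), (iii), (iv) are tight

distances = margins / np.linalg.norm(w)  # distance of each point to the hyperplane
print(distances)                         # support vectors sit at ~0.707 = 1 / ||w*||
```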

When the error function is large, the margin is small. This results in a poor classifier, as the distance from the optimal hyperplane to the positive/negative hyperplane is small, which can result in data points falling inside the margin.

Conclusion

  • Linear SVM is a widely used algorithm for binary classification tasks.

  • Hard-margin SVM is a variant of linear SVM that aims to find a hyperplane that perfectly separates two classes without any classification errors.

  • Hard-margin SVM is sensitive to outliers and noise in the data and may not perform well if the data is not linearly separable or if there are too many outliers.

  • The optimization problem for hard-margin SVM involves minimizing the norm of the weight vector subject to the constraint that all data points are correctly classified.
