Backpropagation is a key algorithm in neural network training that adjusts the weights of the network based on the error between predicted and actual outputs. It propagates this error backward from the output to the input layer, enabling gradient descent optimization to adjust the weights.
This process allows deep learning models to learn and adapt to complex data patterns. In this topic, we will explore how backpropagation works.
Classification with a Neural Network
In neural networks, forward propagation is the process where information moves forward through the network from the input layer to the output layer. Each layer of nodes performs specific operations on the inputs it receives and passes the results on to the next layer. In a 2, 2, 1 network, there are three layers: an input layer with two nodes, a hidden layer with two nodes, and an output layer with one node. This setup is often used for binary classification tasks.
Now, let's look into the step-by-step process of forward propagation in such a network.
Initially, the input data, which comprises two values $x_1$ and $x_2$ given the structure of the network, is fed into the two neurons in the input layer. Each input neuron is associated with a weight for each connection it has to the neurons in the hidden layer. In this case, there are four weights in total connecting the input layer to the hidden layer. Let's denote the weights as $w_{ij}$, where the first index $i$ denotes the input neuron and the second index $j$ denotes the hidden neuron the weight is connected to.
The first step in forward propagation is to compute the weighted sum of the inputs for each neuron in the hidden layer. This is done by multiplying each input value by its corresponding weight, summing these products, and adding a bias term. The equations for these weighted sums are:

$$z_1 = w_{11} x_1 + w_{21} x_2 + b_1$$
$$z_2 = w_{12} x_1 + w_{22} x_2 + b_2$$

Here, $x_1$ and $x_2$ are the input values, $b_1$ and $b_2$ are the bias terms, and $z_1$ and $z_2$ are the weighted sums for the first and second neurons in the hidden layer, respectively.
The next step is to apply an activation function to these weighted sums to obtain the activation values of the neurons in the hidden layer. In this case, we'll use the sigmoid function as the activation, which is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Applying this function to the weighted sums, we get:

$$a_1 = \sigma(z_1), \qquad a_2 = \sigma(z_2)$$

Now, $a_1$ and $a_2$ are the activation values of the first and second neurons in the hidden layer, respectively.
The process is then repeated for the connections between the hidden and output layers. The weighted sum for the neuron in the output layer is computed as:

$$z_3 = v_1 a_1 + v_2 a_2 + b_3$$

Here, $v_1$ and $v_2$ are the weights connecting the hidden layer to the output layer, and $b_3$ is the output bias. Applying the sigmoid activation to this weighted sum gives the predicted output:

$$\hat{y} = \sigma(z_3)$$

This completes the forward propagation process, resulting in an output value $\hat{y}$, which is then used in computing the loss and subsequently in the backpropagation process to adjust the weights of the network.
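The forward pass described above can be sketched in a few lines of Python. The parameter values here are illustrative placeholders, not taken from the article:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values for a 2, 2, 1 network (assumed, not from the article)
x1, x2 = 0.5, 0.8                           # inputs
w11, w12, w21, w22 = 0.1, 0.4, -0.2, 0.3    # input -> hidden weights
b1, b2 = 0.0, 0.0                           # hidden-layer biases
v1, v2 = 0.6, -0.5                          # hidden -> output weights
b3 = 0.0                                    # output bias

# Hidden layer: weighted sums, then sigmoid activations
z1 = w11 * x1 + w21 * x2 + b1
z2 = w12 * x1 + w22 * x2 + b2
a1, a2 = sigmoid(z1), sigmoid(z2)

# Output layer: weighted sum of hidden activations, then sigmoid
z3 = v1 * a1 + v2 * a2 + b3
y_hat = sigmoid(z3)
print(y_hat)
```

Because the final activation is a sigmoid, `y_hat` always lands in (0, 1), which is what lets us interpret it as a class-1 probability in the loss section below.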
Loss function
Once the neural network generates a predicted output using forward propagation, the next step is to measure how well the network's predictions align with the actual target values. A common way to do this in binary classification tasks is by employing the log loss (or logistic loss) function.
Log loss provides a measure of error between the predicted probabilities generated by the model and the true labels. The formula for log loss is given by:

$$L = -\left( y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right)$$

Here, $y$ represents the true label of the data point, which is either 0 or 1 in binary classification, and $\hat{y}$ is the predicted probability that the data point belongs to class 1, as obtained from the forward propagation process. The negative sign at the beginning ensures that the error is positive, and the sum of the two terms provides a single scalar value representing the error.
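The formula translates directly into code. The small `eps` clipping value is a common safeguard against taking the log of zero and is an implementation detail, not part of the formula itself:

```python
import math

def log_loss(y_true, y_pred, eps=1e-12):
    # Clip the prediction away from 0 and 1 to avoid log(0)
    y_pred = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

print(log_loss(1, 0.9))   # confident and correct -> small loss
print(log_loss(1, 0.1))   # confident and wrong -> large loss
```

Note how the loss grows sharply as a confident prediction moves away from the true label, which is exactly the behavior we want the optimizer to penalize.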
Minimizing log loss
The goal of backpropagation is to adjust each weight and bias in order to reduce the loss function $L$. Therefore, we need to see how each one of these weights affects the loss. Let's say we want to adjust the weight $w_{11}$. So we need to find the derivative of $L$ with respect to $w_{11}$, which the chain rule lets us write as:

$$\frac{\partial L}{\partial w_{11}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_3} \cdot \frac{\partial z_3}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{11}}$$

Let's look at how we formed this chain rule:
- Loss to output ($\partial L / \partial \hat{y}$): From the loss equation we can see that the loss depends on the predicted output $\hat{y}$. Here we are finding how the loss changes as the predicted output changes.
- Output to pre-activation of output layer ($\partial \hat{y} / \partial z_3$): From the predicted output equation $\hat{y} = \sigma(z_3)$ we can see that $\hat{y}$ depends on the pre-activation of the output layer, $z_3$. Here we are finding how $\hat{y}$ changes as $z_3$ changes.
- Pre-activation of output layer to activation of hidden layer ($\partial z_3 / \partial a_1$): From the pre-activation equation of the output layer we can see that $z_3$ depends on the activation of the hidden layer, $a_1$. Here we are finding how $z_3$ changes as $a_1$ changes.
- Activation of hidden layer to pre-activation of hidden layer ($\partial a_1 / \partial z_1$): From the equation $a_1 = \sigma(z_1)$ we can see that $a_1$ depends on the pre-activation of the hidden layer, $z_1$. Here we are finding how $a_1$ changes as $z_1$ changes.
- Pre-activation of hidden layer to weight ($\partial z_1 / \partial w_{11}$): Finally, from the pre-activation equation of the hidden layer we can see that $z_1$ depends on the weight $w_{11}$. Here we are finding how $z_1$ changes as $w_{11}$ changes.
These partial derivatives tell us exactly in which direction to move each one of the weights and biases to reduce the loss. So that's exactly what we're going to do: we're going to calculate each one of these partial derivatives, and the derivatives with respect to the other weights and biases are obtained in the same way.
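Each factor of the chain rule can be evaluated numerically. The sketch below uses the same illustrative parameter values as before (assumed, not from the article), runs the forward pass, and then computes the five factors one by one:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass with illustrative values (assumed, not from the article)
x1, x2, y = 0.5, 0.8, 1.0
w11, w12, w21, w22 = 0.1, 0.4, -0.2, 0.3
v1, v2 = 0.6, -0.5
b1 = b2 = b3 = 0.0

z1 = w11 * x1 + w21 * x2 + b1
z2 = w12 * x1 + w22 * x2 + b2
a1, a2 = sigmoid(z1), sigmoid(z2)
z3 = v1 * a1 + v2 * a2 + b3
y_hat = sigmoid(z3)

# The five factors of the chain rule for dL/dw11
dL_dyhat  = -(y / y_hat) + (1 - y) / (1 - y_hat)   # loss -> output
dyhat_dz3 = y_hat * (1 - y_hat)                    # sigmoid derivative
dz3_da1   = v1                                     # pre-activation -> hidden activation
da1_dz1   = a1 * (1 - a1)                          # sigmoid derivative
dz1_dw11  = x1                                     # pre-activation -> weight

dL_dw11 = dL_dyhat * dyhat_dz3 * dz3_da1 * da1_dz1 * dz1_dw11

# Sanity check: the first two factors multiply to (y_hat - y)
assert abs(dL_dyhat * dyhat_dz3 - (y_hat - y)) < 1e-9
print(dL_dw11)
```

A useful habit when deriving gradients by hand is to verify them with a finite-difference check: nudge $w_{11}$ slightly, rerun the forward pass, and confirm the loss changes at the predicted rate.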
Backpropagation
So now you know how to calculate the derivative of the log loss with respect to all the weights and biases of the neural network. We will see how to use this and gradient descent to train the neural network. In backpropagation, we will move from the output layer to the input layer.
Here, $w_{11}$, $w_{12}$, $w_{21}$, $w_{22}$, $v_1$, and $v_2$ represent the weights of each neuron. We need to calculate $\frac{\partial L}{\partial w_{11}}$, $\frac{\partial L}{\partial w_{12}}$, $\frac{\partial L}{\partial w_{21}}$, $\frac{\partial L}{\partial w_{22}}$, $\frac{\partial L}{\partial v_1}$, $\frac{\partial L}{\partial v_2}$, and the derivatives of the loss with respect to the biases, because that's what tells us how to move each one of the weights and biases.
Let's denote the weights of layer one as $W^{[1]}$ and the weights of layer two as $W^{[2]}$. Similarly, we can write the biases as $b^{[1]}$ and $b^{[2]}$. Here the superscript denotes the layer number.
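Under this layer-wise notation, the parameter shapes for the 2, 2, 1 network can be sketched as follows (the exact array layout is an assumption for illustration):

```python
import numpy as np

# Layer 1: 2 inputs -> 2 hidden neurons; entry [i, j] is the weight
# from input neuron i to hidden neuron j (w_ij in the earlier notation)
W1 = np.zeros((2, 2))
b1 = np.zeros(2)

# Layer 2: 2 hidden neurons -> 1 output neuron (v1, v2 earlier)
W2 = np.zeros(2)
b2 = 0.0
```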
Let's take a look at how the weights are updated in the output layer:

- Calculate the partial derivative of each parameter. The first two factors, $\frac{\partial L}{\partial \hat{y}}$ and $\frac{\partial \hat{y}}{\partial z_3}$, are the same for both $W^{[2]}$ and $b^{[2]}$, and with the sigmoid activation and log loss their product simplifies to $\hat{y} - y$:

$$\frac{\partial L}{\partial W^{[2]}} = (\hat{y} - y)\, a, \qquad \frac{\partial L}{\partial b^{[2]}} = \hat{y} - y$$

where $a = (a_1, a_2)$ is the vector of hidden activations.

- Use gradient descent to update the parameters, scaling the step by a learning rate $\alpha$ to prevent taking too big a step:

$$W^{[2]} := W^{[2]} - \alpha \frac{\partial L}{\partial W^{[2]}}, \qquad b^{[2]} := b^{[2]} - \alpha \frac{\partial L}{\partial b^{[2]}}$$
The learning rate in gradient descent determines the size of the steps taken towards the minimum of the loss function. A proper learning rate ensures a balance between fast convergence and the risk of overshooting the minimum or getting stuck in local minima, making it a key factor in efficient model training.
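The output-layer update can be sketched numerically. The hidden activations, prediction, and learning rate below are assumed values for illustration:

```python
import numpy as np

# Assumed illustrative values for the output-layer update
a = np.array([0.47, 0.61])      # hidden activations a1, a2
W2 = np.array([0.6, -0.5])      # output-layer weights (v1, v2)
b2_out = 0.0                    # output-layer bias
y, y_hat = 1.0, 0.49            # target and prediction
alpha = 0.1                     # learning rate

# With sigmoid + log loss, dL/dz3 simplifies to (y_hat - y)
dL_dz3 = y_hat - y
dL_dW2 = dL_dz3 * a             # chain through z3 = W2 . a + b
dL_db2 = dL_dz3

# Gradient-descent step, scaled by the learning rate
W2 = W2 - alpha * dL_dW2
b2_out = b2_out - alpha * dL_db2
print(W2, b2_out)
```

Since the target here is 1 and the prediction is below 0.5, the gradient is negative and both weights are nudged upward, pushing the next prediction closer to the target.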
Now let's take a closer look at neuron 2 of layer 1.
The upstream gradient refers to the gradient of the loss function with respect to the output of a particular layer in the network. This gradient is "upstream" in the sense that it comes from the higher (later) layers in the network, moving backward (or upstream) towards the input layer. In the case of the final output layer, the upstream gradient is usually the derivative of the loss function with respect to the actual output of the network.
The local gradient is the gradient of the output of a layer with respect to its input. It is "local" because it is computed at each layer, based on the layer's own parameters and activation function. This gradient is obtained by differentiating the activation function used in the layer (e.g., sigmoid) with respect to the input to that layer.
The downstream gradient is the gradient that is passed to the lower (earlier) layers in the network during backpropagation. This gradient is the result of the upstream gradient being propagated backward through the network, typically after being multiplied by the local gradient. The downstream gradient becomes the upstream gradient for the next (earlier) layer in the network.
Now let's see how layer 1 weights are updated. When calculating the partial derivatives of layer 1, you will notice that $\frac{\partial L}{\partial \hat{y}}$ and $\frac{\partial \hat{y}}{\partial z_3}$ were already calculated previously in the output layer. Therefore, we don't need to recompute these values.
Finally, we will apply gradient descent to update the parameters.
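The layer 1 update ties the upstream, local, and downstream gradients together. This sketch uses the same assumed illustrative values as earlier and labels each gradient with the terminology introduced above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed illustrative values (not from the article)
x = np.array([0.5, 0.8])              # inputs x1, x2
W1 = np.array([[0.1, 0.4],            # row i: weights from input i
               [-0.2, 0.3]])          # to hidden neurons 1 and 2
b1 = np.zeros(2)
W2 = np.array([0.6, -0.5])            # hidden -> output weights
b2 = 0.0
y = 1.0
alpha = 0.1                           # learning rate

# Forward pass
z_hidden = x @ W1 + b1
a_hidden = sigmoid(z_hidden)
y_hat = sigmoid(a_hidden @ W2 + b2)

# Output error, already computed during the layer-2 update, is reused here
dL_dz3 = y_hat - y

# Layer 1 gradients: upstream * local = downstream
upstream = dL_dz3 * W2                # dL/da_hidden, arriving from the output layer
local = a_hidden * (1 - a_hidden)     # sigmoid derivative at the hidden layer
downstream = upstream * local         # dL/dz_hidden, passed toward the inputs

dL_dW1 = np.outer(x, downstream)      # dL/dW1[i, j] = x_i * downstream_j
dL_db1 = downstream

# Gradient-descent update for layer 1
W1 = W1 - alpha * dL_dW1
b1 = b1 - alpha * dL_db1
print(W1, b1)
```

Notice that `dL_dz3` is computed once and shared between the layer 2 and layer 1 updates, which is the reuse the paragraph above points out.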
Conclusion
Here is a summary of backpropagation:
- Compute the output error: Calculate the error at the output layer (difference between predicted output and actual target).
- Backpropagate the gradient:
  - Start with the gradient of the loss function with respect to the output (upstream gradient).
  - For each layer, moving backward:
    - Compute the local gradient at the layer.
    - Multiply the upstream gradient with the local gradient to obtain the downstream gradient.
    - Propagate this downstream gradient to the previous layer, where it becomes the upstream gradient for that layer.
- Update weights: Use the gradients to update the weights in each layer, usually with a method like gradient descent.
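The steps above can be combined into a complete training loop. This is a minimal sketch of the 2, 2, 1 network trained on a toy dataset (the AND function); the dataset, initialization, learning rate, and epoch count are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: the logical AND of two inputs (assumed for illustration)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])

# 2, 2, 1 network with small random initial weights
W1 = rng.normal(scale=0.5, size=(2, 2)); b1 = np.zeros(2)
W2 = rng.normal(scale=0.5, size=2);      b2 = 0.0
alpha = 0.5

for epoch in range(5000):
    for xi, yi in zip(X, y):
        # Forward pass
        a1 = sigmoid(xi @ W1 + b1)
        y_hat = sigmoid(a1 @ W2 + b2)
        # Backward pass: output error, then upstream * local -> downstream
        dz2 = y_hat - yi                     # output error (sigmoid + log loss)
        dz1 = (dz2 * W2) * a1 * (1 - a1)     # downstream gradient for layer 1
        # Gradient-descent updates
        W2 -= alpha * dz2 * a1;  b2 -= alpha * dz2
        W1 -= alpha * np.outer(xi, dz1);  b1 -= alpha * dz1

preds = [sigmoid(sigmoid(xi @ W1 + b1) @ W2 + b2) for xi in X]
print(np.round(preds, 2))
```

After training, the predictions should be close to 1 only for the input (1, 1), showing that repeated forward passes, backpropagated gradients, and weight updates are enough to fit this simple pattern.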