
Backpropagation


Backpropagation is a key algorithm in neural network training that adjusts the weights of the network based on the error between predicted and actual outputs. It propagates this error backward from the output to the input layer, enabling gradient descent optimization to adjust the weights.

This process allows deep learning models to learn and adapt to complex data patterns. In this topic, we will explore how backpropagation works.

Classification with a Neural Network

In neural networks, forward propagation is the process where information moves forward through the network from the input layer to the output layer. Each layer of nodes performs specific operations on the inputs it receives and passes the results on to the next layer. In a 2-2-1 network, there are three layers: an input layer with two nodes, a hidden layer with two nodes, and an output layer with one node. This setup is often used for binary classification tasks.

Now, let's look into the step-by-step process of forward propagation in such a network.

Initially, the input data, which comprises two values given the structure of the network, is fed into the two neurons in the input layer. Each input neuron is associated with a weight for each connection it has to the neurons in the hidden layer. In this case, there are four weights in total connecting the input layer to the hidden layer. Let's denote the weights as $w_{11}, w_{12}, w_{21}, w_{22}$, where the first index denotes the input neuron and the second index denotes the hidden neuron the weight is connected to.

The first step in forward propagation is to compute the weighted sum of the inputs for each neuron in the hidden layer. This is done by multiplying each input value by its corresponding weight and summing these products. The equations for these weighted sums are:

$z_1 = x_1 w_{11} + x_2 w_{21} + b_1$
$z_2 = x_1 w_{12} + x_2 w_{22} + b_2$

Here, $x_1$ and $x_2$ are the input values, and $z_1$ and $z_2$ are the weighted sums for the first and second neurons in the hidden layer, respectively.

The next step is to apply an activation function to these weighted sums to obtain the activation values of the neurons in the hidden layer. In this case, we'll use the sigmoid function as the activation, which is defined as:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

Applying this function to the weighted sums, we get:

$a_1 = \sigma(z_1)$
$a_2 = \sigma(z_2)$

Now, $a_1$ and $a_2$ are the activation values of the first and second neurons in the hidden layer, respectively.

The process is then repeated for the connections between the hidden and output layers. The weighted sum for the neuron in the output layer is computed as:

$z = a_1 w_1 + a_2 w_2 + b$

Here, $w_1$ and $w_2$ are the weights connecting the hidden layer to the output layer, and $b$ is the bias of the output neuron.

$\hat{y} = \sigma(z)$

This completes the forward propagation process, resulting in an output value $\hat{y}$, which is then used in computing the loss and subsequently in the backpropagation process to adjust the weights of the network.
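As a concrete illustration, here is a minimal sketch of this forward pass in plain Python. The input values and weights are hypothetical, chosen only to show the computation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs and parameters for the 2-2-1 network (illustrative values)
x1, x2 = 0.5, 0.8
w11, w12, w21, w22 = 0.1, 0.4, -0.2, 0.3   # input -> hidden weights
b1, b2 = 0.0, 0.0                           # hidden-layer biases
w1, w2, b = 0.6, -0.1, 0.0                  # hidden -> output weights and bias

# Hidden layer: weighted sums, then sigmoid activations
z1 = x1 * w11 + x2 * w21 + b1
z2 = x1 * w12 + x2 * w22 + b2
a1, a2 = sigmoid(z1), sigmoid(z2)

# Output layer: weighted sum of the activations, then sigmoid
z = a1 * w1 + a2 * w2 + b
y_hat = sigmoid(z)
print(y_hat)  # a probability between 0 and 1
```

Because the sigmoid squashes its input into (0, 1), the output $\hat{y}$ can be read as a class-1 probability.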

Loss function

Once the neural network generates a predicted output $\hat{y}$ using forward propagation, the next step is to measure how well the network's predictions align with the actual target values. A common way to do this in binary classification tasks is by employing the log loss (or logistic loss) function.

Log loss provides a measure of error between the predicted probabilities generated by the model and the true labels. The formula for log loss is given by:

$L(y, \hat{y}) = -(y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y}))$

Here, $y$ represents the true label of the data point, which is either 0 or 1 in binary classification, and $\hat{y}$ is the predicted probability that the data point belongs to class 1, as obtained from the forward propagation process. The negative sign at the beginning ensures that the error is positive, and the sum of the two terms provides a single scalar value representing the error.
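To make the formula concrete, here is a small sketch of log loss for a single example. The probability values are made up for illustration:

```python
import math

def log_loss(y, y_hat):
    # Binary cross-entropy (log loss) for a single data point
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# A confident correct prediction yields a small loss,
# while a confident wrong prediction yields a large one.
print(log_loss(1, 0.9))  # small error
print(log_loss(1, 0.1))  # large error
```

Note that the loss grows without bound as $\hat{y}$ approaches the wrong extreme, which strongly penalizes confident mistakes.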

Minimizing log loss

The goal of backpropagation is to adjust each weight and bias in order to reduce the loss function $L(y, \hat{y})$. Therefore, we need to see how each of these weights affects the loss. Suppose we want to adjust the weight $w_{11}$; we need to find the derivative of $L$ with respect to $w_{11}$.

$\frac{\partial L}{\partial w_{11}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{11}}$

$\frac{\partial L}{\partial w_{11}} = \frac{-(y - \hat{y})}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y}) \cdot w_1 \cdot a_1(1 - a_1) \cdot x_1$
$\frac{\partial L}{\partial w_{11}} = -x_1 w_1 a_1 (1 - a_1)(y - \hat{y})$

The derivative of the sigmoid function, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, determines the local gradient in the backpropagation process.
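We can sanity-check this identity numerically with a central finite difference; the evaluation point $z = 0.7$ is an arbitrary choice:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.7                                                # arbitrary test point
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))               # sigma'(z) identity
print(numeric, analytic)  # the two values agree closely
```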

Let's look into how we formed this chain rule:

  1. Loss to output ($\frac{\partial L}{\partial \hat{y}}$): from the loss equation, the loss $L$ depends on the predicted output $\hat{y}$, so this term measures how the loss changes as the predicted output changes.
  2. Output to pre-activation of the output layer ($\frac{\partial \hat{y}}{\partial z}$): from the prediction equation, $\hat{y}$ depends on the pre-activation of the output layer $z$, so this term measures how $\hat{y}$ changes as $z$ changes.
  3. Pre-activation of the output layer to activation of the hidden layer ($\frac{\partial z}{\partial a_1}$): from the equation for $z$, it depends on the hidden-layer activation $a_1$, so this term measures how $z$ changes as $a_1$ changes.
  4. Activation of the hidden layer to pre-activation of the hidden layer ($\frac{\partial a_1}{\partial z_1}$): from the equation for $a_1$, it depends on the hidden-layer pre-activation $z_1$, so this term measures how $a_1$ changes as $z_1$ changes.
  5. Pre-activation of the hidden layer to the weight ($\frac{\partial z_1}{\partial w_{11}}$): finally, from the equation for $z_1$, it depends on the weight $w_{11}$, so this term measures how $z_1$ changes as $w_{11}$ changes.

These partial derivatives tell us exactly in which direction to move each of the weights and biases to reduce the loss. We compute the corresponding derivative for each parameter $(w_{11}, w_{12}, w_{21}, w_{22}, b_1, b_2)$ in the same way in order to reduce the log loss.
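As a sketch, we can verify the closed-form gradient for $w_{11}$ against a finite-difference estimate of the loss. All parameter values below are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs, label, and parameters (illustrative values only)
x1, x2, y = 0.5, 0.8, 1.0
w12, w21, w22, b1, b2 = 0.4, -0.2, 0.3, 0.05, -0.05
w1, w2, b = 0.6, -0.1, 0.0
w11 = 0.1

def forward(w11):
    # Forward pass, returning the hidden activation a1 and the prediction
    a1 = sigmoid(x1 * w11 + x2 * w21 + b1)
    a2 = sigmoid(x1 * w12 + x2 * w22 + b2)
    y_hat = sigmoid(a1 * w1 + a2 * w2 + b)
    return a1, y_hat

def loss(w11):
    _, y_hat = forward(w11)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Closed-form chain-rule gradient: dL/dw11 = -x1 * w1 * a1 * (1 - a1) * (y - y_hat)
a1, y_hat = forward(w11)
analytic = -x1 * w1 * a1 * (1 - a1) * (y - y_hat)

# Finite-difference estimate of the same derivative
h = 1e-6
numeric = (loss(w11 + h) - loss(w11 - h)) / (2 * h)
print(analytic, numeric)  # the two estimates agree closely
```

This kind of gradient check is a common way to catch mistakes in a hand-derived chain rule.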

Try to relate each factor to the corresponding forward propagation equation when forming the chain rule.

Backpropagation

So now you know how to calculate the derivative of the log loss with respect to all the weights and biases of the neural network. Next, we will see how to use these derivatives together with gradient descent to train the network. In backpropagation, we move from the output layer to the input layer.

Here, $w = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$, $w_1 = \begin{bmatrix} w_{11} \\ w_{21} \end{bmatrix}$, $w_2 = \begin{bmatrix} w_{12} \\ w_{22} \end{bmatrix}$ represent the weights of each neuron. We need to calculate $\frac{\partial L}{\partial \hat{y}}$, $\frac{\partial L}{\partial w_1}$, $\frac{\partial L}{\partial w_2}$, $\frac{\partial L}{\partial w_{11}}$, $\frac{\partial L}{\partial w_{21}}$, $\frac{\partial L}{\partial w_{12}}$, $\frac{\partial L}{\partial w_{22}}$, and the derivatives of the loss with respect to the biases, because these tell us how to move each weight and bias.

Let's denote the weights of layer one as $w^{[1]}$ and the weights of layer two as $w^{[2]}$. Similarly, we can write $b^{[1]}$ and $b^{[2]}$. Here, the superscript denotes the layer number.

Let's take a look at how the weights are updated in the output layer:

  1. Calculate the partial derivative of each parameter:
$\frac{\partial L}{\partial w^{[2]}} = \frac{\partial L}{\partial a^{[2]}} \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial w^{[2]}}$
$\frac{\partial L}{\partial b^{[2]}} = \frac{\partial L}{\partial a^{[2]}} \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial b^{[2]}}$
Note that $\frac{\partial L}{\partial a^{[2]}}$ and $\frac{\partial L}{\partial \hat{y}}$ are the same.
  2. Use gradient descent to update the parameters, with a learning rate $\alpha$ to prevent too large a step:
$w_i \leftarrow w_i - \alpha \frac{\partial L}{\partial w^{[2]}}$
$b_i \leftarrow b_i - \alpha \frac{\partial L}{\partial b^{[2]}}$

The learning rate in gradient descent determines the size of the steps taken towards the minimum of the loss function. A proper learning rate ensures a balance between fast convergence and the risk of overshooting the minimum or getting stuck in local minima, making it a key factor in efficient model training.
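A toy sketch of this effect on the one-dimensional loss $L(w) = w^2$; the step sizes below are chosen only to illustrate convergence versus overshooting:

```python
def grad(w):
    # Gradient of the toy loss L(w) = w**2
    return 2 * w

def run(alpha, steps=20, w=1.0):
    # Repeatedly step against the gradient with learning rate alpha
    for _ in range(steps):
        w -= alpha * grad(w)
    return w

print(run(0.1))  # small steps: w shrinks toward the minimum at 0
print(run(1.1))  # oversized steps: w overshoots and diverges
```

Each update multiplies $w$ by $(1 - 2\alpha)$, so the iteration converges only when that factor has magnitude below 1.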

Now let's take a closer look into neuron 2 of layer 1.

An illustration for the local, upstream, and downstream gradients

The upstream gradient refers to the gradient of the loss function with respect to the output of a particular layer in the network. This gradient is "upstream" in the sense that it comes from the higher (later) layers in the network, moving backward (or upstream) towards the input layer. In the case of the final output layer, the upstream gradient is usually the derivative of the loss function with respect to the actual output of the network.

The local gradient is the gradient of the output of a layer with respect to its input. It is "local" because it is computed at each layer, based on the layer's own parameters and activation function. This gradient is obtained by differentiating the activation function used in the layer (e.g., sigmoid) with respect to the input to that layer.

The downstream gradient is the gradient that is passed to the lower (earlier) layers in the network during backpropagation. This gradient is the result of the upstream gradient being propagated backward through the network, typically after being multiplied by the local gradient. The downstream gradient becomes the upstream gradient for the next (earlier) layer in the network.
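The three gradients combine by simple multiplication. A minimal sketch for one sigmoid neuron, with arbitrary example numbers:

```python
def sigmoid_local_grad(a):
    # Local gradient of the sigmoid, expressed via its activation a
    return a * (1 - a)

upstream = 0.8   # dL/da arriving from the later layer (arbitrary value)
a = 0.6          # this neuron's activation (arbitrary value)

# downstream = upstream * local; this value is then passed to the earlier layer,
# where it becomes that layer's upstream gradient
downstream = upstream * sigmoid_local_grad(a)
print(downstream)
```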

Now let's see how the layer 1 weights are updated. When calculating the partial derivatives of layer 1, you will notice that $\frac{\partial L}{\partial a^{[2]}}$ and $\frac{\partial a^{[2]}}{\partial z^{[2]}}$ were already calculated in the output layer, so we don't need to recompute these values.

$\frac{\partial L}{\partial w^{[1]}} = \frac{\partial L}{\partial a^{[2]}} \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial z^{[1]}} \cdot \frac{\partial z^{[1]}}{\partial w^{[1]}}$
$\frac{\partial L}{\partial b^{[1]}} = \frac{\partial L}{\partial a^{[2]}} \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial z^{[1]}} \cdot \frac{\partial z^{[1]}}{\partial b^{[1]}}$

Finally, we apply gradient descent to update the parameters:

$w_i \leftarrow w_i - \alpha \frac{\partial L}{\partial w^{[1]}}$
$b_i \leftarrow b_i - \alpha \frac{\partial L}{\partial b^{[1]}}$

Conclusion

Here is a summary of backpropagation:

  1. Compute the output error: Calculate the error at the output layer (difference between predicted output and actual target).
  2. Backpropagate the gradient:
    • Start with the gradient of the loss function with respect to the output (upstream gradient).
    • For each layer, moving backward:
      • Compute the local gradient at the layer.
      • Multiply the upstream gradient with the local gradient to obtain the downstream gradient.
      • Propagate this downstream gradient to the previous layer, where it becomes the upstream gradient for that layer.
  3. Update weights: Use the gradients to update the weights in each layer, usually with a method like gradient descent.
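Putting the whole procedure together, here is a minimal sketch that trains the 2-2-1 network on a toy OR-gate dataset. The initialization, learning rate, and epoch count are arbitrary choices for illustration:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
# 2-2-1 network: W1[i][j] connects input i to hidden neuron j
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b1 = [0.0, 0.0]
W2 = [random.uniform(-1, 1) for _ in range(2)]  # hidden -> output weights
b2 = 0.0

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # OR gate
alpha = 0.5  # learning rate

def total_loss():
    s = 0.0
    for x, y in data:
        a = [sigmoid(x[0] * W1[0][j] + x[1] * W1[1][j] + b1[j]) for j in range(2)]
        y_hat = sigmoid(a[0] * W2[0] + a[1] * W2[1] + b2)
        s += -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return s

before = total_loss()
for _ in range(2000):
    for x, y in data:
        # Forward pass
        a = [sigmoid(x[0] * W1[0][j] + x[1] * W1[1][j] + b1[j]) for j in range(2)]
        y_hat = sigmoid(a[0] * W2[0] + a[1] * W2[1] + b2)
        # Backward pass: for sigmoid + log loss, dL/dz at the output is (y_hat - y)
        dz2 = y_hat - y
        dz1 = [dz2 * W2[j] * a[j] * (1 - a[j]) for j in range(2)]  # hidden layer
        # Gradient-descent updates
        for j in range(2):
            W2[j] -= alpha * dz2 * a[j]
            b1[j] -= alpha * dz1[j]
            for i in range(2):
                W1[i][j] -= alpha * dz1[j] * x[i]
        b2 -= alpha * dz2
after = total_loss()
print(before, after)  # the loss drops as training proceeds
```

Note how the hidden-layer gradients `dz1` reuse the output-layer term `dz2`, mirroring the reuse of partial derivatives described above.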