Weights are among the main parameters of any neural network: they control the connection strength between pairs of neurons and are learned over the course of training. But how do we choose the initial set of weights so that the network learns different features and training is effective, or happens at all?
This topic covers some of the most common weight initialization schemes and their nuances.
Constant initialization
We have to start somewhere. One of the first attempts might be to initialize all weights uniformly to zero or some constant number.
Let's see what happens with zero initialization in a simple network with one hidden layer. We will consider two activation functions for the hidden layer, one at a time (so each version of the network uses a single activation throughout its hidden layer): the sigmoid ($\sigma(z) = \frac{1}{1 + e^{-z}}$) and the commonly used ReLU ($\mathrm{ReLU}(z) = \max(0, z)$).
First, let’s see what happens to the neuron inputs in the hidden layer ($z_j$) in the general form, assuming the bias is also 0:

$$z_j = \sum_{i} w_{ij} x_i + b_j = \sum_{i} 0 \cdot x_i + 0 = 0$$
As you can see, the inputs will be 0.
In the case of ReLU, the output ($a_j = \max(0, z_j)$) will be 0 during the forward pass since $\max(0, 0) = 0$. Let’s take a look at the gradient of the loss $L$ with respect to a weight $w_{ij}$ (assuming the convention $\mathrm{ReLU}'(0) = 0$):

$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ij}} = \frac{\partial L}{\partial a_j} \cdot 0 \cdot x_i = 0$$
Next, we can update $w_{ij}$ (take the step in the direction of the negative gradient):

$$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial L}{\partial w_{ij}} = w_{ij} - \eta \cdot 0 = w_{ij}$$
So, in this case, we are not going anywhere (no update is happening). The weights will stay the same, and no learning actually takes place.
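To make this concrete, here is a minimal NumPy sketch of one forward and backward pass through a zero-initialized network with one hidden ReLU layer (the layer sizes, the input, and the squared-error loss are arbitrary choices for illustration):

```python
import numpy as np

# A minimal sketch: one hidden ReLU layer, all weights initialized to zero.
rng = np.random.default_rng(0)
x = rng.normal(size=3)        # a single input example
y = 1.0                       # its (arbitrary) target value

W1 = np.zeros((4, 3))         # input -> hidden weights, zero-initialized
w2 = np.zeros(4)              # hidden -> output weights, zero-initialized

# Forward pass
z = W1 @ x                    # hidden pre-activations: all zeros
a = np.maximum(0.0, z)        # ReLU outputs: all zeros
y_hat = w2 @ a                # prediction: zero
loss = 0.5 * (y_hat - y) ** 2

# Backward pass (using the convention ReLU'(0) = 0)
d_y_hat = y_hat - y
d_w2 = d_y_hat * a            # zero, because a is all zeros
d_a = d_y_hat * w2            # zero, because w2 is all zeros
d_z = d_a * (z > 0)           # zero either way
d_W1 = np.outer(d_z, x)       # zero

print(d_W1, d_w2)             # all gradients are zero: no update will happen
```

Every gradient comes out as zero, so gradient descent leaves the weights untouched no matter how many steps we take.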
In the case of the sigmoid, the neuron inputs are still 0, but the neuron outputs are 0.5 (recall that $\sigma(0) = 0.5$). Again, let’s look at the gradient of the loss (using $\sigma'(0) = 0.25$):

$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial a_j} \cdot \sigma'(z_j) \cdot x_i = \frac{\partial L}{\partial a_j} \cdot 0.25 \cdot x_i$$
Performing the weight update:

$$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial L}{\partial w_{ij}} = w_{ij} - 0.25 \, \eta \, \frac{\partial L}{\partial a_j} \, x_i$$
Now we are moving somewhere, but there is a problem: all weights move in the same direction and all parameters get exactly the same update, so every neuron performs the same computation and learns the same feature. This is known as weight symmetry.
As an exercise, you can repeat the derivation with the weights initialized to a nonzero constant value; all weights will again move in the same direction.
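The following NumPy sketch runs this experiment: a one-hidden-layer sigmoid network with every weight set to the same constant, trained for a few gradient steps (the data, layer sizes, and learning rate are arbitrary):

```python
import numpy as np

# A small demo of weight symmetry: every weight starts at the same constant.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))                  # 32 examples, 3 features
y = rng.normal(size=32)                       # random regression targets

W1 = np.full((4, 3), 0.5)                     # constant init: every weight is 0.5
w2 = np.full(4, 0.5)
lr = 0.1

for _ in range(100):
    # Forward pass
    z = X @ W1.T                              # (32, 4) hidden pre-activations
    a = sigmoid(z)                            # (32, 4) hidden activations
    y_hat = a @ w2                            # (32,) predictions
    # Backward pass for mean squared error
    d_y_hat = (y_hat - y) / len(y)
    d_w2 = a.T @ d_y_hat
    d_a = np.outer(d_y_hat, w2)
    d_z = d_a * a * (1.0 - a)                 # sigmoid'(z) = a * (1 - a)
    d_W1 = d_z.T @ X
    # Gradient step
    W1 -= lr * d_W1
    w2 -= lr * d_w2

print(W1)   # every row is still identical: the symmetry is never broken
```

After training, every row of the hidden weight matrix is still identical, so the hidden layer effectively behaves like a single neuron.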
A step towards random initialization
Constant initialization didn't work out well; we need a (symmetry) break. Next try: set the weights to small random numbers drawn from some distribution (e.g., the uniform or the normal distribution). Since the neurons now start out unique and random, they will be updated differently.
What happens when the weights are set either too low or too high? During backpropagation, the gradient flowing through each layer is scaled by that layer's weights. Weights that are too small shrink the signal until it is effectively lost, which is known as the vanishing gradient problem. Weights that are too large produce exploding gradients, leading to unstable updates and oscillations around the minimum.
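A rough NumPy sketch of this scaling effect, using a stack of purely linear layers so that only the weight magnitudes matter (the depth, width, and the two standard deviations are arbitrary; the same multiplicative argument applies to the gradients flowing backwards through the stack):

```python
import numpy as np

# Push a unit-variance input through a deep stack of linear layers whose
# weights are drawn with a fixed standard deviation.
rng = np.random.default_rng(0)

def signal_std(weight_std, n_layers=30, width=256):
    a = rng.normal(size=width)                    # unit-variance input vector
    for _ in range(n_layers):
        W = rng.normal(0.0, weight_std, size=(width, width))
        a = W @ a                                 # no activation: pure scaling
    return a.std()

print("too small:", signal_std(0.01))  # shrinks towards 0 (vanishing signal)
print("too large:", signal_std(0.10))  # blows up (exploding signal)
```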
To take a step away from these issues, we can impose two simplifying conditions:
- The weights and the inputs should be zero-mean, and the activation function should be approximately linear for small inputs (e.g., tanh or the sigmoid; we ignore ReLU for now).
- The variance of the activations should be the same across the layers.
The second condition stems from the fact that, if the weights are randomly initialized just as-is, without any additional modifications, the variance of the outputs grows with the number of inputs. Large output variance contributes to gradient explosion: small changes in the weights can lead to large changes in the loss, the gradients become large, and the resulting weight updates are so dramatic that learning becomes unstable. By making the output variance roughly the same across the layers, we ensure that the updates happen at a similar rate.
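A quick numerical check of this effect, assuming unit-variance inputs and weights drawn with a fixed variance of 1 (the fan-in values and sample count are arbitrary):

```python
import numpy as np

# With a fixed weight variance, the variance of a neuron's pre-activation
# grows linearly with the number of inputs.
rng = np.random.default_rng(0)

for fan_in in (10, 100, 1000):
    x = rng.normal(size=(5_000, fan_in))    # unit-variance inputs
    w = rng.normal(0.0, 1.0, size=fan_in)   # weights with variance 1
    z = x @ w                               # pre-activations of one neuron
    print(f"fan_in={fan_in:4d}  Var(z) ~ {z.var():7.1f}")  # roughly equal to fan_in
```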
In the next section, we will see how Xavier initialization works under these assumptions.
Xavier initialization
The main goal of the Xavier initialization is to set the weights such that the variance of the activations is the same across the layers, avoiding the vanishing and exploding gradient problems. The Xavier initialization draws the weights randomly from the normal distribution with a mean of 0 and a variance that depends on the layer size, and was specifically designed for the sigmoid activation (but can also be derived for tanh).
The bias is set to 0. It's possible to set the bias to small random numbers, but the random weights already break the symmetry. Also, since the biases are simply added to the weighted sum ($z_j = \sum_i w_{ij} x_i + b_j$), they do not affect the variance across the layers, so it's common to keep them at 0.
The Xavier initialization is defined as follows:

$$W \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{\text{in}} + n_{\text{out}}}\right),$$

where $n_{\text{in}}$ and $n_{\text{out}}$ are the numbers of input and output weight connections for a given layer.
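As a reference, here is what this looks like in plain NumPy (the function name and the 784 -> 256 layer shape are just for illustration; most deep learning frameworks provide this initializer out of the box):

```python
import numpy as np

# A sketch of Xavier (Glorot) normal initialization for a dense layer.
def xavier_normal(fan_in, fan_out, rng):
    """Draw a (fan_out, fan_in) weight matrix from N(0, 2 / (fan_in + fan_out))."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = xavier_normal(fan_in=784, fan_out=256, rng=rng)   # e.g. a 784 -> 256 layer
b = np.zeros(256)                                     # biases are kept at 0
print(W.std())    # close to sqrt(2 / (784 + 256)) ~ 0.044
```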
The classic Xavier derivation (optional)
The requirement that the variance of the activations should be the same across the layers is formally stated as

$$\operatorname{Var}(a^{(l)}) = \operatorname{Var}(a^{(l-1)}).$$

In order to derive this, we assume the weights and the inputs to be mutually independent and identically distributed. We take the tanh activation since it's almost linear around 0 ($\tanh(z) \approx z$ for small $z$), which gives $a \approx z$. The bias is set to 0. The input and the activation are defined as

$$z = Wx, \qquad a = \tanh(z) \approx z.$$

The product $z = Wx$ can be expanded element-wise, where each element becomes

$$z_j = \sum_{i=1}^{n_{\text{in}}} w_{ij} x_i.$$

Since the variance of the vector is the same as the variance of any of its entries (which are drawn independently and identically), we get

$$\operatorname{Var}(a) \approx \operatorname{Var}(z) = \operatorname{Var}(z_j) = \operatorname{Var}\!\left(\sum_{i=1}^{n_{\text{in}}} w_{ij} x_i\right).$$

Next, we use the product of variances rule. Letting $X = w_{ij}$ and $Y = x_i$, we get:

$$\operatorname{Var}(XY) = \operatorname{Var}(X)\operatorname{Var}(Y) + \operatorname{Var}(X)\,\mathbb{E}[Y]^2 + \operatorname{Var}(Y)\,\mathbb{E}[X]^2,$$

and $\mathbb{E}[w_{ij}] = \mathbb{E}[x_i] = 0$ (since the inputs are normalized and the weights are initialized with 0 mean), which leads to

$$\operatorname{Var}(w_{ij} x_i) = \operatorname{Var}(w_{ij})\operatorname{Var}(x_i).$$

The weights and the inputs are independent and identically distributed, thus

$$\operatorname{Var}\!\left(\sum_{i=1}^{n_{\text{in}}} w_{ij} x_i\right) = \sum_{i=1}^{n_{\text{in}}} \operatorname{Var}(w_{ij} x_i),$$

and similarly, every term in the sum has the same variance, $\operatorname{Var}(w)\operatorname{Var}(x)$, which (finally!) gives

$$\operatorname{Var}(a) \approx \operatorname{Var}(z) = n_{\text{in}} \operatorname{Var}(w)\operatorname{Var}(x).$$

So, if we want to keep the variance the same across the layers ($\operatorname{Var}(a) = \operatorname{Var}(x)$, the starting point of this section), $n_{\text{in}} \operatorname{Var}(w)$ should be $1$, i.e. $\operatorname{Var}(w) = \frac{1}{n_{\text{in}}}$, justifying this initialization. Repeating the same argument for the backward pass gives $\operatorname{Var}(w) = \frac{1}{n_{\text{out}}}$; the Xavier variance $\frac{2}{n_{\text{in}} + n_{\text{out}}}$ is a compromise between the two.
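Here is a quick numerical sanity check of the key step $\operatorname{Var}(z) \approx n_{\text{in}} \operatorname{Var}(w)\operatorname{Var}(x)$ in plain NumPy (the layer size and sample count are arbitrary):

```python
import numpy as np

# Check that Var(z) ~ n_in * Var(w) * Var(x), and that Var(w) = 1 / n_in
# keeps the pre-activation variance equal to the input variance.
rng = np.random.default_rng(0)
n_in, n_samples = 256, 20_000

x = rng.normal(0.0, 1.0, size=(n_samples, n_in))      # Var(x) = 1
w = rng.normal(0.0, np.sqrt(1.0 / n_in), size=n_in)   # Var(w) = 1 / n_in
z = x @ w                                             # pre-activation of a single neuron

print("n_in * Var(w) * Var(x):", n_in * w.var() * x.var())  # ~1
print("Var(z):                ", z.var())                   # ~1, matches the formula
```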
Xavier works for activation functions that are approximately linear around 0, making it suitable for the sigmoid and the tanh activations. What about the go-to ReLU? This is where He initialization comes in.
He initialization
The He initialization is a modification of Xavier that extends it to the ReLU activation (and the ReLU variants):

$$W \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{\text{in}}}\right)$$
He initialization keeps the variance of the layer outputs stable across the layers. The bias is also set to 0.
The main difference between Xavier and He initialization is the variance of the weight values. The He variance was specifically derived for the ReLU function, which is neither symmetric nor zero-centered: ReLU outputs zero for negative inputs and the input itself for positive inputs. As a result, about half of the outputs are zero, and the variance of the outputs is roughly half of the variance of the pre-activations. This is why the variance is $\frac{2}{n_{\text{in}}}$, where the factor $2$ compensates for the halving.
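A short NumPy sketch of He initialization, together with a comparison that makes the halving visible (the helper names, depth, width, and sample count are arbitrary; the $\frac{1}{n_{\text{in}}}$ variant stands in for Xavier-style scaling):

```python
import numpy as np

# He normal initialization, plus a comparison showing why the factor of 2
# is needed when the activation is ReLU.
rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out, rng):
    """Draw a (fan_out, fan_in) weight matrix from N(0, 2 / fan_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def xavier_style(fan_in, fan_out, rng):
    """Variance 1 / fan_in, i.e. no compensation for ReLU's halving."""
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))

def relu_stack_std(init_fn, width=256, n_layers=20, n_samples=2_000):
    """Push unit-variance inputs through a stack of ReLU layers."""
    a = rng.normal(size=(n_samples, width))
    for _ in range(n_layers):
        W = init_fn(width, width, rng)
        a = np.maximum(0.0, a @ W.T)
    return a.std()

print("1/n_in scaling:", relu_stack_std(xavier_style))  # far smaller than the input std
print("He, 2/n_in:    ", relu_stack_std(he_normal))     # stays on the order of the input std
```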
Since ReLU is the most popular activation for the hidden layer, He initialization should be used in practice.
Conclusion
As a result, you are now familiar with:
- The issues of zero and constant initialization;
- The motivation behind Xavier initialization, which is suitable for the sigmoid and tanh activations;
- He initialization, which works with the ReLU non-linearity.