
Activation functions


The activation function is one of the core components of the neural network architecture. The introduction of non-linearities makes the network capable of learning more complex patterns than simple linear regression.

In this topic, we will take a look at some of the most popular activation functions.

Considering the linear case

To start off, let's see why we need to introduce non-linearity between the layers in the first place by considering a simple two-layer network consisting of two linear layers without any activation:

An example of a network with two linear layers

Any combination of linear functions can only result in a linear function, meaning that two sequential linear layers are equivalent to a single linear layer. A linear layer can only fit a straight line to the data, resulting in an overly simplistic decision boundary that fails to capture any specifics of the dataset. However, when non-linearity is introduced after the linear layer, the network becomes capable of learning complex patterns instead of being limited to fitting a straight line:

Two linear layer architecture with and without the ReLU non-linearity
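
To make this concrete, here is a minimal NumPy sketch (the layer sizes and random weights are made up purely for illustration) that verifies how two stacked linear layers collapse into a single one, and shows that inserting a ReLU in between breaks this equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                            # batch of 4 inputs with 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)   # first linear layer
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)   # second linear layer

# Two stacked linear layers without an activation in between...
two_layers = (x @ W1 + b1) @ W2 + b2
# ...are equivalent to a single linear layer with W = W1 @ W2 and b = b1 @ W2 + b2.
single_layer = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(two_layers, single_layer))   # True

# Adding a ReLU between the layers introduces non-linearity,
# so the composition can no longer be written as one linear map.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, single_layer))    # False (in general)
```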

Activation functions for classification output

One of the first introduced activation functions was the sigmoid, which has the following form:

Activation function: $\sigma(z) = \frac{1}{1+e^{-z}}$
The derivative: $\sigma'(z) = \sigma(z)\cdot(1 - \sigma(z))$

The sigmoid activation and its derivative

The input is a real number, and the output is squashed between 0 and 1. Specifically, large negative numbers become 0, and large positive numbers become 1. The sigmoid activation function has two significant downsides:

  • Sigmoids saturate (the output flattens out for large-magnitude inputs), leading to the vanishing gradient problem. The gradient is near zero when the activation saturates at 0 or 1. During backpropagation, gradients are multiplied together to update the weights. This product tends to be extremely small, particularly in deep networks with many layers, resulting in updates close to zero. This blocks the gradient flow and halts training.
  • Sigmoid's outputs are not zero-centered. When subsequent neurons process non-zero-centered data, it can lead to issues during gradient descent. This is because backpropagation can produce all-positive or all-negative gradients, potentially causing oscillations in weight updates. However, this problem is somewhat mitigated by the variable signs in the final weight update across a data batch, making it less severe compared to the saturated activation issue.

In practice, sigmoid should never be used as the activation function in hidden layers due to these issues. However, sigmoid can still be applied in the output layer for binary classification tasks.
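
As a quick illustration of the saturation issue (a small NumPy sketch, not part of the original text), note how the derivative is practically zero for large positive or negative inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(z))             # outputs squashed into (0, 1)
print(sigmoid_derivative(z))  # ~0 at the tails: the gradient vanishes when the unit saturates
```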

Softmax is the default choice for the output layer in multi-class classification. It transforms the input into a vector of probabilities, with each value representing the probability of the input belonging to a particular class.

$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$, where $K$ is the number of classes.

The Softmax activation and its derivative

The sum of the outputs will be equal to 1. Softmax is most suitable for mutually exclusive classes, where each instance is only allowed to belong to one class, and it is not suitable for multi-label classification. Softmax has a tendency to amplify the largest inputs and suppress the smaller ones.
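
Below is a minimal NumPy sketch of softmax (the max-subtraction step is a common numerical-stability trick; it cancels out in the ratio and does not change the result):

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum avoids overflow in exp and leaves the result unchanged.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # [0.659, 0.242, 0.099]: the largest input is amplified
print(probs.sum())  # 1.0
```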

Piecewise linear activation functions

ReLU (Rectified Linear Unit) is one of the most widely used activation functions for the hidden layer, essentially thresholding at 0:

Activation function: $\text{ReLU}(z) = \begin{cases} z, & \text{if } z > 0 \\ 0, & \text{if } z \leq 0 \end{cases}$
The derivative: $\text{ReLU}'(z) = \begin{cases} 1, & \text{if } z > 0 \\ 0, & \text{if } z < 0 \end{cases}$

$\text{ReLU}(z) = \max(0, z)$

ReLU activation and its derivative

ReLU's range is $[0, +\infty)$. ReLU, unlike sigmoid, does not saturate, but it is prone to the problem of dying neurons. For instance, a large gradient passing through a ReLU neuron could change the weights in such a way that the neuron never activates again, and the gradient flowing through the unit would perpetually be zero from that moment onwards. In other words, ReLU units can be pushed off the data manifold during training and stop updating. This issue is especially relevant if the learning rate is set too high or the initialized random weights are too large; a proper choice of the learning rate and careful initialization can mitigate it to an extent.
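
A minimal NumPy sketch of ReLU and its derivative; the zero-gradient region for negative inputs is exactly what makes dying neurons possible:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    # Gradient is 1 for positive inputs and 0 otherwise;
    # a neuron whose pre-activation stays negative receives no gradient ("dies").
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))             # [0.  0.  0.  0.5 3. ]
print(relu_derivative(z))  # [0.  0.  0.  1.  1. ]
```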

Leaky ReLU was introduced to address the issue of dying neurons. Instead of being zero when $z < 0$, Leaky ReLU has a slight positive slope $\alpha$, such as 0.1:

Activation function: $\text{Leaky ReLU}(z) = \begin{cases} z, & \text{if } z > 0 \\ \alpha \cdot z, & \text{if } z \leq 0 \end{cases}$
The derivative: $\text{Leaky ReLU}'(z) = \begin{cases} 1, & \text{if } z > 0 \\ \alpha, & \text{if } z \leq 0 \end{cases}$

$\text{Leaky ReLU}(z) = \max(\alpha \cdot z, z)$

The leaky ReLU activation and its derivative

$\alpha$ is a parameter of Leaky ReLU and can be fine-tuned. Leaky ReLU's range is $(-\infty, +\infty)$, and it is less sensitive to weight initialization than ReLU.
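
For comparison, a sketch of Leaky ReLU (using $\alpha = 0.1$, the example value mentioned above); the gradient never becomes exactly zero, which is how it avoids dying neurons:

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    # alpha is the slope for negative inputs
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.1):
    # Unlike ReLU, the gradient is never exactly zero, so neurons cannot "die".
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(z))             # [-0.3  -0.05  0.5   3.  ]
print(leaky_relu_derivative(z))  # [0.1  0.1  1.   1. ]
```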

Exponent-based activation functions

Tanh squashes a real-valued number to the range [-1, 1]. Like the sigmoid, its activations saturate, but unlike the sigmoid, its output is zero-centered:

Activation function: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
The derivative: $\tanh'(z) = 1 - \tanh(z)^2$

Tanh activation and its derivative

An interesting fact about tanh is that it is a rescaled sigmoid: $\tanh(z) = 2\sigma(2z) - 1$. In practice, the usage of tanh is not that common.
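
The rescaling identity is easy to verify numerically (a small NumPy check, assuming nothing beyond the formula above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
# tanh(z) = 2 * sigmoid(2z) - 1: the two curves coincide
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))  # True
```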

The Exponential Linear Unit (ELU) does not suffer from either the vanishing gradients problem or the dying neurons problem and has the following form:

Activation function: $\text{ELU}(z) = \begin{cases} z, & \text{if } z \geq 0 \\ \alpha \cdot (e^z - 1), & \text{if } z < 0 \end{cases}$
The derivative: $\text{ELU}'(z) = \begin{cases} 1, & \text{if } z > 0 \\ \alpha \cdot e^z, & \text{if } z < 0 \end{cases}$

ELU activation and its derivative

ELU has an $\alpha$ parameter that determines the value at which the function saturates for negative inputs (to be exact, the negative inputs saturate towards $-\alpha$); it is often set to 1.0 or to a value from the [0.1, 0.3] range.
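
A sketch of ELU with $\alpha = 1.0$, showing how negative inputs saturate towards $-\alpha$ while the gradient stays positive:

```python
import numpy as np

def elu(z, alpha=1.0):
    # For negative inputs the output smoothly saturates towards -alpha
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1.0))

def elu_derivative(z, alpha=1.0):
    # The gradient stays positive everywhere, so ELU units do not die
    return np.where(z > 0, 1.0, alpha * np.exp(z))

z = np.array([-10.0, -1.0, 0.0, 1.0])
print(elu(z))             # [-0.99995 -0.632  0.  1.]  -- saturates near -alpha
print(elu_derivative(z))  # [~0.00005  0.368  1.  1.]
```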

Some considerations for using different activation functions

We have covered several activation functions, so now the question is: which one should you use?

  • Use ReLU as the go-to activation function, and monitor the learning rate and the fraction of dead units in the network (but this applies to many activations and the training process in general). Leaky ReLU also works well.
  • Sigmoid should never be used in the hidden layers, only in the output layer for binary classification tasks.
  • Weight initialization matters. If the weights are initialized too large, especially with ReLU, some neurons can end up with a negative pre-activation for every input, so they constantly output 0 and receive no gradient, which stops them from learning.
  • Tanh will almost always perform worse than ReLU.
  • Mixing different activation functions in the hidden layers of the architecture is not forbidden, but it is rarely done. A pretty standard setup is ReLU (or one of its variants) for the hidden layers and Softmax for the output layer, as shown in the sketch below.
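
As an illustration of that standard setup, here is a minimal PyTorch sketch (the layer sizes and batch shapes are arbitrary and chosen only for the example). Note that in PyTorch the softmax is usually not added as a layer, because nn.CrossEntropyLoss applies it internally to the raw logits:

```python
import torch
from torch import nn

# A typical architecture: ReLU in the hidden layers, softmax at the output.
# Layer sizes (20 -> 64 -> 32 -> 3) are arbitrary and chosen only for illustration.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 3),   # raw logits for 3 classes
)

# nn.CrossEntropyLoss combines log-softmax and negative log-likelihood,
# so the softmax is applied inside the loss rather than in the model.
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 20)               # batch of 8 samples
targets = torch.randint(0, 3, (8,))  # class indices
loss = criterion(model(x), targets)
loss.backward()

# At inference time, apply softmax explicitly to obtain class probabilities.
probs = torch.softmax(model(x), dim=1)
```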

Conclusion

As a result, you are now familiar with the following:

  • Some of the most popular activation functions;
  • The reasoning behind introducing the non-linearity;
  • Classical activations (e.g., sigmoid or tanh) and their limitations;
  • The vanishing gradient problem and the dying neuron problem, as well as some ways to mitigate them.