
Multilayer perceptron


Multilayer perceptrons (MLPs), also known as feedforward neural networks, are a class of neural networks composed of multiple layers of perceptrons (small processing units) connected in a feedforward manner. They are widely used for supervised learning. With one or more hidden layers between the input and output layers, they can model complex relationships between inputs and outputs.

They are popular for pattern recognition tasks such as computer vision and NLP. Their ability to model complex nonlinear relationships using multiple layers of simple neuron-like processing units makes them useful for real-world pattern modeling across several domains.

What are perceptrons?

A perceptron is an artificial neuron that represents the simplest possible neural network unit. It operates on numeric inputs, computing a weighted sum of those inputs, similar to a linear regression model.

A basic model of a perceptron

The illustration above depicts a perceptron with three input nodes x_1, x_2, x_3, each associated with weights w_1, w_2, w_3 respectively. These weight values determine the relative influence that each input has on the output y. Let's see how the perceptron computes the output.

\begin{cases} y = 1, & \text{if } \sum w_i \cdot x_i > \theta \\ y = 0, & \text{if } \sum w_i \cdot x_i \leq \theta \end{cases}

The output y, either 0 or 1, is determined by whether this weighted sum exceeds the pre-defined threshold value θ.

The most minimal perceptron consists of a single input layer and an output node. The goal of a feedforward network is to approximate some function f^* by learning the parameters θ that best define the input-output relation:

y = f(x; θ)

where x is the input, y is the output, and θ denotes the parameters to be learned.
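
To make the threshold rule concrete, here is a minimal perceptron sketch in Python with NumPy; the specific inputs, weights, and threshold below are illustrative assumptions, not values from the figure.

```python
import numpy as np

def perceptron(x, w, theta):
    """Return 1 if the weighted sum of inputs exceeds the threshold, else 0."""
    weighted_sum = np.dot(w, x)
    return 1 if weighted_sum > theta else 0

# Illustrative values (assumed for this example)
x = np.array([1.0, 0.5, -0.2])   # inputs x_1, x_2, x_3
w = np.array([0.4, 0.3, 0.9])    # weights w_1, w_2, w_3
theta = 0.3                      # threshold

print(perceptron(x, w, theta))   # prints 1, since the weighted sum 0.37 > 0.3
```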

Increased complexity: multi-layered perceptron network

In contrast to a single perceptron, MLPs are characterized by a layered architecture consisting of an input layer, one or more hidden layers, and an output layer.

Before proceeding further, it's important to get an intuition for the role that the number of hidden layers plays in the network.

When no hidden layers are present, the MLP reduces to a linear regression model. In this simplified form, the network lacks the capability to capture nonlinear patterns and can only model linear relationships between the input and output variables. As the number of hidden layers increases, the network gains capacity to capture nonlinear patterns, enabling it to model increasingly complex relationships between variables. Thus, the degree of non-linearity in the data determines the complexity of the network required, with more hidden layers facilitating the modeling of higher-order nonlinear relationships.

A model for the multi-layered perceptron

The hidden units combine inputs into meaningful features via a set of learned weights. They also apply nonlinear activation functions (such as ReLU or tanh) to transform the input representation into one in which classes that are not linearly separable can be separated.
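
For reference, the activations mentioned above, plus the sigmoid used in the worked example later in this topic, can be written in a few lines of NumPy; this is a minimal sketch rather than a complete catalogue.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes values into the range (-1, 1)
    return np.tanh(z)

def relu(z):
    # Keeps positive values and zeroes out negative ones
    return np.maximum(0.0, z)
```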

Training a feedforward network involves iteratively adjusting the parameters to achieve the best possible approximation of the target function. The Universal Approximation Theorem states that a multilayer perceptron (MLP) with at least one hidden layer containing a finite number of neurons can approximate any continuous function (on a bounded input domain) to arbitrary accuracy. In simple words, it tells us that neural networks are strong function approximators.
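
As a quick illustration of this approximation power, the sketch below fits a one-hidden-layer MLP to a nonlinear target function. It assumes scikit-learn is available, and the particular choices (a sine target, 50 tanh units) are arbitrary ones made for the demo.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Assumed demo setup: approximate y = sin(x) with a single hidden layer
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel()

mlp = MLPRegressor(hidden_layer_sizes=(50,), activation='tanh',
                   max_iter=5000, random_state=0)
mlp.fit(X, y)

print(mlp.score(X, y))  # an R^2 close to 1.0 indicates a good approximation
```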

Computational graphs

Computational graphs are a way of expressing mathematical expressions as a directed graph. Nodes correspond to input or output variables and to mathematical operations, while edges represent the flow of data between them. The nodes are ordered so that their outputs can be computed one after the other, and the direction of the edges indicates the sequence of operations.

Let's understand this better with examples. The goal is to represent Z(x, y) = x + y. In this example, the two variable circles (x and y) connect with arrows to a single output circle (Z), showing that Z is calculated by adding x and y.

A computational graph for z = x+y

Another example: Z(A, x, b) = Ax + b:

A computation graph for z = Ax+b
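
The sketch below shows, in Python, how such a graph can be evaluated node by node; the tiny Add and MatMul classes and the concrete values are illustrative assumptions, not a real computational-graph library.

```python
import numpy as np

class Add:
    """Node that adds its two inputs."""
    def forward(self, a, b):
        return a + b

class MatMul:
    """Node that multiplies a matrix by a vector."""
    def forward(self, A, x):
        return A @ x

# Evaluate Z(A, x, b) = Ax + b by visiting the nodes in order
A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([0.5, -1.0])
b = np.array([0.1, 0.1])

Ax = MatMul().forward(A, x)   # first node: A x
Z = Add().forward(Ax, b)      # second node: Ax + b
print(Z)                      # [-1.4 -2.4]
```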

Forward propagation in MLP

Forward propagation refers to the sequential calculation of outputs from the input layer through the hidden layers all the way to the output layer. During the forward pass, each neuron receives inputs from the previous layer, performs a dot product with the connection weights, adds the bias and applies the activation function to compute its output. This output then serves as input to the next layer, propagating information forward from input features all the way through the neural network layers. Let's see this mathematically.

An MLP to perform forward propagation

The network parameters are the weights and the biases (the parameters are the values that are learned during training):

  1. w_{jk}^{l} — the weight from the k^{th} neuron in the (l-1)^{th} layer to the j^{th} neuron in the l^{th} layer.

  2. b_j^l — the bias of the j^{th} neuron in the l^{th} layer.

Here, a_j^l denotes the activation, or output, of the j^{th} neuron in the l^{th} layer. The activation a_j^l in the l^{th} layer is related to the activations in the (l-1)^{th} layer by the equation

\begin{align*} a_j^l = \sigma \left( \sum_{k} w_{jk}^{l} a_{k}^{l-1} + b_{j}^{l} \right) \end{align*}

where the sum runs over all neurons k in the (l-1)^{th} layer, and σ denotes the activation function.

A calculation example

Compact vectorized form

\begin{align*} a^l &= \sigma \left( w^{l} a^{l-1} + b^{l} \right) \\ a_j^l &= \sigma(z_{j}^{l}) \end{align*}

Let

\begin{align*} z^l &= w^{l} a^{l-1} + b^{l} \\ z_j^l &= \sum_k w_{jk}^{l} a_{k}^{l-1} + b_j^{l} \end{align*}

where z^l is the vector of weighted inputs to all the neurons in layer l.
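
Translating the vectorized form into code, the sketch below performs one forward pass through an MLP with NumPy; the function and variable names, and the choice of sigmoid as the activation, are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(a, weights, biases):
    """Compute a^l = sigma(w^l a^{l-1} + b^l) layer by layer; return the final activation."""
    for W, b in zip(weights, biases):
        z = W @ a + b       # weighted input z^l
        a = sigmoid(z)      # activation a^l
    return a
```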

An example of forward propagation

Let's take a step-by-step look into forward propagation with the help of a small example. Consider the MLP below.

An example of an MLP

We first write the weight matrices (W^2, W^3), the input vector (a^1), and the target output (y^3). The biases, taken from the network diagram, are b_1^2 = b_2^2 = 0.35 for the hidden layer and b_1^3 = b_2^3 = 0.60 for the output layer:

\begin{align*} W^2 &= \begin{bmatrix} w^2_{1,1} & w^2_{1,2} \\ w^2_{2,1} & w^2_{2,2} \end{bmatrix} = \begin{bmatrix} 0.15 & 0.20 \\ 0.25 & 0.30 \end{bmatrix}\\ W^3 &= \begin{bmatrix} w^3_{1,1} & w^3_{1,2} \\ w^3_{2,1} & w^3_{2,2} \end{bmatrix} = \begin{bmatrix} 0.40 & 0.45 \\ 0.50 & 0.55 \end{bmatrix}\\ a^1 &= \begin{bmatrix} a^1_1 \\ a^1_2 \end{bmatrix} = \begin{bmatrix} 0.05 \\ 0.10 \end{bmatrix}\\ y^3 &= \begin{bmatrix} y^3_1 \\ y^3_2 \end{bmatrix} = \begin{bmatrix} 0.01 \\ 0.99 \end{bmatrix} \end{align*}

Here (y^3_1, y^3_2) is the given ground truth. Our goal is to compute the hidden-layer activations (a^2_1, a^2_2) and the predictions at the output nodes (\hat{y}^3_1, \hat{y}^3_2).

Let's start by computing the activations of layer L_2, the hidden layer, i.e. a_1^2 and a_2^2:

\begin{align*} a_j^l &= \sigma \left(\sum_k w_{jk}^l \cdot a_k^{l-1} + b_j^l \right) \\ a_1^2 &= \sigma \left(w_{11}^2 a_1^{1} + w_{12}^2 a_2^{1} + b_1^2 \right) \\ &= \sigma \left(0.15 \cdot 0.05 + 0.20 \cdot 0.10 + 0.35 \right) \\ &= \sigma (0.3775) \\ &= \frac{1}{1+e^{-0.3775}} \\ a_1^2 &= 0.593269 \end{align*}

Similarly, computing a_2^2:

\begin{align*} a_2^2 &= \sigma \left(w_{21}^2 a_1^{1} + w_{22}^2 a_2^{1} + b_2^2 \right) \\ &= \sigma \left(0.25 \cdot 0.05 + 0.30 \cdot 0.10 + 0.35 \right) \\ &= \sigma (0.3925) \\ &= \frac{1}{1+e^{-0.3925}} \\ a_2^2 &= 0.596884 \end{align*}

Next, at the output layer, we compute \hat{y}_1^3 and \hat{y}_2^3:

\begin{align*} \hat{y}_1^3 &= \sigma \left(w_{11}^3 a_1^{2} + w_{12}^3 a_2^{2} + b_1^3 \right) \\ &= \sigma \left(0.40 \cdot 0.593269 + 0.45 \cdot 0.5968843 + 0.60 \right) \\ &= \sigma (1.1059059) \\ &= \frac{1}{1+e^{-1.1059059}} \\ \hat{y}_1^3 &= 0.751365\\ \hat{y}_2^3 &= \sigma \left(w_{21}^3 a_1^{2} + w_{22}^3 a_2^{2} + b_2^3 \right) \\ &= \sigma \left(0.50 \cdot 0.593269 + 0.55 \cdot 0.5968843 + 0.60 \right) \\ &= \sigma (1.22492) \\ &= \frac{1}{1+e^{-1.22492}} \\ \hat{y}_2^3 &= 0.772928 \end{align*}

Summarizing the above computations, we have

\begin{align*} a_1^2 &= 0.593269\\ a_2^2 &= 0.596884\\ \hat{y}_1^3 &= 0.751365\\ \hat{y}_2^3 &= 0.772928\\ \end{align*}

Recall that a_j^l is the output of the j^{th} neuron in the l^{th} layer. Therefore, a_1^2 and a_2^2 are the outputs of layer L_2, and \hat{y}_1^3 and \hat{y}_2^3 are the activations of the last layer. This computation represents one full forward propagation. In the next step, the error is computed and backward propagation takes place.
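
The same numbers can be reproduced with a short NumPy sketch; the variable names are assumptions, but the weights, biases, and input are exactly those of the example above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights, biases, and input from the worked example
W2 = np.array([[0.15, 0.20], [0.25, 0.30]])
W3 = np.array([[0.40, 0.45], [0.50, 0.55]])
b2 = np.array([0.35, 0.35])
b3 = np.array([0.60, 0.60])
a1 = np.array([0.05, 0.10])

a2 = sigmoid(W2 @ a1 + b2)     # hidden-layer activations
y_hat = sigmoid(W3 @ a2 + b3)  # output-layer predictions

print(a2)     # approx. [0.593270  0.596884]
print(y_hat)  # approx. [0.751365  0.772928]
```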

Conclusion

As a result, you are now familiar with the following:

  • A perceptron is the simplest artificial neural network unit: it makes binary predictions by comparing the weighted sum of numeric inputs to a threshold, mimicking a biological neuron.

  • MLPs are inspired by biological neural networks where neurons communicate signals through synaptic connections that are strengthened by learning.

  • MLPs exhibit a hierarchical architecture with input, hidden, and output layers, allowing them to model intricate patterns and relationships within data.

  • Computational graphs allow simple functions to be combined to form quite complex models in neural networks.

  • Forward propagation in an MLP is the sequential calculation of outputs from the input layer to the output layer via the hidden layers, where each neuron computes a weighted sum of its inputs, adds a bias, and applies an activation function.
