The convolutional layer is the fundamental building block of a convolutional network and is responsible for the vast majority of its computation. The layer works by applying a set of filters to the input data, which allows the network to learn to identify features in that data.
In this topic, we will look at the core ideas of convolutions.
The problem setup
Standard neural networks take in a single vector and transform it via a series of hidden layers. Each of these layers consists of multiple neurons, and all neurons are fully interconnected with the neurons in the preceding layer. The neurons in a specific layer function independently and do not share connections. The output layer corresponds to class scores in classification scenarios.
In CIFAR-10 image classification, the images are of size 32×32×3 (32 pixels wide, 32 pixels high, 3 color channels), so a single fully-connected neuron in the first hidden layer of a regular neural network would have 32 × 32 × 3 = 3072 weights. This number may seem manageable, but the approach quickly breaks down when scaled to larger images.
For example, a more realistic image, say 1080×1080×3, would result in neurons with 1080 × 1080 × 3 = 3,499,200 weights each. We would almost certainly want many neurons of this type, so the parameter count adds up quickly; this full connectivity is wasteful and especially prone to overfitting.
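These counts are just products of the input dimensions; a quick check in plain Python (the 1080×1080×3 size is only an illustrative choice):

```python
# A single fully-connected neuron needs one weight per input value.
cifar_weights = 32 * 32 * 3        # 3072 for a CIFAR-10 image
large_weights = 1080 * 1080 * 3    # 3499200 for a 1080x1080 RGB image
print(cifar_weights, large_weights)
```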
It’s in this kind of scenario that convolutions enter the picture.
A basic outline of a convolution
The basic mechanism of a convolution could be illustrated as follows:
The parameters of a convolutional layer consist of a set of learnable filters. Each filter is small along its spatial dimensions (height and width) but extends through the full depth of the input volume. A typical filter on the first layer of a convolutional network might be of size 5×5×3 (for instance, 5 pixels in height and width, and 3 because images have three color channels). During the forward pass, we slide each filter across the width and height of the input volume, computing the dot product between the filter and the input at every position. Sliding the filter over the input's width and height produces a 2D activation map that records the filter's responses at every spatial position.
Intuitively, the network will learn filters that activate when they encounter a certain visual feature (e.g., an edge of some orientation) at the initial layer, or eventually more complex patterns at the network's higher layers. Each convolutional layer will usually have an entire set of filters (for instance, 12 of them), and each of them produces a separate 2D activation map. The output volume is then formed by stacking these activation maps along the depth dimension.
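To make these shapes concrete, here is a minimal PyTorch sketch; the 12 filters of size 5×5 simply mirror the numbers used above and are not a requirement:

```python
import torch
import torch.nn as nn

# A hypothetical first convolutional layer: 12 filters, each 5x5 and spanning
# the 3 color channels of the input (numbers mirror the examples above).
conv = nn.Conv2d(in_channels=3, out_channels=12, kernel_size=5)

# One RGB CIFAR-10-sized image: batch of 1, 3 channels, 32x32 pixels.
x = torch.randn(1, 3, 32, 32)

# Each filter produces its own 2D activation map; the 12 maps are stacked
# along the channel (depth) dimension to form the output volume.
out = conv(x)
print(out.shape)  # torch.Size([1, 12, 28, 28])
```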
Receptive field
When working with high-dimensional inputs, it is wasteful to connect every neuron to all neurons in the previous volume, as we saw above. Instead, each neuron is connected only to a small local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field (equivalently, the filter size). The extent of the connectivity along the depth axis is always equal to the depth of the input volume. Note the asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: the connections are local in 2D space (along width and height) but always span the full depth of the input volume.
For instance, suppose the input volume has size 32×32×3 (like an RGB CIFAR-10 image). If the filter is 5×5, then each neuron in the convolutional layer is connected to a 5×5×3 region of the input volume, for a total of 5 × 5 × 3 = 75 weights (plus one bias parameter). Note that the connectivity along the depth axis must be 3, since this is the depth of the input volume.
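A quick way to confirm the 75-weights-plus-one-bias count is to inspect a PyTorch layer with a single 5×5 filter over a 3-channel input (a minimal sketch for illustration):

```python
import torch.nn as nn

# A single 5x5 filter over a 3-channel input: 5 * 5 * 3 = 75 weights plus 1 bias,
# the same per-neuron count described above.
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)

print(conv.weight.shape)  # torch.Size([1, 3, 5, 5])  -> 75 weights
print(conv.bias.shape)    # torch.Size([1])           -> 1 bias
print(sum(p.numel() for p in conv.parameters()))  # 76
```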
Determining the depth and the stride
So far we have explained how each neuron in the convolutional layer connects to the input volume, but not how many neurons there are in the output volume or how they are arranged. Three hyperparameters control the size of the output volume: depth, stride, and padding. We will cover padding in the upcoming topics and consider the other two next:
- First, the depth of the output volume corresponds to the number of filters used, each of which learns to look for something different in the input. For example, if the first CONV layer takes the raw image as input, different neurons along the depth dimension may activate in the presence of various oriented edges or blobs of color. We will refer to a set of neurons that all look at the same region of the input as a depth column.
- Second, we must specify the stride with which we slide the filter. When the stride is 1, the filters move one pixel at a time. When the stride is 2, the filters jump two pixels at a time as they slide around, which produces spatially smaller output volumes (see the sketch after this list).
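Here is a small sketch of how the stride affects the output size, assuming the same 5×5 filters on a 32×32×3 input as before:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # one RGB 32x32 image

# Identical 5x5 filters, different strides: a larger stride slides the filter
# in bigger steps and therefore yields a smaller spatial output.
conv_s1 = nn.Conv2d(3, 12, kernel_size=5, stride=1)
conv_s2 = nn.Conv2d(3, 12, kernel_size=5, stride=2)

print(conv_s1(x).shape)  # torch.Size([1, 12, 28, 28])  (32 - 5)/1 + 1 = 28
print(conv_s2(x).shape)  # torch.Size([1, 12, 14, 14])  floor((32 - 5)/2) + 1 = 14
```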
We can dramatically cut down the number of parameters by making one reasonable assumption: if a feature is useful to compute at one spatial position, then it should also be useful to compute at a different position. If we call a single 2D cross-section of the volume at a fixed depth a 'depth slice' (for instance, a volume of size 55×55×96 has 96 such slices, each of size 55×55), we constrain the neurons within the same depth slice to use the same weights and bias. In practice, during backpropagation every neuron in the volume computes the gradient for its weights, but these gradients are summed up over each depth slice, and only a single set of weights per slice is updated.
When all neurons in a single depth slice utilize the same weight vector, the CONV layer's forward pass can be calculated as a convolution of the neuron's weights with the input volume within each depth slice. This is why these sets of weights are often referred to as a filter that is convolved with the input, thus giving rise to the term Convolutional Layer.
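To make parameter sharing concrete, the sketch below uses an AlexNet-like first layer (96 filters of size 11×11 with stride 4; these particular numbers are only illustrative). Because the same filters are reused at every spatial position, the parameter count depends only on the filter shapes, not on the spatial size of the input:

```python
import torch
import torch.nn as nn

# 96 filters of size 11x11x3, applied with stride 4.
conv = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4)

# Shared weights: 96 * (11 * 11 * 3) weights + 96 biases, regardless of input size.
print(sum(p.numel() for p in conv.parameters()))  # 34944

# The very same weights can be convolved with inputs of different spatial sizes.
print(conv(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 96, 55, 55])
print(conv(torch.randn(1, 3, 63, 63)).shape)    # torch.Size([1, 96, 14, 14])
```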
Some architectures make use of 1×1 convolutions, first explored in Network in Network. This may seem counter-intuitive from the standpoint of traditional signal processing, where signals are 2-dimensional and a 1×1 convolution would amount to pointwise scaling. In convolutional networks, however, we operate over three dimensions, and the filters always extend through the full depth of the input volume. For instance, if the input is of size 32×32×3, then 1×1 convolutions effectively perform 3-dimensional dot products, since the input depth is 3 channels.
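A minimal sketch of a 1×1 convolution in PyTorch (the 8 output channels are an arbitrary illustrative choice):

```python
import torch
import torch.nn as nn

# A 1x1 convolution over a 3-channel input: at every spatial position it takes
# a 3-dimensional dot product across the depth of the input volume.
conv_1x1 = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=1)

x = torch.randn(1, 3, 32, 32)
print(conv_1x1(x).shape)      # torch.Size([1, 8, 32, 32])  spatial size unchanged
print(conv_1x1.weight.shape)  # torch.Size([8, 3, 1, 1])    3 weights per filter
```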
The usual equation for the size of the feature map produced by a convolution is

$$O = \frac{W - F}{S} + 1,$$

where $W$ is the length/width of the input, $F$ is the filter size, and $S$ is the stride. Here, we assume no padding is used. Note that there is a stride constraint: the result of this equation should be an integer, but PyTorch applies the floor operation (taking the integer part of the resulting float) when computing the feature map dimensionality.
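The formula and PyTorch's behavior can be checked side by side (a minimal sketch; the 7-pixel input and 3-wide filter are arbitrary):

```python
import torch
import torch.nn as nn

def feature_map_size(w: int, f: int, s: int) -> int:
    """Output size for input width w, filter size f, stride s, no padding."""
    # PyTorch takes the floor when (w - f) / s is not an integer.
    return (w - f) // s + 1

print(feature_map_size(7, 3, 2))  # 3

# Cross-check against an actual PyTorch layer.
conv = nn.Conv2d(1, 1, kernel_size=3, stride=2)
print(conv(torch.randn(1, 1, 7, 7)).shape)  # torch.Size([1, 1, 3, 3])
```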
Dilation
We can also introduce one more hyperparameter to the convolutional layer called the dilation.
Up until this point, we've only talked about contiguously-positioned filters. However, we can also have filters with gaps or spaces in between the cells they are applied to, something known as dilation. For instance, in a single dimension, a filter w of size 3 applied over an input x would compute w[0]*x[0] + w[1]*x[1] + w[2]*x[2].
This represents a dilation of 0. If we apply a dilation of 1 (corresponding to a gap of 1 between each application), the filter would instead calculate w[0]*x[0] + w[1]*x[2] + w[2]*x[4].
This can be beneficial in certain scenarios when used alongside 0-dilated filters, as it allows for a more aggressive merging of spatial information from the inputs with fewer layers. For instance, if you stack two 3×3 convolutional layers, the neurons in the second layer are a function of a 5×5 patch of the input (thus having an effective receptive field of 5×5). Using dilated convolutions expands this effective receptive field at a much faster rate.
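One caveat when translating this to PyTorch: the dilation argument of nn.Conv2d is offset by one relative to the convention above, so dilation=1 means a contiguous filter and dilation=2 corresponds to the "dilation of 1" described here. A minimal sketch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)

# dilation=1 is a contiguous filter (the "dilation of 0" above); dilation=2
# inserts one gap between filter taps (the "dilation of 1" above).
conv_plain   = nn.Conv2d(1, 1, kernel_size=3, dilation=1)
conv_dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)

print(conv_plain(x).shape)    # torch.Size([1, 1, 30, 30])  effective filter width 3
print(conv_dilated(x).shape)  # torch.Size([1, 1, 28, 28])  effective filter width 5
```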
Conclusion
To sum up, you are now familiar with the basics of the convolution operation, its main components, and some considerations for choosing its hyperparameters.