In 2012, AlexNet surpassed the state-of-the-art results of the ImageNet Large Scale Visual Recognition Challenge by a significant margin, using an 8-layer convolutional neural network. This was the first time that learned features were shown to outperform classifiers built on manually designed features, which were the default approach to computer vision tasks at the time.
In this topic, we will take a closer look at the architectural design of AlexNet and see whether the specific components of the architecture are still relevant more than 10 years after its introduction.
The architectural setup of AlexNet
The diagram of the original architecture was presented in the paper as follows:
AlexNet consists of 8 layers (5 convolutional layers and 3 fully-connected layers). In the summary below, CONV corresponds to a convolution, POOL corresponds to maximum pooling, and FC is the fully connected layer:
[CONV -> POOL] -> [CONV -> POOL] -> CONV -> CONV -> [CONV -> POOL] -> [FC, FC, FC] -> SOFTMAX
Here, we will describe the hyperparameters and the operations of the model (the reasons for this detailed explanation will become clear in the last section of this topic). The input is of a fixed size of 224×224×3 (a color image of 224 by 224 pixels). The first layer performs a convolution with an 11×11 filter and a stride of 4. It is followed by a maximum pooling operation with a 3×3 window and a stride of 2 (such that the poolings overlap). As a reminder, maximum pooling is a downsampling operation that reduces the spatial size of the feature maps (and, with it, the amount of computation and the number of parameters in subsequent layers), essentially making the model simpler and less prone to overfitting.
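To make these numbers concrete, here is how the first convolution-plus-pooling block could look in PyTorch (a minimal sketch of our own; the use of PyTorch and the variable names are not from the paper):

```python
import torch.nn as nn

# First AlexNet block: 96 filters of size 11x11 applied with stride 4,
# followed by overlapping 3x3 max pooling with stride 2.
first_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)
```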
The tricky thing with the dimensionality of the feature map after the first layer
As an exercise, try to see what is happening with the dimensions of the first convolutional layer if we consider the usual formula for calculating the output of a convolution (with n denoting the input height/width, f being the filter size, s as the convolution's stride, and p being the padding, which was not mentioned in the original paper, so we can assume it to be set to 0 for now):

output size = (n - f + 2p) / s + 1
If you plug everything into the formula above, you'll notice that the result is a float: (224 - 11 + 0) / 4 + 1 = 54.25 (instead of an int, which would make the operation valid). It's widely assumed that the actual input in AlexNet was 227×227 instead of 224×224, and that either 224 was a typo or the padding details were skipped by the authors; with a 227×227 input, the first layer produces a clean 55×55 feature map.
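The arithmetic is easy to check with a couple of lines of Python (the function name is ours, just for illustration):

```python
def conv_output_size(n, f, s, p=0):
    """Spatial size of a convolution's output: (n - f + 2p) / s + 1."""
    return (n - f + 2 * p) / s + 1

print(conv_output_size(224, 11, 4))  # 54.25 -> not an integer, so something is off
print(conv_output_size(227, 11, 4))  # 55.0  -> the commonly assumed "real" input size
```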
The activation function for the hidden layers is ReLU. In the second layer, the convolution filter is reduced to 5×5, and from the third to the fifth layer, the filter is 3×3. Maximum pooling is present in the first, second, and fifth layers. After the fifth layer, there are three fully connected (FC) layers. The first two FCs are regularized with dropout (randomly setting half of the neurons to zero during training so that they are not involved in the forward or backward pass). The output of the last FC is fed to a 1000-way softmax, which produces a distribution over the 1000 class labels.
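Putting the pieces together, here is a minimal PyTorch sketch of the whole stack (assuming a 227×227 input; the LRN and two-GPU grouping discussed below are omitted). It is an illustration of the layer dimensions from the paper, not a faithful reimplementation:

```python
import torch.nn as nn

# A compact, single-GPU sketch of the AlexNet layer stack.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),     # -> 96 x 55 x 55
    nn.MaxPool2d(3, stride=2),                                  # -> 96 x 27 x 27
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),    # -> 256 x 27 x 27
    nn.MaxPool2d(3, stride=2),                                  # -> 256 x 13 x 13
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),   # -> 384 x 13 x 13
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),   # -> 384 x 13 x 13
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),   # -> 256 x 13 x 13
    nn.MaxPool2d(3, stride=2),                                  # -> 256 x 6 x 6
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),   # FC1
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),          # FC2
    nn.Linear(4096, 1000),                                      # FC3: logits for the 1000-way softmax
)
```

The sketch ends with raw logits because in modern frameworks the softmax is typically folded into the cross-entropy loss during training.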
For training, AlexNet uses SGD with momentum and weight decay. As a side note, AlexNet introduced grouped convolutions, in which the feature maps are split into groups and each group is processed by its own set of filters (which helps the network learn a more varied set of features). However, this was also partially done to distribute the computations across two GPUs.
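As a rough sketch of how these pieces look in PyTorch: the groups argument splits a convolution into independently processed channel groups, and the optimizer values below (learning rate 0.01, momentum 0.9, weight decay 0.0005) are the ones reported in the paper:

```python
import torch
import torch.nn as nn

# Grouped convolution: the 96 input channels are split into 2 groups of 48,
# and each group gets its own 128 filters (256 output channels in total).
grouped_conv = nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2)

# SGD with momentum and weight decay, using the hyperparameters from the paper
# (applied here to the single layer above just for illustration).
optimizer = torch.optim.SGD(grouped_conv.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
```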
A brief note on the motivation behind the architecture design
In 2012, three issues were particularly relevant to AlexNet's design: the data, the hardware limitations, and the risk of overfitting.
AlexNet was trained on 1.2 million labeled samples and used various image augmentations to prevent overfitting (getting more data is the first go-to solution for that), alongside other techniques. The network size was primarily constrained by the GPUs available at the time and by the tolerable training time. Even with the non-conventional training parallelization across two GPUs (which we won't consider in detail here), written in pure CUDA, training took 5 to 6 days. To quote the paper itself: 'All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available' (which indeed turned out to be the case later on).
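The augmentations were simple by today's standards: random 224×224 crops of 256×256 images, horizontal reflections, and a PCA-based color jitter. A rough torchvision sketch of the crop-and-flip part (our own approximation, not the paper's exact pipeline) could look like this:

```python
from torchvision import transforms

# Approximate AlexNet-style augmentation: random crops and horizontal flips.
# The paper's PCA-based color jitter is omitted here for brevity.
train_transform = transforms.Compose([
    transforms.Resize(256),             # rescale so the shorter side is 256 pixels
    transforms.RandomCrop(224),         # random 224x224 patch
    transforms.RandomHorizontalFlip(),  # random horizontal reflection
    transforms.ToTensor(),
])
```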
AlexNet in time: what is still around?
The deep learning and computer vision fields have grown rapidly over the past decade, so, in this section, let's quickly look at the components of AlexNet and see whether they have stood the test of time. More specifically, let us consider the following aspects:
- ReLU as the activation for the hidden layers and softmax for the output;
- Local response normalization;
- Large convolutional filters (11×11 and, to an extent, 5×5);
- Overlapping (maximum) pooling;
- The presence of 3 fully connected layers after the convolutional layers.
ReLU (and its variants) succeeded the problematic tanh and sigmoid activation functions and is still widely used to this day. Tanh was actually compared against ReLU in the original paper: on a small 4-layer convolutional network, it reached 25% training error six times slower than ReLU, and training speed was of great importance at the time.
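The speed difference comes from saturation: tanh's gradient vanishes for large-magnitude inputs, while ReLU's stays constant for positive inputs. A tiny sketch of this effect (the tensor values are arbitrary):

```python
import torch

x = torch.tensor([-5.0, -0.5, 0.5, 5.0], requires_grad=True)
torch.tanh(x).sum().backward()
print(x.grad)   # ~[0.0002, 0.79, 0.79, 0.0002]: gradients vanish where |x| is large

x = torch.tensor([-5.0, -0.5, 0.5, 5.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)   # [0., 0., 1., 1.]: gradient of 1 for every positive input
```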
Local response normalization (LRN) was applied in certain layers after the ReLU to improve generalization. At a high level, LRN normalizes a neuron's activation by the activations of neighboring feature maps at the same spatial position (a form of lateral inhibition), which dampens uniformly large responses. LRN is not that widespread today; batch normalization is far more common.
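Both are available as drop-in layers in PyTorch. In the sketch below, the LRN hyperparameters are the ones reported in the paper (note that implementations differ slightly in how they scale alpha over the window), and the tensor shape mimics the output of AlexNet's first convolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)   # e.g., the output of AlexNet's first convolution

lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)  # AlexNet-era normalization
bn  = nn.BatchNorm2d(96)                                          # the modern default

print(lrn(x).shape, bn(x).shape)  # both keep the tensor shape: (1, 96, 55, 55)
```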
The usage of large convolution filters (11×11) was motivated by the assumption that the feature map size should be large at the beginning and decrease over the layers, and also that larger-resolution images call for larger filters. Nowadays, the most common setting seems to be a small 3×3 filter, with a stride of 2 where downsampling is needed.
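A common argument for small filters (popularized by VGG) is that a stack of 3×3 convolutions covers the same receptive field as one large filter while using fewer parameters. A quick way to check this (the channel count of 96 is arbitrary):

```python
import torch.nn as nn

C = 96
one_7x7   = nn.Conv2d(C, C, kernel_size=7, padding=3)
three_3x3 = nn.Sequential(*[nn.Conv2d(C, C, kernel_size=3, padding=1) for _ in range(3)])

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(one_7x7), count(three_3x3))  # ~452k vs ~249k parameters, same 7x7 receptive field
```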
Overlapping pooling provided a way to downsample the image, but pooling in general seems to have faded away in favor of strided convolutions. Fully connected layers at the end have stayed, although later architectures usually use fewer than 3 of them (that's not to say 3 FCs disappeared entirely; VGG-19 (2015), for example, also uses 3 FCs at the end, but it's less common). FCs introduce a lot of parameters and take up a considerable portion of the total trainable parameter count; they could also be replaced with a convolutional layer, which significantly reduces the parameter count.
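To illustrate both points, the sketch below compares overlapping max pooling with a strided convolution as a downsampling step, and an AlexNet-style FC head with a global-average-pooling head, a common modern alternative (the tensor sizes mimic AlexNet's last feature map; the head designs are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 13, 13)   # shaped like AlexNet's feature map before the last pooling

# Downsampling: overlapping max pooling vs a strided convolution (which also learns its weights).
pool    = nn.MaxPool2d(kernel_size=3, stride=2)
strided = nn.Conv2d(256, 256, kernel_size=3, stride=2)
print(pool(x).shape, strided(x).shape)   # both produce 6x6 spatial maps

# Classifier head: AlexNet-style FC stack vs global average pooling + one linear layer.
fc_head  = nn.Sequential(nn.Flatten(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Linear(4096, 1000))
gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 1000))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc_head), count(gap_head))   # ~41.8M vs ~0.26M parameters
```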
Conclusion
You are now familiar with the outline of and the motivation behind the AlexNet architecture design, as well as with which of its techniques are still applicable today.