VGGNet is a classical image recognition network that was introduced in 2014 and builds heavily on the ideas of AlexNet. In this topic, we will look at the architectural layout of VGG and how it compares to AlexNet.
The motivation for VGG
VGG is strikingly similar to AlexNet in terms of its building blocks; the main difference between them lies in the number of layers and the number of trainable parameters. In their original 2012 paper, the authors of AlexNet wrote:
'All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available'
As we will shortly see, this was partially the case for VGG: it was trained on 4 GPUs of a newer generation (versus the 2 GPUs used for AlexNet) over the course of 2-3 weeks (compared to the five to six days of AlexNet training, which was constrained mainly by the ImageNet competition deadline). The second assumption behind VGG was that deeper networks (networks containing more layers) lead to better performance.
Below is a summary table for the components of AlexNet and whether they are present in VGG (we will elaborate on the details in the upcoming section):
| AlexNet component | Details | Present in VGG? |
| --- | --- | --- |
| Convolutions with a large filter | Specifically, 11x11 and 5x5 | No |
| Activation function | ReLU | Yes |
| Optimizer | SGD with momentum and weight decay | Yes |
| Dropout | - | Yes |
| Local response normalization (LRN) | - | No |
| Overlapping maximum pooling | - | No (although maximum pooling itself is present) |
The architectural setup
VGG consists of stacked convolutional and maximum pooling layers, with the spatial size of the feature maps decreasing over the layers while the number of channels grows. The convolutional layers are followed by three fully connected layers and a softmax at the end to produce a prediction.
The most important change between AlexNet and VGG was the depth of the network. VGG-16 uses 16 weight layers and roughly 138 million parameters (compared to the 8 layers and roughly 60 million parameters of AlexNet).
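As a quick sanity check of these numbers, here is a minimal sketch (assuming PyTorch and torchvision are installed) that builds an untrained VGG-16 and counts its parameters; the total comes out to roughly 138 million:

```python
import torchvision

# Build an untrained VGG-16 (weights=None skips downloading pretrained weights;
# older torchvision versions use pretrained=False instead).
model = torchvision.models.vgg16(weights=None)

# Count all trainable parameters; for VGG-16 this is roughly 138 million.
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"VGG-16 parameters: {total_params:,}")
```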
Just like AlexNet, VGG accepts a fixed-size input of 224x224 RGB images, a standard crop of an ImageNet image.
The divergence from AlexNet starts with the use of smaller 3x3 filters and a fixed stride of 1 for all convolutional layers, while AlexNet used an 11x11 filter with a stride of 4 in the first layer and a 5x5 filter in the second. Large filter sizes did not prove to be effective (this also has to do with the receptive field, which can be grown just as quickly by stacking smaller filters), so they were replaced by the (almost) universal 3x3 filters, a convention that carried over to the CNNs that came after VGG. The spatial resolution of the input to a convolutional layer is preserved after convolution (the padding is 1 for the 3x3 convolutional layers).
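A minimal sketch of this behavior (assuming PyTorch is available): a 3x3 convolution with stride 1 and padding 1 leaves the 224x224 spatial dimensions untouched.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)  # a batch of one 224x224 RGB image

# 3x3 convolution, stride 1, padding 1 -- the VGG-style convolution.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)

print(conv(x).shape)  # torch.Size([1, 64, 224, 224]) -- spatial size is preserved
```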
The next difference comes from how maximum pooling is performed. Maximum pooling in VGG is performed over a 2x2 pixel window with a stride of 2, making the spatial dimensions of the output half those of the input (downsampling the height and width by a factor of 2). This downsampling keeps the feature maps small, which reduces both the computation and the number of parameters in the subsequent layers compared to using smaller pooling strides.
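Continuing the same sketch, a 2x2 max pooling layer with stride 2 halves the height and width:

```python
import torch
import torch.nn as nn

feature_map = torch.randn(1, 64, 224, 224)

# 2x2 window, stride 2 -- the VGG-style max pooling.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

print(pool(feature_map).shape)  # torch.Size([1, 64, 112, 112]) -- height and width halved
```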
VGG is organized into VGG blocks, where a block is a sequence of convolutions (each followed by a ReLU) ending with a maximum pooling layer.
The purpose of the successive convolutional layers is to extract increasingly higher-level features from the input, while the maximum pooling layer that follows them down-samples their output. The original VGG has five VGG blocks; the first block has 64 output channels, and every subsequent block doubles the number of output channels until it reaches 512.
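As an illustration of this block structure, here is a minimal sketch of a helper, hypothetically named `vgg_block`, that stacks convolutions with ReLUs and ends with max pooling; the number of convolutions per block below follows the common VGG-16 layout and is meant to be illustrative rather than a faithful reimplementation:

```python
import torch.nn as nn

def vgg_block(num_convs, in_channels, out_channels):
    """A VGG block: num_convs (3x3 conv + ReLU) pairs followed by 2x2 max pooling."""
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Five blocks, starting at 64 channels and doubling up to 512 (VGG-16-like layout).
features = nn.Sequential(
    vgg_block(2, 3, 64),
    vgg_block(2, 64, 128),
    vgg_block(3, 128, 256),
    vgg_block(3, 256, 512),
    vgg_block(3, 512, 512),  # the channel count stops doubling at 512
)
```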
Local response normalization (LRN) was a technique used in AlexNet to normalize the activations of neurons in the same layer, which makes the response of the layer less sensitive to changes in the input. The authors of VGG did not find it to improve the results when benchmarking the shallower versions of the architecture, so it was not included in the deeper VGG models.
Just like AlexNet, VGG uses a softmax activation for the output layer and SGD with momentum as the optimizer. Weight decay (L2 regularization) and dropout, both also present in AlexNet, are used to prevent overfitting.
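A hedged sketch of such a training setup in PyTorch is shown below; the hyperparameter values follow those reported in the VGG paper but should be treated as illustrative:

```python
import torch
import torch.nn as nn

# Classifier head with dropout, similar in spirit to the VGG fully connected layers.
classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),  # 1000 ImageNet classes; softmax is applied inside the loss
)

# SGD with momentum and weight decay (L2 regularization), as in AlexNet and VGG.
optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-2, momentum=0.9, weight_decay=5e-4)

# CrossEntropyLoss combines log-softmax with the negative log-likelihood loss.
criterion = nn.CrossEntropyLoss()
```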
Considerations for VGG
VGG is quite heavy and slow, with roughly 138 million parameters. The interesting thing is that most of these parameters come from the fully connected layers rather than the convolutional ones (the first fully connected layer alone has roughly 100 million weights, and the last two fully connected layers contribute another 20 million or so). As you might know, a fully connected layer can be converted into a convolutional one, and later architectures avoid such large fully connected heads (since they do not improve the results and bring in a lot of parameters).
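The arithmetic behind these numbers is easy to verify: after five halvings, the last feature map is 7x7 with 512 channels, so a quick back-of-the-envelope calculation (ignoring biases) gives:

```python
first_fc = 7 * 7 * 512 * 4096   # 102,760,448 -- roughly 100 million weights
second_fc = 4096 * 4096         #  16,777,216
third_fc = 4096 * 1000          #   4,096,000 -- last two together: ~21 million
print(first_fc, second_fc, third_fc)
```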
One of the main contributions of VGG is the idea of stacking blocks of repeated convolutions, a pattern that carried over into later CNN architectures. VGG exploited the idea that deeper models result in better performance, an idea that was later challenged by other models (e.g., Inception and ResNet, which relied on more advanced structures rather than simply stacking more plain layers).
Conclusion
As a result, you are now familiar with the main building blocks of the VGG architecture, how it compares with AlexNet, its known nuances, and the ideas it brought forward.