Inception (also known as GoogLeNet) was first introduced in 2014 and brought forward the idea of combining operations into blocks for design simplicity. Later versions (e.g., V2 and V3) remained influential in deep learning research. In this topic, we will see the core ideas behind Inception and review other models from the Inception family.
The general setup
Inception is a classic CNN that runs on the same core components as its predecessors—you have your stacked convolutions, pooling, softmax layers, etc. So, what makes it different from, let's say, VGG? Let's take a look at the simplified Inception architecture (simplified in the sense that the original illustration is quite hefty, but all aspects are preserved from the original paper):
As usual, we start with the input and perform a series of convolutions with max pooling and local response normalization (LRN). LRN's main purpose is to improve generalization by normalizing the activations of neurons in the same layer, making the response of the layer less sensitive to changes in the input. Then, we enter the first Inception block. The main idea of the Inception block is that we don't have to manually choose a single operation (a convolution with a specific filter size or a pooling): we can apply them all by putting multiple filters and a pooling operation inside a single block (here we have 3x3, 5x5, and 1x1 filters, plus a pooling operation; the specifics of applying a 1x1 before a convolution or after the pooling will be covered in more detail in the next section). This essentially enables the network to learn the appropriate operation on its own.
Next, after multiple Inception blocks, the output gets branched: one branch continues into the next Inception block, while the other goes into an auxiliary classifier, marked in yellow. In Inception, the training loss is a sum of three components: the losses of the two auxiliary classifiers (weighted so that they contribute less; the original paper gives a weight of 0.3 to each) and the "normal" loss right at the top of the architecture (note that this only applies to training; the auxiliary classifiers are dropped at inference). Essentially, the auxiliary classifier tells us how well the intermediate Inception block is doing at solving the final problem: for example, if we stopped right after the first three Inception blocks, what would the result be? This helps the network bring gradients to the earlier layers during training, since the original path is long and the gradients vanish easily. Later, it was argued that the auxiliary classifiers mainly provide further regularization (especially if combined with BatchNorm or dropout) and don't help that much with increasing the gradient signal in the lower layers, but we won't go into the specifics of that.
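To make the weighting concrete, here is a minimal sketch of how the training loss could be assembled (the names `main_out`, `aux1_out`, and `aux2_out` are illustrative; they stand for the logits returned by the main and auxiliary classifiers):

```python
import torch.nn.functional as F

def googlenet_loss(main_out, aux1_out, aux2_out, targets):
    # "Normal" loss from the classifier at the top of the architecture
    main_loss = F.cross_entropy(main_out, targets)
    # Auxiliary losses, down-weighted by 0.3 each as in the original paper;
    # these are only used during training and dropped at inference
    aux_losses = F.cross_entropy(aux1_out, targets) + F.cross_entropy(aux2_out, targets)
    return main_loss + 0.3 * aux_losses
```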
In the upcoming section, we will delve deeper into the Inception block and some other architecture design choices.
A closer look at the Inception block
The naive version of the Inception block can be represented as follows:
What happens here is that the results of every operation are concatenated. Consider a small example of a 28x28x256 input to an Inception block:
A minor comment on the pooling in the above illustration is required. Max pooling normally shrinks the spatial dimensions (such that stacking the results would not be possible). This is resolved by setting the stride to 1 (instead of 3) and adding padding so that the spatial size is preserved. Another issue is that pooling can't reduce the depth of the output: the output's depth will be equal to the input's depth, which might not be desirable. This is why there is a 1x1 convolution after the pooling layer in the 'regular' Inception module.
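Below is a minimal sketch of such a naive block in PyTorch (the filter counts 128, 192, and 96 are illustrative, chosen to match the example above; note the stride-1 pooling with padding):

```python
import torch
import torch.nn as nn

class NaiveInceptionBlock(nn.Module):
    """Naive Inception block: parallel 1x1, 3x3, and 5x5 convolutions plus
    a 3x3 max pooling, concatenated along the channel dimension."""
    def __init__(self, in_channels, c1x1, c3x3, c5x5):
        super().__init__()
        self.branch1 = nn.Conv2d(in_channels, c1x1, kernel_size=1)
        # Padding keeps the spatial size identical across branches
        self.branch3 = nn.Conv2d(in_channels, c3x3, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, c5x5, kernel_size=5, padding=2)
        # stride=1 and padding=1 preserve the spatial dimensions of the input
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x), self.pool(x)],
            dim=1,
        )

x = torch.randn(1, 256, 28, 28)                # the 28x28x256 example input
block = NaiveInceptionBlock(256, 128, 192, 96)
print(block(x).shape)                          # torch.Size([1, 672, 28, 28])
```

Note how the output depth grows to 128 + 192 + 96 + 256 = 672: the pooling branch contributes all 256 input channels, which is exactly the depth issue mentioned above.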
With this design, we arrive at a major problem from the get-go: the number of parameters blows up quite quickly (because the 5x5 and 3x3 convolutions are expensive). This example module ends up with approximately 1M parameters (we will verify this calculation below). So what the 'regular' Inception module does is heavy depth reduction via 1x1 convolutions:
This addition of 1x1 convolutions will effectively drop the parameter count by 2/3, making our example contain approximately 300k parameters instead of 1M:
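As a rough check of these numbers (counting only convolution weights and ignoring biases; the per-branch filter counts are the same illustrative ones as before, with 64-channel 1x1 reductions):

```python
# Naive module on a 28x28x256 input:
naive = (
    1 * 1 * 256 * 128    # 1x1 branch, 128 filters  ->  32,768
    + 3 * 3 * 256 * 192  # 3x3 branch, 192 filters  -> 442,368
    + 5 * 5 * 256 * 96   # 5x5 branch, 96 filters   -> 614,400
)
print(naive)  # 1089536, roughly 1M parameters

# "Regular" module with 1x1 reductions to 64 channels before the 3x3
# and 5x5 branches, and a 64-filter 1x1 projection after the pooling:
reduced = (
    1 * 1 * 256 * 128                      # 1x1 branch
    + 1 * 1 * 256 * 64 + 3 * 3 * 64 * 192  # reduce, then 3x3
    + 1 * 1 * 256 * 64 + 5 * 5 * 64 * 96   # reduce, then 5x5
    + 1 * 1 * 256 * 64                     # 1x1 after pooling
)
print(reduced)  # 346112, roughly 300k parameters
```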
As a note, the choice of 3x3 and 5x5 is arbitrary, and there is no deeper meaning behind it. Also, formally, there are no restrictions on the specific convolution or pooling operation inside the Inception module; this is just the most common and classical configuration.
Considerations for the architecture
In this section, let's briefly review other architectural choices for Inception. Originally, it was trained with momentum SGD and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). All convolutions (including those inside the Inception modules) use ReLU as the activation, and softmax is present in multiple parts (the auxiliary classifiers and the 'main' classifier). The usage of adaptive pooling is mostly for convenience, simplifying fine-tuning on other label sets. The network is 22 layers deep (27 if poolings are factored in) and contains around 100 layers if counted as independent building blocks.
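As a minimal sketch of that schedule (a plain step decay; `base_lr` is an arbitrary starting value):

```python
def learning_rate(base_lr: float, epoch: int) -> float:
    # Decrease the learning rate by 4% every 8 epochs
    return base_lr * 0.96 ** (epoch // 8)

print(learning_rate(0.01, 0))   # 0.01
print(learning_rate(0.01, 8))   # 0.0096
print(learning_rate(0.01, 16))  # ~0.009216
```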
Dropout is present both in the main classifier and in the two auxiliary classifiers (the main one uses a 40% dropout rate, while the auxiliary classifiers use 70%).
In the next section, we will see the architectural changes in the later iterations of Inception.
The Inception model zoo
Since its introduction in 2014, there have been multiple extensions of the original architecture; versions 2 through 4 and the combination with ResNet are the primary ones. We will focus on the V2-V4 variants here. Below is a quick summary table with the main features and changes from the predecessors.
| Architecture | What was changed/added? | Introduced in |
|---|---|---|
| Inception-V2 | BatchNorm, convolution factorization | https://arxiv.org/abs/1512.00567 (BatchNorm itself: https://arxiv.org/abs/1502.03167) |
| Inception-V3 | SGD replaced with RMSProp, label smoothing, asymmetric convolution factorization | https://arxiv.org/abs/1512.00567 |
| Inception-V4 | Addition of a stem after the input, Inception-(A, B, C) modules | https://arxiv.org/abs/1602.07261 |
| Inception-ResNet | Skip connections, activation scaling | https://arxiv.org/abs/1602.07261 |
The line between what is defined as V2 vs. V3 is a little blurry (they were described in the same paper, but V2 was first mentioned in the BatchNorm paper). Inception-V2 was an intermediate architecture before V3 that introduced convolution factorization and also added BatchNorm (which simply had not been published yet when V1 was being developed). Performing larger convolutions (e.g., 5x5 or 7x7) is computationally expensive, so it was discovered that stacking multiple smaller convolutions yields the same effective receptive field with a reduction in the parameter count. For example, two stacked 3x3 convolutions give a 5x5 receptive field, and three stacked 3x3 convolutions give a 7x7 receptive field. In terms of parameters, the original 5x5 has 25 trainable weights (per input-output channel pair), but two stacked 3x3 convolutions have just 18 (about 28% fewer), while keeping the original receptive field.
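A quick check of both claims (a sketch using single-channel convolutions):

```python
import torch
import torch.nn as nn

# Two stacked 3x3 convolutions collapse a 5x5 patch into a single value,
# i.e., they have the same 5x5 effective receptive field:
patch = torch.randn(1, 1, 5, 5)
two_3x3 = nn.Sequential(nn.Conv2d(1, 1, 3), nn.Conv2d(1, 1, 3))
print(two_3x3(patch).shape)  # torch.Size([1, 1, 1, 1])

# Weight count per input-output channel pair:
print(5 * 5, 2 * 3 * 3)  # 25 vs. 18, about 28% fewer parameters
```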
Sounds good, but now the 3x3 filters are the most expensive ones. Can we do better? It turns out the standard 2D convolution can be factorized further into two 1D convolutions: generally speaking, an nxn convolution is broken down into a 1xn convolution followed by an nx1 convolution (this asymmetric convolution factorization was added in Inception-V3). V3 also replaced momentum SGD with RMSProp (which simply showed better performance when compared with momentum). In addition, V3 uses label smoothing as a form of regularization at the classifier level. Label smoothing makes the model less confident about the assigned scores by setting the target value close to 1 for the actual class and redistributing the remaining probability mass among the other classes (e.g., with 5 classes, a smoothing factor of 0.1, and [0, 1, 0, 0, 0] as the labels, they would become [0.02, 0.92, 0.02, 0.02, 0.02]).
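Here is a minimal sketch of label smoothing under the formulation above (a smoothing factor `eps` of 0.1, redistributed uniformly over all classes):

```python
import torch
import torch.nn.functional as F

def smooth_labels(targets, num_classes, eps=0.1):
    # Each target becomes (1 - eps) * one_hot + eps / num_classes
    one_hot = F.one_hot(targets, num_classes).float()
    return (1.0 - eps) * one_hot + eps / num_classes

print(smooth_labels(torch.tensor([1]), num_classes=5))
# tensor([[0.0200, 0.9200, 0.0200, 0.0200, 0.0200]])
```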
Inception-V4 made the Inception blocks uniform with respect to the grid size: Inception-A modules operate on the 35x35 grid, Inception-B on the 17x17 grid, and Inception-C on the 8x8 grid:
Another new component in V4 is the stem. The purpose of the stem is to perform initial processing right after the input; it typically consists of a series of convolutional layers, pooling layers, and possibly normalization layers. The stem reduces the spatial dimensions of the input before it enters the Inception blocks, which helps reduce the computational cost further into the network while preserving the main features.
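Below is a simplified sketch of what a stem might look like (the channel counts and layer sequence loosely follow the first layers of the Inception-V4 stem, but the actual stem also contains parallel branches that get concatenated):

```python
import torch
import torch.nn as nn

# A generic stem: aggressively downsample the input before the Inception blocks
stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2),    # 299x299 -> 149x149
    nn.Conv2d(32, 32, kernel_size=3),             # 149x149 -> 147x147
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # 147x147 -> 147x147
    nn.MaxPool2d(kernel_size=3, stride=2),        # 147x147 -> 73x73
)

x = torch.randn(1, 3, 299, 299)
print(stem(x).shape)  # torch.Size([1, 64, 73, 73])
```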
Conclusion
As a result, you are now familiar with the following:
- The Inception network brought forward the idea of stacking multiple operations inside of a single (Inception) block, which enables simpler design and allows the network to learn the appropriate operation on its own;
- Later versions added various components (e.g., BatchNorm) and modifications, but the architectural design itself remained the same at its core.