MobileNet is a lightweight CNN architecture designed for mobile and embedded applications. Its core features are depthwise separable convolutions, which significantly reduce the number of parameters, and two hyperparameters that help to further reduce the computational cost, namely, the width multiplier and the resolution multiplier.
In this topic, we will primarily focus on the three features described above and consider the general architectural setup.
An introduction to depth-wise separable convolutions
Before we jump into the MobileNet architecture itself, we should start by looking at its main additions.
In terms of the architecture, MobileNet follows the usual convolutional network setup, although with a few nuances:
You have the standard convolution, followed by the depth-wise convolution, followed by the point-wise convolution (a 1 × 1 convolution whose depth corresponds to the depth of the input feature map, and which mainly adjusts the number of channels in the feature maps); the last two compose the depth-wise separable convolution. It’s important to note that the architecture mostly runs on depth-wise separable convolutions, but we will get there shortly.
To get to the depth-wise separable convolution, we have to understand what the depth-wise convolution is and how it differs from the regular one. In a regular convolution, the filter extends through the full depth of the input, so if the input has a depth of M, then a D_K × D_K filter actually has the shape D_K × D_K × M. In a depth-wise convolution, each filter is applied to each individual input channel separately. For example, if you have an input with 128 channels, you would apply 128 separate filters, each processing one of the input channels (thus, the shape of each filter in this case is D_K × D_K × 1). The depth-wise convolution reduces the computation cost of the standard convolution, but there is a catch: the input channels are simply filtered channel-by-channel, not combined. The standard convolution operation filters the features with the convolutional filters and combines them to produce new features.
This is where the motivation for the point-wise convolution addition comes from: to combine the outputs of the depth-wise convolution and make the features cross-channel. Once we have combined the depth-wise convolution with the point-wise convolution, we have almost arrived at the depth-wise separable convolution without going into too many details.
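To make the factorization concrete, here is a quick sketch in plain Python (the layer sizes are illustrative assumptions, not taken from the text) comparing the number of weights in a standard convolution against a depth-wise plus point-wise pair:

```python
def standard_conv_params(k, m, n):
    # N filters, each spanning all M input channels: k*k*M weights per filter
    return k * k * m * n

def depthwise_separable_params(k, m, n):
    # Depth-wise part: one k x k filter per input channel -> k*k*M weights;
    # point-wise part: N filters of shape 1 x 1 x M -> M*N weights
    return k * k * m + m * n

# Hypothetical example: 3x3 filters, 128 input and 128 output channels
m, n, k = 128, 128, 3
print(standard_conv_params(k, m, n))        # 147456
print(depthwise_separable_params(k, m, n))  # 17536, roughly 8x fewer weights
```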
[Figure: standard convolution layer (left) vs. depth-wise separable convolution layer (right)]
Going into the specifics, as a reminder, a standard convolutional layer usually consists of an n × n convolution (suppose n = 3 in this case; there is usually also a depth dimension, but we omit it), followed by batch normalization, followed by the ReLU non-linearity (scheme on the left). A depth-wise separable convolution layer instead does the depth-wise convolution, followed by batch norm, followed by ReLU, and then the results are combined back together with the point-wise convolution, again followed by batch norm and ReLU (scheme on the right):
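As a sketch, the right-hand scheme can be written in PyTorch; the class name and channel arguments are our own illustrative choices, not part of the original release:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise 3x3 conv -> BN -> ReLU, then point-wise 1x1 conv -> BN -> ReLU."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # groups=in_channels makes each filter see exactly one input channel
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        # 1x1 convolution combines the channels back together
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))
```

A block like this is a drop-in replacement for a single standard convolutional layer with the same input and output channel counts.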
We briefly mentioned how the purpose of the depth-wise separable convolution is to reduce the computational costs, but it’s important to see just how much cost reduction is happening due to the usage of the depth-wise separable convolutions.
Suppose a regular convolution (denoted by option 'a' below) has M input channels, N output channels, a D_K × D_K filter, and a D_F × D_F feature map (basically, the spatial dimensions of the output).
b — depth-wise convolution;
c — point-wise convolution.
Then, the computational complexity of the regular convolution will be equal to D_K · D_K · M · N · D_F · D_F.
For the depth-wise convolution (option 'b' in the illustration above), the computational cost will be D_K · D_K · M · D_F · D_F, but because the depth-wise convolution has to be followed by the 1 × 1 point-wise convolution (denoted by 'c' in the illustration above) to combine the outputs, the depth-wise separable convolution has a computational cost of D_K · D_K · M · D_F · D_F + M · N · D_F · D_F. The reduction in computation from the original convolution will be 1/N + 1/D_K², so with 3 × 3 filters the depth-wise separable convolution will require roughly 1/8 to 1/9 of the computation of the regular convolution, while the accuracy drop is only marginal.
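The cost formulas above can be checked numerically in plain Python; the concrete layer sizes here are example assumptions:

```python
def conv_cost(dk, m, n, df):
    # Multiply-adds of a standard convolution: D_K * D_K * M * N * D_F * D_F
    return dk * dk * m * n * df * df

def separable_cost(dk, m, n, df):
    # Depth-wise part plus point-wise part
    return dk * dk * m * df * df + m * n * df * df

dk, m, n, df = 3, 512, 512, 14   # hypothetical layer sizes
ratio = separable_cost(dk, m, n, df) / conv_cost(dk, m, n, df)
print(ratio)                 # equals 1/N + 1/D_K**2
print(1 / n + 1 / dk ** 2)   # ~0.113, i.e. roughly 1/9 of the original cost
```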
But MobileNet’s optimizations don’t stop here.
Further reductions: width multipliers and resolution multipliers
The width multiplier is a hyperparameter that allows one to scale the network's width when it needs to be faster and smaller (there is a tradeoff with accuracy, but not a severe one). The width multiplier, α, is a real number in the range (0, 1] which thins the network uniformly at each layer.
For a given layer and width multiplier α, the number of input channels M becomes αM and the number of output channels N becomes αN. If α is 0.5, the computational cost would be approximately 0.25 times the original, since the computational cost is roughly proportional to the square of the number of channels.
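A quick sanity check of the roughly quadratic effect of α on the separable-convolution cost (plain Python; the layer sizes are illustrative assumptions):

```python
def separable_cost(dk, m, n, df, alpha=1.0):
    # Width multiplier thins both the input and output channels
    m, n = int(alpha * m), int(alpha * n)
    return dk * dk * m * df * df + m * n * df * df

dk, m, n, df = 3, 512, 512, 14   # hypothetical layer sizes
ratio = separable_cost(dk, m, n, df, alpha=0.5) / separable_cost(dk, m, n, df)
print(round(ratio, 3))   # close to 0.25, since the point-wise term dominates
```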
The second hyperparameter that reduces the computational cost is the resolution multiplier (ρ), a real number in the range (0, 1] that is used to reduce the size of the input image and, consequently, the internal representation at every layer of the network. For instance, if the original input size is 224 × 224 and the resolution multiplier is 0.5, the new input size would be 112 × 112.
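Because the feature map size D_F enters the cost squared, the resolution multiplier also scales the cost by ρ² (plain Python sketch; the sizes are illustrative assumptions):

```python
def separable_cost(dk, m, n, df):
    # Depth-wise part plus point-wise part; D_F appears squared in both
    return dk * dk * m * df * df + m * n * df * df

dk, m, n = 3, 512, 512   # hypothetical layer sizes
rho = 0.5
full = separable_cost(dk, m, n, 14)
reduced = separable_cost(dk, m, n, int(rho * 14))  # D_F becomes rho * D_F
print(reduced / full)    # exactly rho**2 = 0.25 here
```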
Both the width and the resolution multipliers allow for control of computation costs but introduce an accuracy trade-off.
Considerations for the network
MobileNet diverges from the usual large model regularization by using less data augmentation, dropping label smoothing, and applying little to no weight decay on the depth-wise filters (harsh regularization is just not required when there are already so few parameters, and overfitting, in general, is much less of a concern for the smaller models).
In total, MobileNet has 28 layers (counting the depth-wise and point-wise convolutions separately); batch normalization and ReLU follow every layer except the final fully connected layer, which has no non-linearity and feeds into a softmax. Average pooling precedes the fully connected layer to reduce the spatial resolution to 1.
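Counting the layers the way the text does (depth-wise and point-wise convolutions counted separately, following the original paper's layer table) comes down to simple arithmetic:

```python
initial_conv = 1        # one standard 3x3 convolution at the input
separable_blocks = 13   # each contributes a depth-wise + a point-wise layer
fully_connected = 1     # final classifier layer, fed into softmax
total = initial_conv + separable_blocks * 2 + fully_connected
print(total)   # 28
```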
Conclusion
As a result, you are now familiar with:
- The main distinctive features of the MobileNet architectures, namely, depth-wise separable convolutions, width multiplier, and resolution multiplier;
- The depth-wise convolution applies each filter to one individual input channel and has to be followed by a 1 × 1 point-wise convolution to combine the outputs and generate new features; this combination forms the depth-wise separable convolution, which effectively reduces the computational costs;
- The width and resolution multipliers are two hyperparameters that help to further control the size of the network and make it even lighter, albeit with a trade-off in accuracy.