EfficientNet, introduced in 2019, is a CNN architecture built on the idea that the main architectural settings of a CNN (namely, width, depth, and input resolution) are interdependent and can therefore be changed systematically, in contrast to previous approaches that changed these settings more or less arbitrarily. This uniform approach makes it possible to select the most appropriate settings under predefined resource constraints while maximizing accuracy.
In this topic, we will look at the improvements introduced by EfficientNet.
A quick look at model scaling
To start off, by the time EfficientNet appeared, it was widely accepted that scaling models up (i.e., making them bigger along some dimension) typically led to higher accuracy, but how to scale models properly was not well understood. Usually, only a single network setting would be changed (e.g., the number of layers), while the others were kept as is.
For example, ResNet was scaled from 18 layers to 200 (corresponding to ResNet-18 and ResNet-200, respectively) by appending more layers. This layer scaling is also referred to as depth scaling. Other architectures, such as Wide ResNet, scaled the width of the layers (i.e., the number of channels). Another (less popular) approach was scaling the resolution of the inputs.
These three scaling dimensions are presented in the illustration below:
There were two issues with scaling a single dimension. First, the three dimensions are not independent of each other: for example, if the resolution goes up, it makes sense to make the network both deeper and wider, since there are more details to capture when there are more pixels. Second, with single-dimension scaling the accuracy tends to saturate, meaning that past a certain point the improvements diminish even if the dimension keeps being expanded. In the illustration below, the first plot shows how increasing the width affects the accuracy (with the other two dimensions fixed), while the second and third plots show the accuracy for increased depth and input resolution:
One can observe that the accuracy approaches roughly 80% at some point and plateaus there. This motivates the core proposal of EfficientNet: what if all three dimensions are scaled uniformly at the same time? This is exactly what the compound scaling method does, which we will review a bit further down.
Compound scaling
Before looking at compound scaling, it is helpful to lay out the general structure of EfficientNet.
As mentioned previously, EfficientNet is a CNN designed to maximize accuracy while adhering to imposed resource constraints (namely, FLOPS and memory limits). The baseline model, EfficientNet-B0, was obtained with a multi-objective neural architecture search that optimizes both accuracy and FLOPS; all later expansions start from this baseline network. B0 mainly consists of mobile inverted bottleneck blocks (MBConv), which have the following structure:
Squeeze-and-excitation (SE) optimization is also added; it improves the representational power of the network by enabling dynamic channel-wise feature recalibration. The process is: conv block as input -> global average pooling (to 'squeeze' each channel into a single value) -> FC with ReLU -> FC with sigmoid -> each feature map of the conv block is weighted by the output of this side branch ('excitation'). In essence, SE lets the network focus on the most informative channels.
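To make these two building blocks more tangible, here is a rough PyTorch sketch of an SE module and a simplified MBConv block that uses it. This is only an illustration of the idea, not the official implementation: the exact kernel sizes, expansion ratios, activation placement, and normalization details of EfficientNet are simplified here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: re-weight channels using globally pooled statistics."""
    def __init__(self, channels, reduced):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # 'squeeze': one value per channel
        self.excite = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1),  # FC over channels (as a 1x1 conv)
            nn.ReLU(),
            nn.Conv2d(reduced, channels, kernel_size=1),
            nn.Sigmoid(),                                 # per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.excite(self.squeeze(x))           # 'excitation': rescale each feature map

class MBConvBlock(nn.Module):
    """Simplified mobile inverted bottleneck: 1x1 expansion -> depthwise conv -> SE -> 1x1 projection."""
    def __init__(self, in_ch, out_ch, expand=6, kernel=3, stride=1):
        super().__init__()
        mid = in_ch * expand
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),          # expand channels
            nn.BatchNorm2d(mid),
            nn.SiLU(),
            nn.Conv2d(mid, mid, kernel, stride, kernel // 2,
                      groups=mid, bias=False),             # depthwise convolution
            nn.BatchNorm2d(mid),
            nn.SiLU(),
            SEBlock(mid, reduced=max(1, in_ch // 4)),      # squeeze-and-excitation
            nn.Conv2d(mid, out_ch, 1, bias=False),         # project back down (no activation)
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out       # residual connection when shapes match

x = torch.randn(1, 16, 56, 56)
print(MBConvBlock(16, 16)(x).shape)                        # torch.Size([1, 16, 56, 56])
```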
The overall B0 architecture is given as
Later expansions are done with compound scaling. Compound scaling uses a compound coefficient φ to uniformly scale network depth, width, and resolution:

depth: d = α^φ
width: w = β^φ
resolution: r = γ^φ
subject to: α · β² · γ² ≈ 2, with α ≥ 1, β ≥ 1, γ ≥ 1

where α, β, and γ are constants determined by a grid search. φ is a user-specified coefficient that controls how many more resources are available for model scaling, while α, β, and γ specify how to assign these extra resources to network depth, width, and resolution respectively.
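As a quick sanity check on these formulas, the snippet below computes the three multipliers for a few values of φ. The constants used here (α = 1.2, β = 1.1, γ = 1.15) are the values the original paper reports from its grid search on B0; the snippet itself is only a sketch, not a reference implementation. Note that the total FLOPS grow roughly as (α · β² · γ²)^φ ≈ 2^φ.

```python
# Illustration of compound scaling with the paper's reported constants.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return the depth, width, and resolution multipliers for a given compound coefficient."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

for phi in range(4):
    d, w, r = compound_scale(phi)
    flops_ratio = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi   # roughly 2 ** phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, ~FLOPS x{flops_ratio:.2f}")
```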
Starting from the baseline model (B0), the compound scaling method is applied in two steps:
Fix φ = 1, assuming twice more resources are available, and do a small grid search of α, β, γ based on the objective of maximizing accuracy under the FLOPS and memory limits, subject to α · β² · γ² ≈ 2 and α, β, γ ≥ 1.
Fix α, β, and γ as constants and scale the baseline up with different values of φ using the equations above, to obtain EfficientNet-B1 to B7.
Note that the grid search in step 1 is performed only once on the small baseline network; the same constants are then reused with larger values of φ to produce every variant up to B7. The main point is that if there are additional resources available, you scale further to make more accuracy gains. Below is a comparison plot of accuracy for four scaling strategies applied to EfficientNet-B0 (width/depth/resolution/compound), with compound scaling outperforming the other three:
Further architectural details
EfficientNet is trained on ImageNet with RMSProp, batch normalization, weight decay, and a decaying learning rate schedule. SiLU (Swish-1) is used as the activation function. SiLU maps negative inputs to small negative values (unlike ReLU, which zeroes them out) and tends to work better in deeper architectures. Regularization becomes stronger from B0 to B7 (the bigger the model, the more regularization it requires), so the dropout ratio is increased linearly from 0.2 for B0 to 0.5 for B7. Early stopping is also used.
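For reference, here is a tiny sketch of SiLU and of the linearly increasing dropout ratio described above. The interpolation formula is only an illustration of that linear increase, not the exact per-variant values used by the reference implementation.

```python
import torch

def silu(x):
    """SiLU (Swish-1): x * sigmoid(x). Negative inputs become small negative values,
    unlike ReLU, which zeroes them out."""
    return x * torch.sigmoid(x)

print(silu(torch.tensor([-2.0, -0.5, 0.0, 1.0])))   # tensor([-0.2384, -0.1888,  0.0000,  0.7311])

# Illustrative dropout schedule: linear increase from 0.2 (B0) to 0.5 (B7).
def dropout_rate(variant_index):
    return 0.2 + (0.5 - 0.2) * variant_index / 7

print([round(dropout_rate(i), 2) for i in range(8)])  # [0.2, 0.24, ..., 0.5]
```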
Conclusion
As a result, you are now familiar with the architectural layout of EfficientNet, the specifics of the compound scaling method, which scales width, depth, and input resolution simultaneously, and how this affects the accuracy of the network.