
Pooling

5-minute read

Pooling, like convolution, is one of the most common operations in many classic convolutional neural network architectures. In this topic, we will take a look at pooling and consider whether this operation is still needed today.

The basics of the pooling operation

In the simplest case, pooling reports a statistic (e.g., min, max, or average) over some region of the input. It is used to downsample the input while assuming some translation invariance (small changes in the input should not cause major changes in the output). Historically, pooling is performed after a convolution layer.

The two most popular types of pooling are max pooling and average pooling, which report the maximum and the average value within a fixed window, respectively:

Maximum pooling on a 6×6 input with a pooling size of 2

Average pooling on a 6×6 input with a pooling size of 2

Source: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks

One can think of pooling as a matrix operation. Similar to a convolution, pooling requires a window shape (F×F) and a stride value (S). Most commonly, a 2×2 or 3×3 window with a stride of 2 is used. Note that if the pooling windows don't overlap (i.e., the stride is greater than or equal to the window size), the dimensionality drops quickly, but spatial information is also lost.
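To make these numbers concrete, here is a minimal sketch (using PyTorch, purely as an illustration) that pools a 6×6 input with a 2×2 window and a stride of 2; the output spatial size follows the usual (N - F)/S + 1 formula, giving 3×3:

```python
import torch
import torch.nn.functional as F

# A toy 6x6 input: batch of 1, a single channel, values 0..35.
x = torch.arange(36, dtype=torch.float32).reshape(1, 1, 6, 6)

# 2x2 window with a stride of 2 -> non-overlapping pooling.
max_out = F.max_pool2d(x, kernel_size=2, stride=2)
avg_out = F.avg_pool2d(x, kernel_size=2, stride=2)

# Output spatial size: (N - F) / S + 1 = (6 - 2) / 2 + 1 = 3.
print(max_out.shape)   # torch.Size([1, 1, 3, 3])
print(avg_out.shape)   # torch.Size([1, 1, 3, 3])
print(max_out[0, 0])   # each cell holds the maximum of its 2x2 window
print(avg_out[0, 0])   # each cell holds the average of its 2x2 window
```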

Pooling does not have any trainable parameters (being a fixed statistical operation), so a natural question is: how does backpropagation through a pooling layer happen? As an example, let's consider a 2×2 input and a 2×2 max pooling:

A 2×2 input to a pooling layer with forward and backward pass illustrations

The maximum value here is located at (2,2). During the forward pass, the output of the layer is this maximum, and a mask is stored that has a 1 at (2,2) and 0 everywhere else. During the backward pass, the 1 at (2,2) gets multiplied by the gradient passed from the next layer, so after the backward pass we get a 2×2 matrix with zeros everywhere except (2,2), which holds the gradient of the next layer.
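Here is a small check of this gradient routing (a PyTorch sketch; the incoming gradient of 5 is just a made-up value standing in for a hypothetical next layer):

```python
import torch
import torch.nn.functional as F

# A 2x2 input whose maximum (4.0) sits at position (2, 2).
x = torch.tensor([[[[1.0, 2.0],
                    [3.0, 4.0]]]], requires_grad=True)

out = F.max_pool2d(x, kernel_size=2)      # forward pass: out = 4.0
out.backward(torch.tensor([[[[5.0]]]]))   # pretend the next layer passes a gradient of 5

# The gradient is routed only to the position of the maximum.
print(x.grad[0, 0])
# tensor([[0., 0.],
#         [0., 5.]])
```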

Max pooling helps to preserve sharper, stronger features (e.g., edges). Average pooling, on the other hand, tends to extract smoother features (due to the nature of the operation) and also lets the gradient flow to every element of the window, rather than only to the maximum.

Representing pooling through a convolution

Pooling is very similar to a convolution, but without the trainable parameters that a convolution possesses. In fact, average pooling is equivalent to a strided convolution with a fixed (non-trainable) filter.

Considering a 3×3 average pooling (non-overlapping case, a stride of 3), its corresponding convolution can be represented as a 3×3 filter with a constant value of 1/9 in each cell and a stride of 3 (to preserve the non-overlap).
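A quick way to convince yourself of this equivalence (a sketch for a single-channel input; multi-channel inputs would need a grouped convolution):

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 1, 9, 9)  # a single-channel 9x9 input (divisible by 3)

# 3x3 average pooling with a stride of 3 (non-overlapping).
pooled = F.avg_pool2d(x, kernel_size=3, stride=3)

# The same operation as a strided convolution with a fixed 1/9 filter.
kernel = torch.full((1, 1, 3, 3), 1.0 / 9.0)
convolved = F.conv2d(x, kernel, stride=3)

print(torch.allclose(pooled, convolved))  # True
```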

Max pooling can't be represented as a convolution in such a straightforward manner. Essentially, substituting a pooling layer with a convolution performs a similar downsampling operation but introduces trainable parameters that help retain more feature information.
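Below is a sketch of such a substitution (the channel counts are purely illustrative): a convolution followed by 2×2 max pooling versus a single strided convolution that learns its own downsampling; both halve the spatial dimensions.

```python
import torch
import torch.nn as nn

x = torch.rand(1, 16, 32, 32)  # 16 feature maps of size 32x32 (illustrative)

# Classic downsampling: convolution followed by 2x2 max pooling.
with_pool = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

# Alternative: a strided convolution learns its own downsampling.
with_stride = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)

print(with_pool(x).shape)    # torch.Size([1, 32, 16, 16])
print(with_stride(x).shape)  # torch.Size([1, 32, 16, 16])
```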

Various pooling types

In this section, we will briefly look at pooling variations that have emerged over the years, namely global average pooling and adaptive pooling.

Global average pooling is a special case of average pooling where the filter size equals the size of the input feature map. It was designed to replace the fully connected layer (which is prone to overfitting) at the end of CNN architectures with something closer in spirit to a convolution. In global average pooling, the average of each feature map is taken, and the resulting vector is fed into the softmax layer, which builds a correspondence between feature maps and categories.
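A sketch of what that looks like in code (the 10 feature maps of size 7×7 are made-up numbers standing in for 10 categories):

```python
import torch

# Suppose the last convolutional block outputs 10 feature maps of size 7x7,
# one per category (all numbers here are illustrative).
feature_maps = torch.rand(1, 10, 7, 7)

# Global average pooling: one average per feature map -> a vector of length 10.
gap = feature_maps.mean(dim=(2, 3))       # shape (1, 10)
# Built-in equivalent: torch.nn.AdaptiveAvgPool2d(1)(feature_maps).flatten(1)

probs = torch.softmax(gap, dim=1)         # fed straight into softmax
print(gap.shape, probs.shape)             # torch.Size([1, 10]) torch.Size([1, 10])
```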

Adaptive pooling (with two variations, average and max) was designed to handle arbitrary input sizes: the output shape is specified beforehand, and adaptive pooling adjusts the shrinkage factor depending on the size of the input. For example, suppose the output should be 3×3 and we have a 6×6 input; then the pooling window would be 2×2 because 6/2=3. But if the input is 15×15 and the output should remain 3×3, the pooling window will be 5×5. In the illustration below, there are three inputs of different dimensions (N_1, N_2, and N_3); the output dimensions are fixed, and the pooling windows vary between the three inputs.
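A short sketch of this behaviour using PyTorch's adaptive pooling layers (nn.AdaptiveMaxPool2d works the same way):

```python
import torch
import torch.nn as nn

# The output size is fixed to 3x3, whatever the input size is.
pool = nn.AdaptiveAvgPool2d((3, 3))

for size in (6, 15, 21):                  # three differently sized inputs
    x = torch.rand(1, 1, size, size)
    print(size, '->', tuple(pool(x).shape[2:]))
# 6 -> (3, 3)     (2x2 windows)
# 15 -> (3, 3)    (5x5 windows)
# 21 -> (3, 3)    (7x7 windows)
```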

The illustration for adaptive pooling

Adaptive pooling is used when handling variable-sized inputs is a requirement, for example, in object detection, where bounding boxes can often be of different sizes.

The bad and the ugly of pooling

Geoffrey Hinton, a prominent figure in modern deep learning, had the following take on pooling during a Reddit AMA:

The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.

Nowadays, we really don't see pooling that often. So, what are the issues with this seemingly innocuous operation?

As a starting point, pooling essentially performs downsampling (reducing the spatial size of the representation and, with it, the amount of computation and the number of downstream parameters), which was particularly important back when GPUs were not that powerful. Nowadays, that is no longer a major concern.

Furthermore, pooling turned out not to improve performance (and sometimes even degraded it) and led to a loss of information along the way (essentially underfitting the data). This issue can be addressed by replacing pooling with a convolution (as mentioned earlier), which helps to preserve the information and can lead to better performance.

Now, let's delve into the specifics of some assumed properties of pooling. Pooling operates under the assumption that units should be invariant to small translations. This assumed invariance means that the precise location (knowing where exactly the object is) is lost along the way, and the issue becomes more pronounced if the pooling windows don't overlap. So, if the task requires location awareness, pooling is naturally out of the picture.

Speaking of pictures, it turns out that pooling, while providing some translation invariance, is actually not that good at it:

Passing an image of a truck through a simple CNN with max pooling and obtaining confidence scores

Here, we have an (admittedly hard-to-recognize) image of a truck that is being slightly shifted, and the numbers are confidence scores (obtained from a simple ConvNet with max pooling) that the image indeed depicts a truck. The confidence scores fluctuate significantly, from 35% to 99%. The paper 'Making Convolutional Networks Shift-Invariant Again' goes into further detail and proposes anti-aliased pooling, drawing inspiration from signal processing, to restore actual shift invariance.
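The gist of that fix, very roughly, is to blur before subsampling. Below is a hand-rolled sketch of the idea (not the authors' reference implementation; the blur filter and tensor shapes are illustrative): max pooling is applied densely with a stride of 1, and the actual downsampling is done by a fixed low-pass filter with a stride of 2.

```python
import torch
import torch.nn.functional as F

def blur_pool(x, stride=2):
    """Blur each channel with a fixed low-pass filter, then subsample."""
    channels = x.shape[1]
    blur_1d = torch.tensor([1.0, 2.0, 1.0])
    kernel = torch.outer(blur_1d, blur_1d)
    kernel = (kernel / kernel.sum()).repeat(channels, 1, 1, 1)  # one filter per channel
    return F.conv2d(x, kernel, stride=stride, padding=1, groups=channels)

x = torch.rand(1, 16, 32, 32)
# Anti-aliased max pooling: dense (stride-1) max, then blurred subsampling.
dense_max = F.max_pool2d(x, kernel_size=2, stride=1)
out = blur_pool(dense_max)
print(out.shape)  # torch.Size([1, 16, 16, 16])
```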

Conclusion

To sum up, you are now familiar with the basic idea of the pooling operation, some of its later variations (e.g., global average and adaptive pooling), and whether pooling is still required at all at this point.
