
Autoencoders


The autoencoder is a type of neural network used primarily for data compression. It is designed to learn the underlying representations of the input data, often called encodings, by training the network to approximate its own inputs. It is built from two main parts: the encoder, which learns to compress the input data into the encoding, and the decoder, which learns to reconstruct the original data from the encoding. This forces the autoencoder to perform dimensionality reduction and learn the most important features of the input data.

This topic introduces the general idea of an autoencoder and covers some of its main variants.

The general overview

The autoencoder consists of two parts: the encoder and the decoder. The encoder compresses the input data into a lower-dimensional representation, referred to as the latent space or the bottleneck. The encoder is represented as a function h = f(x), which maps the input data x into a compressed representation h. The decoder tries to reconstruct the data from the lower-dimensional representation back to an approximation of its original form. It is a function g(f(x)), which outputs a reconstruction \hat{x} of the original input x. The encoder and the decoder can be fully connected or convolutional networks (when working with image data).
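As a rough sketch, the encoder and decoder can be written as two functions. The dimensions below are hypothetical, and the randomly initialized weights are stand-ins for trained parameters; a real autoencoder would learn them by minimizing the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 8-dimensional input, 3-dimensional bottleneck.
input_dim, latent_dim = 8, 3

# Untrained placeholder weights (a trained model would learn these).
W_enc = rng.normal(size=(input_dim, latent_dim))
W_dec = rng.normal(size=(latent_dim, input_dim))

def encoder(x):
    # h = f(x): compress the input into the latent representation.
    return np.tanh(x @ W_enc)

def decoder(h):
    # x_hat = g(h): map the latent code back to the input space.
    return h @ W_dec

x = rng.normal(size=(1, input_dim))
h = encoder(x)        # latent code, shape (1, 3)
x_hat = decoder(h)    # reconstruction, shape (1, 8)
print(h.shape, x_hat.shape)
```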

Given this, a regular autoencoder architecture could be presented as follows (the dimensions are arbitrary in this case, and it's also possible to add more hidden layers, but we will discuss it a bit further down the line):

An example of autoencoder architecture

The objective of an autoencoder is to minimize the difference between the input and the output; this difference is called the reconstruction error. In the case of the regular autoencoder, the reconstruction error is given as

L(x, g(f(x))),

where L is the loss function; MSE is used most commonly, but there is no particular constraint on the choice.
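For instance, the MSE reconstruction error can be computed as follows (the input and reconstruction values here are made up for illustration):

```python
import numpy as np

def reconstruction_error(x, x_hat):
    # Mean squared error between the input and its reconstruction.
    return np.mean((x - x_hat) ** 2)

x = np.array([1.0, 2.0, 3.0])        # original input
x_hat = np.array([1.1, 1.9, 3.2])    # imperfect reconstruction
err = reconstruction_error(x, x_hat)
print(err)  # mean of [0.01, 0.01, 0.04] ≈ 0.02
```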

Note that in this case, we are looking at an autoencoder whose hidden layer (aka the bottleneck) is smaller than the input/output dimensionality. This autoencoder is referred to as an undercomplete autoencoder. The bottleneck ensures that the network won't memorize the inputs, because we don't aim to learn the identity function. Instead, what we want is to learn a meaningful representation (i.e., a compression) of the input data.

Sparse autoencoders

There are some nuances w.r.t. the bottleneck, though. First, we can reduce the dimensionality of the hidden layer, essentially constraining the information that flows through the network (and, hopefully, avoiding overfitting). A reduced bottleneck dimensionality does not guarantee that the identity function won't be learned, and we can only rely on constraining the bottleneck enough (if the capacity of the encoder/decoder is high, the network can still end up learning the identity even with a very small number of hidden units).

If the bottleneck keeps the input/output dimensionality (or contains more units), the network can simply learn the identity function, which is not desirable. However, we can apply regularization to prevent this, and thus the loss will look like

L(x, g(f(x))) + \text{regularizer},

where the reconstruction error (first term) corresponds to learning a meaningful representation of the inputs, and the regularizer prevents learning the identity. There aren't any specific constraints on what the regularization is, e.g., L1 or KL-divergence could be applied. This loss introduces a trade-off: the autoencoder should reconstruct the input well, and also, the representation should be meaningful and generalizable.

This regularized version is typically referred to as the sparse autoencoder. It's important to note that in sparse autoencoders, the activations are regularized (instead of the weights), thus making the network learn the encoding and the decoding by activating only a fraction of neurons. So, for example, the loss with the L1 activation regularization can be defined as

L(x, g(f(x))) + \lambda \sum\limits_{i}|a_i^{(h)}|,

where i indexes the activations a in the hidden layer h, and \lambda is a scaling hyperparameter.
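A minimal sketch of this loss, assuming hypothetical input, reconstruction, and hidden-activation values:

```python
import numpy as np

def sparse_loss(x, x_hat, activations, lam=0.1):
    # Reconstruction term plus L1 penalty on the hidden activations
    # (note: the activations are penalized, not the weights).
    mse = np.mean((x - x_hat) ** 2)
    l1 = lam * np.sum(np.abs(activations))
    return mse + l1

x = np.array([1.0, 2.0])
x_hat = np.array([1.0, 2.0])      # perfect reconstruction: MSE term is 0
h = np.array([0.5, 0.0, -0.5])    # sparse hidden activations
loss = sparse_loss(x, x_hat, h, lam=0.1)
print(loss)  # 0 + 0.1 * (0.5 + 0.0 + 0.5) = 0.1
```

The L1 term pushes activations toward exactly zero, so only a fraction of neurons fire for any given input.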

As a fun fact, PCA can actually be considered a linear case of the autoencoder: make the encoder and the decoder linear (i.e., keep the activations linear) and use the MSE loss. The autoencoder is a more powerful version of PCA (due to the non-linearity) with some additional applications (such as image reconstruction).

A comparison between the autoencoder and PCA
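The PCA-as-linear-autoencoder view can be sketched directly: the top-k principal directions act as a linear encoder, and their transpose as a linear decoder. The data below is synthetic, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)                 # PCA assumes centered data

# Principal directions via SVD of the data matrix.
_, _, Vt = np.linalg.svd(X, full_matrices=False)

def pca_autoencoder_error(k):
    V = Vt[:k].T                    # (5, k): top-k principal directions
    H = X @ V                       # linear "encoder": h = V^T x
    X_hat = H @ V.T                 # linear "decoder": x_hat = V h
    return np.mean((X - X_hat) ** 2)

errors = [pca_autoencoder_error(k) for k in (1, 2, 5)]
print(errors)  # decreases as the "bottleneck" k grows; ~0 at k = 5
```

With k equal to the input dimensionality, the linear map is invertible and the reconstruction is exact, mirroring the identity-learning issue discussed above for non-bottlenecked autoencoders.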

Looking at other variants

We have looked at the undercomplete and sparse autoencoders, which aim to approximate the original input x while learning the underlying data distribution along the way (i.e., an autoencoder that can generalize).

We could approach generalization differently: what if we change the setting so that the input differs from the output, preventing the autoencoder from learning a perfect mapping? This input-output difference is introduced by slightly corrupting the input via random noise injection, while we still aim to approximate the original (pre-noise) input. In essence, this is what a denoising autoencoder does.

A denoising autoencoder suppresses the added noise and, ideally, captures the core features of the original data in a lower dimension. It minimizes the following loss:

L(x, g(f(\tilde{x}))),

where \tilde{x} is the corrupted version of the input x.
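The corruption step is typically simple noise injection; here is a sketch with Gaussian noise (the noise level is an arbitrary choice). The key point is that the loss still compares the reconstruction against the clean x, not against \tilde{x}:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, noise_std=0.3):
    # x_tilde = x + Gaussian noise: the encoder only ever sees x_tilde.
    return x + rng.normal(scale=noise_std, size=x.shape)

x = np.ones(6)           # clean input (the reconstruction target)
x_tilde = corrupt(x)     # corrupted input fed to the encoder

# Training would minimize L(x, g(f(x_tilde))); here we only show
# that the corruption actually perturbed the input.
print(np.mean((x_tilde - x) ** 2))
```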

Another popular variant is a contractive autoencoder, which explicitly runs under the assumption that similar inputs should produce similar encodings. More formally, the autoencoder is trained with a requirement that the derivatives of the activations in the hidden layer should be small with respect to the inputs (such that small input changes result in similar encoder outputs).

How is this different from the denoising autoencoder? The denoising autoencoder makes the decoder insensitive to small perturbations of the input (it maps corrupted inputs to their uncorrupted targets), while the contractive autoencoder makes the encoder less sensitive to small changes in the inputs.

To construct a contractive autoencoder, the regularization term that penalizes large derivatives of the activations with respect to the input is added to the loss (such that small input change should not result in a large change of the encoding space). The contractive autoencoder loss is given as

L(x, g(f(x))) + \lambda \sum\limits_{i}||\nabla_x a_i^{(h)}(x)||^2,

where \nabla_x a_i^{(h)}(x) is the gradient of the activations (a) in the hidden layer (h) with respect to the input (x).
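The penalty can be sketched numerically: approximate the Jacobian of the hidden activations with respect to the input via finite differences and sum the squared entries. The encoder weights below are arbitrary stand-ins (a framework with autodiff would compute this gradient exactly).

```python
import numpy as np

# Hypothetical encoder: 3 inputs -> 2 hidden units.
W = np.array([[0.5, -0.2],
              [0.1,  0.3],
              [-0.4, 0.6]])

def encoder(x):
    return np.tanh(x @ W)   # hidden activations a^{(h)}(x)

def contractive_penalty(x, eps=1e-5):
    # Sum of squared entries of the Jacobian da/dx,
    # estimated column by column with finite differences.
    base = encoder(x)
    J = np.empty((base.size, x.size))
    for j in range(x.size):
        x_p = x.copy()
        x_p[j] += eps
        J[:, j] = (encoder(x_p) - base) / eps
    return np.sum(J ** 2)

x = np.array([0.2, -0.1, 0.4])
penalty = contractive_penalty(x)
print(penalty)  # a small positive scalar added to the loss
```

A large penalty means the encoding changes a lot for tiny input changes; minimizing it "contracts" the neighborhood of each input in the latent space.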

Conclusion

As a result, you are now familiar with the general overview of the autoencoder architecture, its main types, and some of its applications.
