
Variational autoencoders


Variational autoencoders (VAE), introduced in 2013, are generative models that extend the regular autoencoder in a couple of ways. Instead of encoding each input into a single point in the latent space, the encoder outputs a probability distribution, which makes the model much better suited for generative tasks.

In this topic, we will provide the general outline for variational autoencoders and take a closer look at the architectural specifics.

The VAE overview

The variational autoencoder structure itself resembles the autoencoder: you have the encoder, the latent space (the output of the encoder, which in the illustration below is split into the mean and the variance; we will get to that shortly), and the decoder:

The general scheme for VAE

Perhaps the most important difference lies in the latent space. In the classical autoencoder, the goal is to learn a compressed representation of the inputs: each observation is described by a vector of (discrete) values, where each value encodes some attribute of the input (e.g., if the autoencoder is trained on digits, different values might represent the presence of certain strokes, circular shapes, etc.). This makes the latent space effectively discrete (which becomes more apparent in higher dimensions), so when you randomly sample a point from it and pass it to the decoder, there is no guarantee that the output will be meaningful: the decoder might never have seen that region of the latent space during training. Moreover, the distribution of the encodings can be arbitrary and not centered at 0, which makes sampling from the latent space of an autoencoder hard. For this reason, while autoencoders are technically capable of generating new data, they are mainly applied to dimensionality reduction.

The VAE's encoder, on the other hand, produces a continuous probability distribution (specifically, the latent space is pushed towards a standard multivariate Gaussian distribution), which makes it much easier to sample from: for any sample drawn from the latent space, we expect the decoder to produce a meaningful reconstruction, and points that are close in the latent space should decode into very similar outputs.
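To make this concrete, here is a minimal sketch of how generation could look once a VAE has been trained. This is not the topic's own code: the framework (PyTorch), the latent size `d = 16`, and the `decoder` network itself are illustrative placeholders standing in for a trained decoder.

```python
import torch

# Hypothetical stand-in for a trained decoder: maps a latent vector of size d
# to a flattened 28x28 image. The sizes and the layers are purely illustrative.
d = 16
decoder = torch.nn.Sequential(
    torch.nn.Linear(d, 400), torch.nn.ReLU(),
    torch.nn.Linear(400, 28 * 28), torch.nn.Sigmoid(),
)

# Because the latent space is pushed towards N(0, I) during training,
# generating new data reduces to sampling z from the standard normal
# and passing it through the decoder.
z = torch.randn(8, d)    # 8 latent vectors sampled from N(0, I)
generated = decoder(z)   # 8 generated (flattened) images
print(generated.shape)   # torch.Size([8, 784])
```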

In the next section, we will formalize this intuition.

Note: in this topic, we are only considering the classical VAE, which can't produce outputs of a specific class (looking at MNIST, the classical VAE will only generate random digits). The conditional VAE, however, can accept a label as part of its input, which allows more control over what is being generated.

Stepping into the structure

Recall that in the case of an autoencoder, the input (assuming an image for consistency), after being passed through the encoder, corresponds to a single point in the latent space. The process of building the encoder for the VAE can be described as follows. We want to build the latent space in $\mathbb{R}^d$, where every input $x$ (again, an image) corresponds to a (Gaussian) distribution in that latent space. A Gaussian distribution can be described by two vectors: the vector of means and the vector of variances (both of size $d$, since we are operating in $\mathbb{R}^d$). Then, the encoder can be defined as

\text{encoder}(x) = (\mu(x), \sigma(x)),

where $\mu$ and $\sigma$ are the mean and the variance for every dimension in $\mathbb{R}^d$. At this point, for each input, the encoder outputs two vectors, which describe the distribution in the latent space. The difference between encoding with the standard autoencoder and the VAE can be illustrated as follows:

A comparison between a standard autoencoder and VAE
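As a concrete sketch (not part of the original topic), an encoder for flattened 28×28 images might look like the following. The class name, layer sizes, and latent dimension are assumptions for illustration; the network outputs $\log \sigma^2$ rather than $\sigma$, a common convention for numerical stability.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input x to the parameters (mu, log_var) of a Gaussian in R^d."""

    def __init__(self, input_dim=28 * 28, hidden_dim=400, latent_dim=16):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)       # mean vector of size d
        self.to_log_var = nn.Linear(hidden_dim, latent_dim)  # log-variance vector of size d

    def forward(self, x):
        h = self.hidden(x)
        return self.to_mu(h), self.to_log_var(h)
```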

After the encoder has described the distribution, we sample a vector $z$ from it and pass it to the decoder:

\text{decoder}(z) = \tilde{x}

The decoded image $\tilde{x}$ should be similar to the input image $x$ for every vector sampled from the obtained distribution.

Another illustration of the VAE process
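A matching decoder sketch, again with purely illustrative names and layer sizes, simply maps a latent vector $z$ back to the input space:

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Maps a latent vector z back to a reconstruction x_tilde in the input space."""

    def __init__(self, latent_dim=16, hidden_dim=400, output_dim=28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, output_dim), nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, z):
        return self.net(z)
```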

The VAE objective

The VAE objective can be represented as follows:

\mathbb{E}_{z \sim q_w(z|x)}[\log p(x|z)] - D_\text{KL}(q_w(z|x)||p(z)) \rightarrow \max

Alternatively, since the loss is typically minimized, the VAE loss is given as

L = -\mathbb{E}_{z \sim q_w(z|x)}[\log p(x|z)] + D_\text{KL}(q_w(z|x)||p(z)),

where

  • $q_w(z|x)$ is the encoder, typically a fully connected network or a CNN. $q_w$ can be seen as the probability that $z$ is the embedding for $x$ in the latent space;

  • $p(x|z) = \text{decoder}(z) + \epsilon$ is the decoder, also typically a fully connected network or a CNN. $p$ can be seen as the probability that the decoder will decode $z$ into $x$;

  • $\mathbb{E}_{z \sim q_w(z|x)}[\log p(x|z)]$ can be interpreted as follows: $x$ is passed through the encoder, all possible representations in the latent space are generated, and the average reconstruction loss is calculated (at a high level, this term plays the role of the reconstruction loss);

  • $D_\text{KL}(q_w(z|x)||p(z))$ — the KL-divergence, which measures the difference between the distribution in the latent space and the standard normal distribution. To put it differently, we want $q_w(z|x)$ (the output of the encoder) to follow the standard normal distribution as closely as possible, so the KL-divergence acts as a regularizer in the objective.
    The KL-divergence here is used in the reverse mode, because the reverse KL does not produce zero gradients when the two distributions do not overlap. Since both $q_w(z|x)$ and $p(z)$ are Gaussian, this term has a simple closed form, used in the sketch after this list.
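Putting the two terms together, here is a minimal sketch of the loss. It assumes the encoder outputs $\log \sigma^2$, inputs scaled to $[0, 1]$, and a binary-cross-entropy reconstruction term as a single-sample approximation of $-\mathbb{E}[\log p(x|z)]$; the function name and framework choice are illustrative, not from the topic.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_tilde, mu, log_var):
    """Negative ELBO: reconstruction term plus KL(q_w(z|x) || N(0, I))."""
    # Reconstruction term: -E[log p(x|z)], approximated with a single sample of z
    # (binary cross-entropy corresponds to a Bernoulli decoder over pixels in [0, 1]).
    reconstruction = F.binary_cross_entropy(x_tilde, x, reduction="sum")

    # Closed-form KL divergence between N(mu, sigma^2 I) and N(0, I):
    # KL = -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

    return reconstruction + kl
```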

Below is a helpful illustration of what happens when the VAE is trained on each term of the objective separately and on the full objective:

The comparison between training on the reconstruction loss, KL-divergence, and the full VAE objective

When trained on the reconstruction loss only (the first image), the VAE is effectively reduced to a regular autoencoder. You can notice that there are quite a lot of empty regions in the latent space (the discreteness mentioned at the beginning of the topic; it might not look that discrete at first sight, mainly because the visualization is in 2D). Training only on the KL-divergence makes the latent space "explode", with no apparent structure in it. The combination of both terms, however, produces a denser latent representation.

The reparametrization trick

There is an important nuance when it comes to random sampling from the obtained distribution. During training, backpropagation has to be performed, but gradients cannot flow through a random sampling operation directly. The reparametrization trick is introduced to tackle this:

The reparametrization trick

Here, the randomness is moved into a separate part of the model that does not participate in backpropagation. The generator samples $\epsilon$ from the unit Gaussian, after which, given the vectors $\mu$ and $\sigma$, the sample is transformed into $z = \mu + \sigma \odot \epsilon$. No gradient flows through $\epsilon$, while the quantities that require optimization (namely, $\mu$, $\sigma$, and hence $z$) are now deterministic.

Gradients after the reparametrization trick
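A minimal sketch of the trick (again assuming the encoder outputs $\log \sigma^2$; the function name is illustrative): the randomness lives entirely in $\epsilon$, while $\mu$ and $\sigma$ stay on the differentiable path.

```python
import torch

def reparametrize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping gradients for mu and sigma."""
    sigma = torch.exp(0.5 * log_var)  # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(sigma)     # random noise; gradients do not flow through it
    return mu + sigma * eps
```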

Conclusion

As a result, you are now familiar with the general theoretical outline of variational autoencoders, their structure, and the intuition behind the reparametrization trick.
