# Teaching Variational Autoencoders

Trying to explain the fundamental concepts behind variational autoencoders made me realize something much deeper about learning and teaching.

It took me quite a while to deeply understand variational autoencoders (VAEs). I feel like many online tutorials only provide a shallow overview of the topic, get bogged down explaining the (many) details of how they work, and barely, if at all, touch the topic of what they actually are. They have nothing to do with autoencoders, really.

Assuming that one knows what an autoencoder (AE) is, the idea of a VAE can be reached easily by adding a fancy regularization term to the AE loss. The justification for this regularizer is to force the latent space to be Gaussian so that we can generate new samples that look like the ones in the training set.
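
As a rough sketch of this "autoencoder plus regularizer" view, the loss could be written as below. The squared-error reconstruction term, the diagonal Gaussian encoder output (`mu`, `log_var`), and the function names are all assumptions made for illustration, not something this post prescribes.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per example."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)

def vae_style_loss(x, x_reconstructed, mu, log_var):
    """Autoencoder reconstruction error plus the KL regularizer, batch-averaged."""
    reconstruction = np.sum((x - x_reconstructed) ** 2, axis=1)
    return np.mean(reconstruction + kl_to_standard_normal(mu, log_var))
```

The regularizer is zero exactly when the encoder outputs a standard normal (`mu = 0`, `log_var = 0`) and grows as the latent codes drift away from it.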

While all of this is true, it is kind of missing the forest for the trees. The
key ideas behind VAEs are *latent variable models* and *amortization*.
Everything else, including variational inference and even autoencoders, is just
noise. So many things in machine learning derive from latent variable models
that studying the derivation of the ELBO early is mostly a waste of time. One
should rather focus on the high level concepts and their connections.

## Is that actually true?

Right now I am wondering whether what I am saying is too reductionist, i.e., while true, maybe it is not very helpful for those who are still learning the topic. I am reminded of a (zen?) poem that I cannot find anymore but that goes something like:

A beginner sees a mountain.

An expert sees rocks.

A master sees a mountain.

I really like this poem because it is so true, simple and deep at the same time.
The beginner and the master have the same point of view but for opposite
reasons. The beginner only sees a mountain because, in their
inexperience, that’s all there is. The master sees a mountain because in their
experience all mountains are just the same. Experts are in that intermediate
stage where every mountain is a unique arrangement of rocks, and every rock has
a different shape than the next, and there are so many possible combinations. An
implication of the poem is that one cannot be a master without seeing all the
rocks first. **Even though mastery ends in the same place where it begins, the
journey makes all the difference.**

This happens in every discipline with a certain depth, including obviously
machine learning. It takes five minutes to understand generative models and
dimensionality reduction on a high level. But try to go under the surface and
you find autoencoders, variational autoencoders, conditional autoencoders,
principal component analysis, probabilistic principal component analysis,
non-linear principal component analysis, Gaussian process latent variable
models, Markov chains, hidden Markov models, normalizing flows, normalizing
flows *with* autoencoders, generative adversarial networks, Bayes nets, factor
analysis, Markov random fields, and so on, each of them with tens of variants
and with their own specific inference procedure, whether exact, approximate, or
whatnot. Only after studying all of this does one realize, deep in their heart, that
all of these are pieces of the same puzzle, or rocks of the same mountain.
Initially it’s only ignorance, at the end it’s knowledge.

That is why I felt this post was useless. Masters know this already. Experts are too busy looking at rocks and still see the mountain with the perspective of a beginner. Beginners are not yet aware of the presence of the rocks.

## Back to variational autoencoders

Since I was interrupted halfway through, I may as well finish what I started. So forget about autoencoders and think about latent variable models. We assume that every sample $x$ was generated starting from a latent variable $z$. How? No idea, so let’s train a deep neural network to figure that out.

Speaking probabilistically, we assume that $x$ is a sample from a fixed distribution $\mathcal{D}$ (usually Gaussian) whose location parameter is given by the neural network $f$ applied to the latent variable $z$, i.e.,

\[x\sim\mathcal{D}(f(z))\]

At this moment, we have no idea what $f$ and $z$ are. There are several
*inference* procedures to make this guess depending on what assumptions one is
willing to make. For example, when $f$ is known and not too complicated, the expectation maximization algorithm
could be an option. In this case, however, we opt for variational inference
(VI), a method that allows us to fit an approximate distribution of our choice
to $z$ even with complicated, possibly unknown, $f$. As is common, we assume
that the approximate distribution is a normal whose covariance is fixed rather
than learned, so that only its mean matters, i.e., $z\sim\mathcal{N}(\mu_z, 1)$,
where 1 is the identity matrix.
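
To make the generative story concrete, here is a minimal sketch of sampling from such a model. It assumes that $\mathcal{D}$ is a Gaussian with identity covariance and uses a tiny, untrained network with random weights (`W1`, `W2`) as a stand-in for $f$; the sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, hidden_dim, data_dim = 2, 16, 5

# Hypothetical, untrained weights standing in for the network f.
W1 = rng.normal(size=(latent_dim, hidden_dim))
W2 = rng.normal(size=(hidden_dim, data_dim))

def f(z):
    """Map latent variables to the location parameter of D."""
    return np.tanh(z @ W1) @ W2

def sample_x(mu_z, rng):
    """Draw z ~ N(mu_z, I), then x ~ N(f(z), I)."""
    z = mu_z + rng.normal(size=mu_z.shape)
    x = f(z) + rng.normal(size=(mu_z.shape[0], data_dim))
    return z, x

mu_z = np.zeros((4, latent_dim))
z, x = sample_x(mu_z, rng)
print(z.shape, x.shape)  # (4, 2) (4, 5)
```

Inference is the reverse problem: given only the `x` values, recover plausible values for `mu_z` and for the weights of `f`.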

A direct application of VI would keep $\mu_z$ as a vector of numbers
and adjust their values along with the weights of $f$ to best suit the dataset.
In theory this would work if all one cared about was the latent variables of the
dataset at hand. In practice, this may require lots of training examples to pull
off without overfitting. Hence, the second key idea of VAEs: *amortization*. It
is both simple and crazy at the same time: instead of keeping a separate $\mu$
for each example, let’s have another deep neural network predict it! Calling
this network $g$, we have:

\[\mu_z = g(x), \qquad z\sim\mathcal{N}(\mu_z, 1), \qquad x\sim\mathcal{D}(f(z))\]
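
The difference between direct and amortized VI can be sketched in a few lines. Everything here is an assumption for illustration: the prior $p(z)$ is a standard normal, $\mathcal{D}$ is a Gaussian with identity covariance, the networks are tiny and untrained, and the dataset `X` is random placeholder data. The training step itself, which in practice would be handled by an automatic-differentiation library, is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, data_dim, latent_dim, hidden_dim = 100, 5, 2, 16
X = rng.normal(size=(n_examples, data_dim))  # placeholder dataset

# Direct VI: one free mean vector per training example.
mu_table = np.zeros((n_examples, latent_dim))

# Amortized VI: a hypothetical network g shared by all examples.
G1 = rng.normal(size=(data_dim, hidden_dim)) * 0.1
G2 = rng.normal(size=(hidden_dim, latent_dim)) * 0.1

def g(x):
    """Predict mu_z from x."""
    return np.tanh(x @ G1) @ G2

# Hypothetical network f mapping z to the location of x.
F1 = rng.normal(size=(latent_dim, hidden_dim)) * 0.1
F2 = rng.normal(size=(hidden_dim, data_dim)) * 0.1

def f(z):
    return np.tanh(z @ F1) @ F2

def negative_elbo(x, mu_z, rng):
    """One-sample estimate of the negative ELBO, up to an additive constant.

    Assumes q(z) = N(mu_z, I), p(z) = N(0, I), and x ~ N(f(z), I).
    """
    z = mu_z + rng.normal(size=mu_z.shape)            # z ~ N(mu_z, I)
    reconstruction = 0.5 * np.sum((x - f(z)) ** 2, axis=1)
    kl = 0.5 * np.sum(mu_z ** 2, axis=1)              # KL(N(mu_z, I) || N(0, I))
    return np.mean(reconstruction + kl)

loss_direct = negative_elbo(X, mu_table, rng)     # parameters grow with the dataset
loss_amortized = negative_elbo(X, g(X), rng)      # parameter count is fixed

x_new = rng.normal(size=(3, data_dim))
mu_new = g(x_new)  # g handles unseen examples; mu_table has no row for them
```

The objective is the same in both cases; only the source of `mu_z` changes. With the table, the number of variational parameters grows with the dataset, while with `g` it stays fixed and new examples get a latent mean for free.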

Note that if you use linear transformations for $f$ and $g$ you get probabilistic PCA and you can find which matrix to use without using VI at all. Keeping linearity and using different assumptions on $p(z)$ leads to factor analysis, categorical PCA, canonical correlation analysis, and independent component analysis. Variational autoencoders pop up from using deep neural networks and VI. One could stick a normalizing flow at the end of $g$, perhaps make it auto-regressive to deal with sequence data. You could have an intermediate clustering step in the latent space, or, even better, have $g$ do the clustering by using a mixture distribution for $p(z|x)$. Different rocks, same mountain.
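
For the linear case, here is a sketch of the well-known closed-form maximum-likelihood solution of probabilistic PCA, computed from an eigendecomposition of the sample covariance. The function names are hypothetical, and note that this classical formulation also estimates a noise variance instead of fixing it to one.

```python
import numpy as np

def ppca_fit(X, latent_dim):
    """Closed-form maximum-likelihood probabilistic PCA.

    Returns the data mean, the loading matrix W (the linear f),
    and the isotropic noise variance. Requires latent_dim < X.shape[1].
    """
    mean = X.mean(axis=0)
    covariance = np.cov(X, rowvar=False, bias=True)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # ascending order
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # The noise variance is the average of the discarded eigenvalues.
    noise_var = eigenvalues[latent_dim:].mean()
    scale = np.sqrt(np.maximum(eigenvalues[:latent_dim] - noise_var, 0.0))
    W = eigenvectors[:, :latent_dim] * scale
    return mean, W, noise_var

def ppca_posterior_mean(X, mean, W, noise_var):
    """Posterior mean of z given x, which plays the role of a linear g."""
    M = W.T @ W + noise_var * np.eye(W.shape[1])
    return (X - mean) @ W @ np.linalg.inv(M)
```

No iterative optimization is involved: both the "decoder" `W` and the "encoder" in `ppca_posterior_mean` fall out of a single eigendecomposition.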

## Back to teaching

This was the frustration that motivated this post: you do not need to understand VI to understand VAEs. Explaining VI and VAE together makes things seem much more complicated than they actually are and hides connections to related topics.

Separating signal (latent variable models) from noise (variational inference),
or, better, high level from low level concepts, is quite hard to do when one is
getting into a new topic. This is maybe why this post is not so useless after
all: **experts can reach mastery faster if they are explicitly told what to
ignore at first**.