In many settings, the data we observe is generated by underlying causes that we are actually interested in. Because we do not observe these causes, we cannot model them directly. Instead, we model them as latent variables and use the observed data to learn them or infer their parameters.
Fig. 1: Graphical representation of latent variable models
Consider a latent variable model $p_\theta(x, z) = p_\theta(z)\,p_\theta(x \mid z)$. Here, $x$ is our observed data, and we assume that it comes from the latent variable $z$. In particular, the likelihood of $x$ is

$$p_\theta(x) = \sum_z p_\theta(z)\, p_\theta(x \mid z),$$

that is, we marginalize $z$ out. Denote $\pi_k := p_\theta(z = k)$ and $p_k(x) := p_\theta(x \mid z = k)$, where $\sum_{k=1}^K \pi_k = 1$. Consider now a dataset $\mathcal{D} = \{x_1, \dots, x_N\}$. Its log-likelihood is

$$\log p_\theta(\mathcal{D}) = \sum_{n=1}^N \log p_\theta(x_n) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k\, p_k(x_n).$$

We can see that the equation above is non-linear w.r.t. the parameters $\theta$ of the model because the logarithm sits outside the sum over components; hence, we do not have a closed-form solution for the maximum-likelihood estimate.
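To make the marginalization concrete, here is a minimal NumPy sketch (all parameter values are made up for illustration) that evaluates the log-likelihood of a tiny dataset under a two-component 1-D Gaussian mixture by explicitly summing over the latent variable:

```python
import numpy as np
from scipy.stats import norm

# Made-up parameters of a toy two-component 1-D Gaussian mixture.
pi = np.array([0.3, 0.7])        # mixing weights p(z = k)
mu = np.array([-2.0, 1.5])       # component means
sigma = np.array([0.5, 1.0])     # component standard deviations

x = np.array([-1.8, 0.2, 2.1])   # a tiny dataset

# p(x_n) = sum_k pi_k * p_k(x_n): marginalize the latent variable out.
per_component = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)  # shape (N, K)
log_likelihood = np.log(per_component.sum(axis=1)).sum()
print(log_likelihood)
```

Even for this simple model, maximizing this quantity w.r.t. the parameters has no closed form, which is what motivates the iterative procedure below.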
Expectation Maximization (EM)
To find the model's parameters $\theta$, we instead use an iterative procedure. It consists of two steps, namely expectation and maximization.
First, we consider the complete log-likelihood $\ell_c(\theta) = \log p_\theta(\mathcal{D}, Z) = \sum_{n=1}^N \log p_\theta(x_n, z_n)$. We then define the loss function

$$\mathcal{Q}(\theta, \theta^{\text{old}}) = \mathbb{E}_{Z \sim p_{\theta^{\text{old}}}(Z \mid \mathcal{D})}\big[\ell_c(\theta)\big],$$

where $\theta^{\text{old}}$ denotes the parameters from the previous iteration. We know that

$$p_\theta(x_n, z_n) = p_\theta(z_n)\, p_\theta(x_n \mid z_n).$$

Combining the two expressions yields

$$\mathcal{Q}(\theta, \theta^{\text{old}}) = \sum_{n=1}^N \sum_{k=1}^K p_{\theta^{\text{old}}}(z_n = k \mid x_n)\,\big[\log \pi_k + \log p_k(x_n)\big].$$

To maximize $\mathcal{Q}(\theta, \theta^{\text{old}})$, we alternate between
- Expectation step: compute the posterior $p_{\theta^{\text{old}}}(z_n = k \mid x_n)$ for every sample.
- Maximization step: find the model's parameters $\theta$ that maximize $\mathcal{Q}(\theta, \theta^{\text{old}})$; this can be done by setting $\frac{\partial \mathcal{Q}}{\partial \theta_k} = 0$ and solving for each parameter $\theta_k$.
It can be shown that the log-likelihood does not decrease at any iteration, that is, $\log p_{\theta^{\text{new}}}(\mathcal{D}) \ge \log p_{\theta^{\text{old}}}(\mathcal{D})$ (see the proof at the end of this article).
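In code, the whole procedure is a loop that alternates the two steps until the log-likelihood stops improving. Below is a minimal, model-agnostic sketch; `e_step`, `m_step`, and `log_likelihood` are hypothetical callbacks that would be filled in for a concrete model such as the mixtures discussed next:

```python
import numpy as np

def run_em(x, theta_init, e_step, m_step, log_likelihood, n_iters=100, tol=1e-6):
    """Generic EM loop (a sketch; the three callbacks are model-specific)."""
    theta = theta_init
    prev_ll = -np.inf
    for _ in range(n_iters):
        posterior = e_step(x, theta)      # E-step: posterior over the latent z
        theta = m_step(x, posterior)      # M-step: parameters maximizing Q
        ll = log_likelihood(x, theta)
        if ll - prev_ll < tol:            # stop once the log-likelihood plateaus
            break
        prev_ll = ll
    return theta
```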
Mixture of Gaussians
We assume that each data point $x_n \in \mathbb{R}^d$ is generated from some latent factor $z_n$. In particular, each sample's likelihood is

$$p_\theta(x_n) = \sum_{k=1}^K \pi_k\, p_k(x_n),$$

where the $\pi_k$'s are mixing weights and $\sum_{k=1}^K \pi_k = 1$. Here we assume that each component is a Gaussian distribution with parameters $(\mu_k, \Sigma_k)$; hence, we have

$$p_k(x_n) = \mathcal{N}(x_n \mid \mu_k, \Sigma_k).$$

Let $z_n \in \{0, 1\}^K$ be the true mixture assignment of $x_n$ (i.e., it is a one-hot vector whose entry is one at the cluster that $x_n$ belongs to). The complete log-likelihood is

$$\ell_c(\theta) = \sum_{n=1}^N \log p_\theta(x_n, z_n) = \sum_{n=1}^N \sum_{k=1}^K z_{nk}\,\big[\log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\big].$$

Let $k_n$ be the cluster index for $x_n$ (the index of the one entry of $z_n$), so that $z_{nk} = \mathbb{1}[k = k_n]$. We compute the expected complete log-likelihood w.r.t. the posterior $p_{\theta^{\text{old}}}(z_n \mid x_n)$:

$$\mathbb{E}\big[\ell_c(\theta)\big] = \sum_{n=1}^N \sum_{k=1}^K p_{\theta^{\text{old}}}(z_{nk} = 1 \mid x_n)\,\big[\log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\big].$$

Define the responsibility (or posterior)

$$r_{nk} := p_{\theta^{\text{old}}}(z_{nk} = 1 \mid x_n) = \frac{\pi_k^{\text{old}}\, \mathcal{N}(x_n \mid \mu_k^{\text{old}}, \Sigma_k^{\text{old}})}{\sum_{j=1}^K \pi_j^{\text{old}}\, \mathcal{N}(x_n \mid \mu_j^{\text{old}}, \Sigma_j^{\text{old}})}.$$

We arrive at

$$\mathcal{Q}(\theta, \theta^{\text{old}}) = \sum_{n=1}^N \sum_{k=1}^K r_{nk}\,\big[\log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\big].$$
Due to the fact that

$$\log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) = -\frac{1}{2}\log\big|2\pi\Sigma_k\big| - \frac{1}{2}(x_n - \mu_k)^\top \Sigma_k^{-1}(x_n - \mu_k),$$

we can then write

$$\mathcal{Q}(\theta, \theta^{\text{old}}) = \sum_{n=1}^N \sum_{k=1}^K r_{nk}\,\log \pi_k \;-\; \frac{1}{2}\sum_{n=1}^N \sum_{k=1}^K r_{nk}\Big[\log\big|2\pi\Sigma_k\big| + (x_n - \mu_k)^\top \Sigma_k^{-1}(x_n - \mu_k)\Big].$$

The first term depends only on the mixing weights and the second only on the Gaussian parameters, so we can maximize them separately.
We now compute the parameters $\pi_k$, $\mu_k$, and $\Sigma_k$ that maximize $\mathcal{Q}(\theta, \theta^{\text{old}})$.
We know that $\sum_{k=1}^K \pi_k = 1$. Using a Lagrange multiplier, we get

$$J(\pi, \lambda) = \sum_{n=1}^N \sum_{k=1}^K r_{nk}\,\log \pi_k + \lambda\Big(1 - \sum_{k=1}^K \pi_k\Big).$$

Computing $\frac{\partial J}{\partial \pi_k}$ and setting it to zero yields

$$\frac{1}{\pi_k}\sum_{n=1}^N r_{nk} - \lambda = 0 \quad\Longrightarrow\quad \pi_k = \frac{1}{\lambda}\sum_{n=1}^N r_{nk}.$$

Combining this with the constraint, we have

$$\sum_{k=1}^K \frac{1}{\lambda}\sum_{n=1}^N r_{nk} = 1 \quad\Longrightarrow\quad \lambda = \sum_{n=1}^N \sum_{k=1}^K r_{nk} = N.$$

Hence, $\pi_k = \frac{1}{N}\sum_{n=1}^N r_{nk} = \frac{N_k}{N}$, where $N_k := \sum_{n=1}^N r_{nk}$ is the effective number of samples assigned to component $k$.
Recall that $\log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) = -\frac{1}{2}\log|2\pi\Sigma_k| - \frac{1}{2}(x_n - \mu_k)^\top \Sigma_k^{-1}(x_n - \mu_k)$; hence, the terms of $\mathcal{Q}$ that involve $\mu_k$ are

$$-\frac{1}{2}\sum_{n=1}^N r_{nk}\,(x_n - \mu_k)^\top \Sigma_k^{-1}(x_n - \mu_k).$$

Taking the derivative w.r.t. $\mu_k$ we get

$$\frac{\partial \mathcal{Q}}{\partial \mu_k} = \sum_{n=1}^N r_{nk}\,\Sigma_k^{-1}(x_n - \mu_k).$$

Setting it to zero yields

$$\mu_k = \frac{1}{N_k}\sum_{n=1}^N r_{nk}\, x_n.$$
For $\Sigma_k$, taking the derivative of the log-determinant term yields

$$\frac{\partial}{\partial \Sigma_k}\Big(-\frac{1}{2}\sum_{n=1}^N r_{nk}\,\log\big|2\pi\Sigma_k\big|\Big) = -\frac{1}{2} N_k\, \Sigma_k^{-1}.$$

We also have that

$$\frac{\partial}{\partial \Sigma_k}\Big(-\frac{1}{2}\sum_{n=1}^N r_{nk}\,(x_n - \mu_k)^\top \Sigma_k^{-1}(x_n - \mu_k)\Big) = \frac{1}{2}\sum_{n=1}^N r_{nk}\,\Sigma_k^{-1}(x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}.$$

Combining the two terms and setting the derivative to zero, we arrive at

$$-\frac{1}{2} N_k\, \Sigma_k^{-1} + \frac{1}{2}\sum_{n=1}^N r_{nk}\,\Sigma_k^{-1}(x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1} = 0.$$

Therefore, we have

$$\Sigma_k = \frac{1}{N_k}\sum_{n=1}^N r_{nk}\,(x_n - \mu_k)(x_n - \mu_k)^\top.$$
Fig. 2: Expectation-Maximization steps for a mixture of Gaussians
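Putting the E-step and the three closed-form M-step updates together, a compact NumPy sketch of EM for a mixture of Gaussians could look like the following (a toy implementation: in practice one would compute responsibilities in log-space and guard against singular covariances more carefully than the small ridge used here):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a K-component Gaussian mixture; X has shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]          # init means from data
    Sigma = np.stack([np.cov(X.T).reshape(d, d) + 1e-6 * np.eye(d)] * K)

    for _ in range(n_iters):
        # E-step: responsibilities r_nk proportional to pi_k * N(x_n | mu_k, Sigma_k).
        dens = np.stack([multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                         for k in range(K)], axis=1)       # (N, K)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)

        # M-step: the closed-form updates derived above.
        Nk = r.sum(axis=0)                                 # effective counts N_k
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma
```

Calling it is as simple as `pi, mu, Sigma = em_gmm(X, K=3)` for data `X` of shape `(N, d)`; like EM in general, the result depends on the initialization.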
K-Means Algorithm: Special Case of GMMs
One can view K-means as a special case of GMMs that greedily chooses the closest centroid for each $x_n$. In this case, we assume that $\pi_k = \frac{1}{K}$ and $\Sigma_k = \sigma^2 I$ for all $k$. The posterior is

$$p(z_{nk} = 1 \mid x_n) = \frac{\exp\!\big(-\frac{1}{2\sigma^2}\lVert x_n - \mu_k\rVert^2\big)}{\sum_{j=1}^K \exp\!\big(-\frac{1}{2\sigma^2}\lVert x_n - \mu_j\rVert^2\big)},$$

where $\lVert\cdot\rVert$ is the Euclidean norm. Equivalently, in the limit $\sigma^2 \to 0$ this is similar to

$$r_{nk} = \begin{cases} 1 & \text{if } k = \operatorname*{arg\,min}_j \lVert x_n - \mu_j\rVert^2, \\ 0 & \text{otherwise.} \end{cases}$$

As you can see from the derivation, this is a hard assignment of a cluster to each sample. Let $\mathcal{X}_k$ be the set of samples that belong to the $k$-th cluster. The maximization step is then just

$$\mu_k = \frac{1}{|\mathcal{X}_k|}\sum_{x_n \in \mathcal{X}_k} x_n.$$
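A minimal sketch of this hard-assignment view (initializing the centroids from randomly chosen data points, which is one common but not the only choice):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """K-means as hard-assignment EM; X has shape (N, d)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initial centroids
    for _ in range(n_iters):
        # "E-step": hard-assign each sample to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)  # (N, K)
        assign = dists.argmin(axis=1)
        # "M-step": each centroid becomes the mean of the samples assigned to it.
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                           else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):                      # assignments have settled
            break
        mu = new_mu
    return mu, assign
```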
Mixture of Binomials
Consider observations $x_n \in \{0, 1, \dots, M\}$, each counting the number of successes in $M$ trials, and assume that each trial comes from a Bernoulli distribution with success probability $\mu$; hence

$$p(x_n \mid \mu) = \mathrm{Bin}(x_n \mid M, \mu) = \binom{M}{x_n}\,\mu^{x_n}(1 - \mu)^{M - x_n},$$

where $\mu \in [0, 1]$. Our setting has two binomial mixture components, namely

$$p_1(x_n) = \mathrm{Bin}(x_n \mid M, \mu_1) \quad\text{and}\quad p_2(x_n) = \mathrm{Bin}(x_n \mid M, \mu_2),$$

where $\mu_1, \mu_2 \in [0, 1]$. In our setting, we have $\theta = (\pi_1, \pi_2, \mu_1, \mu_2)$ with $\pi_1 + \pi_2 = 1$. Let $z_n$ again be the one-hot mixture assignment of $x_n$. Recall the complete log-likelihood:

$$\ell_c(\theta) = \sum_{n=1}^N \sum_{k=1}^2 z_{nk}\,\big[\log \pi_k + \log p_k(x_n)\big].$$
We first recall $p_1(x_n) = \mathrm{Bin}(x_n \mid M, \mu_1)$ and write

$$\log p_1(x_n) = \log\binom{M}{x_n} + x_n \log \mu_1 + (M - x_n)\log(1 - \mu_1).$$

Similarly, we have

$$\log p_2(x_n) = \log\binom{M}{x_n} + x_n \log \mu_2 + (M - x_n)\log(1 - \mu_2).$$

Recall that the expected complete log-likelihood is $\mathcal{Q}(\theta, \theta^{\text{old}}) = \sum_{n=1}^N \sum_{k=1}^2 r_{nk}\,\big[\log \pi_k + \log p_k(x_n)\big]$, where the responsibilities $r_{nk}$ are computed with the old parameters in the expectation step. We compute the derivative of $\mathcal{Q}$ w.r.t. the parameters $\mu_k$ and set it to zero:

$$\frac{\partial \mathcal{Q}}{\partial \mu_k} = \sum_{n=1}^N r_{nk}\Big[\frac{x_n}{\mu_k} - \frac{M - x_n}{1 - \mu_k}\Big] = 0.$$

Therefore, we have

$$\mu_k = \frac{\sum_{n=1}^N r_{nk}\, x_n}{M\sum_{n=1}^N r_{nk}} = \frac{1}{M N_k}\sum_{n=1}^N r_{nk}\, x_n.$$

Similarly, we have $\pi_k = \frac{N_k}{N}$, just as in the Gaussian case.
Fig. 3: Expectation-Maximization steps for a mixture of binomials
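Following the figure above, a minimal NumPy sketch of the E- and M-steps for the two-component case (the initial success probabilities are made up) could be:

```python
import numpy as np
from scipy.stats import binom

def em_binomial_mixture(x, M, n_iters=100):
    """EM for a two-component binomial mixture; x holds success counts out of M."""
    pi = np.array([0.5, 0.5])
    mu = np.array([0.3, 0.7])       # made-up initial success probabilities
    for _ in range(n_iters):
        # E-step: responsibilities r_nk proportional to pi_k * Bin(x_n | M, mu_k).
        dens = pi * binom.pmf(x[:, None], M, mu)         # (N, 2)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: the closed-form updates derived above.
        Nk = r.sum(axis=0)
        pi = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / (M * Nk)
    return pi, mu
```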
These are the resources that I consulted while writing this article:
- Murphy (2012), "Machine Learning: A Probabilistic Perspective" (Section 11.4), for the general idea of the EM algorithm.
- Manfred Opper's Probabilistic Bayesian Modeling course (TU Berlin, Summer 2020), lecture on the Laplace approximation, for the proof of the EM step.
- Wojciech Samek's Machine Learning 1 course (TU Berlin, Winter 2020), lecture on latent variable models, for the Mixture of Binomials example.
This article's figures were made in Google Colab.
Proof: the EM procedure does not decrease the log-likelihood
In the following, we will show that $\log p_{\theta^{\text{new}}}(x) \ge \log p_{\theta^{\text{old}}}(x)$, where $\theta^{\text{new}}$ is the result of one EM update starting from $\theta^{\text{old}}$.
Let $q(z)$ be any distribution over the latent variable $z$. We know that $p_\theta(x) = \frac{p_\theta(x, z)}{p_\theta(z \mid x)}$. We can then compute

$$\log p_\theta(x) = \mathbb{E}_{q(z)}\big[\log p_\theta(x)\big] = \underbrace{\mathbb{E}_{q(z)}\Big[\log \frac{p_\theta(x, z)}{q(z)}\Big]}_{=:\, \mathcal{L}(q, \theta)} + \underbrace{\mathbb{E}_{q(z)}\Big[\log \frac{q(z)}{p_\theta(z \mid x)}\Big]}_{=\, \mathrm{KL}(q(z)\,\|\,p_\theta(z \mid x))}.$$

Because $\mathrm{KL}\big(q(z)\,\|\,p_\theta(z \mid x)\big) \ge 0$, we have

$$\log p_\theta(x) \ge \mathcal{L}(q, \theta).$$

Now consider $q(z) = p_{\theta^{\text{old}}}(z \mid x)$. We have

$$\log p_{\theta^{\text{old}}}(x) = \mathcal{L}\big(p_{\theta^{\text{old}}}(z \mid x), \theta^{\text{old}}\big),$$

because the KL term vanishes for this choice of $q$.
Consider the fact that the decomposition above holds for any $\theta$ with this fixed $q$. We have two conditions, namely

$$\log p_{\theta^{\text{new}}}(x) = \mathcal{L}(q, \theta^{\text{new}}) + \mathrm{KL}\big(q \,\|\, p_{\theta^{\text{new}}}(z \mid x)\big) \quad\text{and}\quad \log p_{\theta^{\text{old}}}(x) = \mathcal{L}(q, \theta^{\text{old}}).$$

Subtracting these two conditions leads to

$$\log p_{\theta^{\text{new}}}(x) - \log p_{\theta^{\text{old}}}(x) = \underbrace{\mathcal{L}(q, \theta^{\text{new}}) - \mathcal{L}(q, \theta^{\text{old}})}_{\ge\, 0 \;\;(*)} + \underbrace{\mathrm{KL}\big(q \,\|\, p_{\theta^{\text{new}}}(z \mid x)\big)}_{\ge\, 0}.$$

In other words, we have

$$\log p_{\theta^{\text{new}}}(x) \ge \log p_{\theta^{\text{old}}}(x).$$

$(*)$ is due to the fact that the maximization step maximizes $\mathcal{L}(q, \theta)$ w.r.t. $\theta$: with $q = p_{\theta^{\text{old}}}(z \mid x)$ fixed, $\mathcal{L}(q, \theta)$ equals $\mathcal{Q}(\theta, \theta^{\text{old}})$ plus the entropy of $q$, which does not depend on $\theta$.
This shows that the EM procedure never decreases the data log-likelihood. However, the procedure is not guaranteed to reach the global optimum.
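As a quick numerical sanity check of this guarantee, the following sketch (a toy 1-D two-component Gaussian mixture with made-up parameters and a deliberately poor initialization) runs EM and verifies that the data log-likelihood never decreases across iterations:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Toy data drawn from a two-component 1-D mixture with made-up parameters.
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(1.5, 1.0, 300)])

# Deliberately poor initialization.
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-0.5, 0.5]), np.array([1.0, 1.0])

lls = []
for _ in range(50):
    # E-step: responsibilities (and the current log-likelihood as a by-product).
    dens = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)   # (N, 2)
    lls.append(np.log(dens.sum(axis=1)).sum())
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form updates.
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

# The recorded log-likelihoods should be non-decreasing (up to rounding).
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
print(f"log-likelihood went from {lls[0]:.2f} to {lls[-1]:.2f}")
```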