Expectation Maximization and Mixture Models

March 13, 2021

In many settings, we have data that is generated by some underlying causes that we are actually interested in. Because we do not observe these causes, we cannot model them directly. Instead, we model them as latent variables and use the data we have to learn them or infer their parameters.

Fig. 1: Graphical representation of latent variable models

Consider a latent variable model: $Z \rightarrow X$ . Here, $X$ is our observed data, and we assume that it comes from the latent variable $Z$. In particular, the likelihood for $X$ is

\begin{aligned} p(X|\theta) &= \sum_{z} p(X, Z=z| \theta) \\ &= \sum_{z} {p(X| Z=z, \theta)} p(Z=z| \theta), \end{aligned}

that is we marginalize Z out. Denote $p(z) := p(Z=z| \theta)$ and $p( x|z, \theta) = p(X=x | Z = z, \theta)p(x|z, \theta) = p(X=x | Z = z, \theta)$ where $x \in \mathcal X$. Consider now a dataset $\mathcal D = \{ x_i \}_{i=1}^n$. Its log-likelihood is

\begin{aligned} \log p(\mathcal D| \theta) &= \sum_{i=1}^n \log p(x_i | \theta) \\ &= \sum_{i=1}^n \log \bigg( \sum_{z} p(x| z,\theta) p(z) \bigg). \end{aligned}

We can see that the equation above is non-linear w.r.t. to the parameters of the model $\theta$; hence, we do not have a closed-form solution.

Expectation Maximization (EM)

To find the model's parameters $\theta$ , we instead use an iterative procedure. It consists of two steps, namely expectation and maximization.

First, we consider the complete log likelihood $\log p(\mathcal D, z | \theta)$. We then define the loss function

\begin{aligned} \mathcal{L}(\theta, \theta^t) &:= \mathbb{E}_Z [ \log p(\mathcal D, Z| \theta) \\ &= \sum_{z} p(z| \mathcal D, \theta^t) \log p(\mathcal D, z | \theta), \end{aligned}

where $\theta_t$ is the parameters of the previous iteration. We know that

\begin{aligned} p(\mathcal D , z | \theta) &= \prod_{i=1}^n p(x_i, z| \theta) \\ &= \prod_{i=1}^n p( x_i|z, \theta) p(z). \end{aligned}

Combining $\mathcal L (\theta, \theta_t)$ and $p(\mathcal D, z| \theta)$ yields

\begin{aligned} \mathcal{L}(\theta, \theta^t) &= \sum_z p(z | \mathcal D, \theta^t) \log \bigg( \prod_{i=1}^n p(x_i|z,\theta) p(z) \bigg) \\ &= \sum_x p(z|\mathcal D, \theta^t) \sum_{i=1}^n \bigg[ \log p(x_i|z, \theta) + \log p(z) \bigg]. \end{aligned}

To maximize $\mathcal L(\theta, \theta^t)$, we alternate between

1. Expectation step computing the posterior $p(z| \mathcal D, \theta^t), \forall z \in \mathcal Z$.
2. Maximization step finding the model's parameters $\theta$ that maximize $\mathcal L (\theta, \theta^t)$; this can be done by $\nabla_\theta \mathcal L(\theta, \theta^t) \overset{!}{=} 0$ and solve for each $\theta$.

It can be shown that the loss function increases every step; that is $\mathcal{L}(\theta, \theta^{t+1}) \ge \mathcal{L}(\theta, \theta^t)$.

Mixture Models

Mixture of Gaussians

We assume that our data $\mathcal D = \{ \mathbf x_i \in \reals^d \}_{i=1}^n$ is generated from some latent factor $z \in [1, K]$. In particular, each sample's likelihood is

$p(\mathbf x_i | \theta) = \sum_{k=1}^K \underbrace{p(z_i=k)}_{=:\pi_k} p(\mathbf x_i| \theta _k),$

where $\pi_k$'s are mixing weights and $\sum_k \pi_k = 1$. Here we assume that each component is a Gaussian distribution with $\theta_k= \{ \mu_k, \Sigma_k \}$ ; hence, we have

$p(\mathbf x | \theta_k ) = \mathcal N ({\mathbf x}| \mathbf \mu_k, \Sigma_k).$

Consider $\mathbf z_i \in \{0, 1\}^K$ be the true mixture assignment of $\mathbf x_i$ (i.e., it is a one-hot vector whose is one at the cluster that $\mathbf x_i$ belongs to). The complete log likelihood is

\begin{aligned} l_c(\theta) &:= \sum_{i=1}^n \log p(\mathbf x_i, \mathbf{z}_i | \theta) \end{aligned}

Let $z_i$ be the cluster index for $\mathbf x_i$ (the index of the one entry). We compute the expected complete log likelihood w.r.t to the approximate posterior $q(z|\mathbf x, \theta^{t})$:

\begin{aligned} \mathcal{L}(\theta, \theta^{t}) &= \mathbb{E}_{q(z|\mathbf x, \theta^{t})}\bigg[ \sum_{i=1}^n \log p(\mathbf x_i, Z_i|\theta)\bigg] \\ &= \sum_{i=1}^n \mathbb{E}_{q(z|\mathbf x, \theta^{t})} \bigg[ \log p(\mathbf x_i, Z_i | \theta) \bigg] \\ &= \sum_{i=1}^n \mathbb{E}_{q(z|\mathbf x, \theta^{t})} \bigg[ \log \prod_{k=1}^K ( p(\mathbf x_i | \theta_k ) \pi_k )^{1[Z_i = k ]}\bigg] \\ &= \sum_{i=1}^n \mathbb{E}_{q(z|\mathbf x, \theta^{t})} \bigg[ \sum_k {1[z_i = k ]} \log p(\mathbf x_i | \theta_k ) \pi_k \bigg] \\ &= \sum_{i=1}^n \sum_k \mathbb{E}_{q(z|\mathbf x, \theta^{t})} {1[Z_i = k ]} \log p(\mathbf x_i | \theta_k ) \pi_k \\ &= \sum_{i=1}^n \sum_k p(z_i=k| \mathbf x_i, \theta^{t}) \log p(\mathbf x_i | \theta_k ) \pi_k \end{aligned}

Define the responsibility (or posterior) $r_{ik} := p(z_i=k|\mathbf x_i, \theta^{t})$. We arrive at

$\mathcal{L}(\theta, \theta^{t}) = \sum_{i=1}^n \sum_{k=1}^K r_{ik} [\log p(\mathbf x_i | \theta_k) + \log \pi_k]$

Expectation Step

Due to the fact that

$p(Z|\mathbf x, \theta^{t}) = \frac{p(\mathbf x| Z, \theta^{t})}{p(\mathbf x | \theta^{t})},$

we can then write

$r_{ik} = \frac{\pi_k p(\mathbf x_i | \theta_k^{t})}{\sum_{k'} \pi_{k'} p(\mathbf x_i | \theta_{k'}^{t})}.$

Maximization Step

We compute $\forall _k, \pi_k, \theta_k := (\mu_k, \Sigma_k)$ that maximize $\mathcal{L}(\theta, \theta^{t})$.

Updating $\pi_k$

We know that $\sum_k \pi_k = 1$. Using Lagrange's multiplier, we get

$L = \mathcal{L} (\theta, \theta^{t}) - \lambda \bigg(\sum_k \pi_k - 1\bigg)$

Computing $\frac{\partial L}{\partial \pi_k}$ and set it to zero yields

$\sum_i \frac{r_{ik}}\lambda = \pi_k.$

Combing this into the constraint, we have

\begin{aligned} \sum_k \sum_i r_{ik} &= \lambda \\ \sum_i \underbrace{\sum_k r_{ik}}_{=1} &= \lambda \end{aligned}

Hence, $\lambda = N$ and $\pi_k = \frac{1}{N} \sum_i r_{ik}$.

Updating $\mu_k$

Recall that $\mathcal{N}(\mathbf x| \mu_k, \Sigma_k) = \frac{1}{(\sqrt{(2\pi)^d|\Sigma_k|}) }\exp\bigg(-\frac{1}{2}(\mathbf x - \mu_k)^T\Sigma_k^{-1}(\mathbf x - \mu_k)\bigg)$; hence, we have

\begin{aligned} \mathcal{L}(\theta, \theta^{t-1}) = \sum_i \sum_k r_{ik}\bigg[ \log \pi_k + \log \bigg( \frac{1}{\sqrt{(2\pi)^d|\Sigma_k|}} \bigg) + \frac{1}{2} (\mathbf x_i - \mu_k)^T\Sigma_k^{-1}(\mathbf x_i - \mu_k)\bigg]. \end{aligned}

Taking the derivative w.r.t $\mu_k$ we get

$\nabla_{\mu_k} \mathcal L (\theta, \theta^{t-1}) = \sum_i r_{ik} \Sigma_k^{-1} (\mathbf x_i - \mu_k).$

Setting it to zero yields

$\mu_k = \frac{\sum_i r_{ik} \mathbf x_i}{\sum_i r_{ik}}$

Updating $\Sigma_k$

Taking the derivative yields

\begin{aligned} \nabla_{\Sigma_k} \log \frac{1}{\sqrt{{(2\pi)^d} |\Sigma_k|} } &= \sqrt{\cancel{(2\pi)^d} |\Sigma_k|} \frac{1}{\cancel{\sqrt{(2\pi)^d}}} \nabla_{\Sigma_k} |\Sigma_k|^{-1/2} \\ &= \cancel{\sqrt{|\Sigma_k|}} \bigg[ -\frac{1}{2} \cancel{(\sqrt{|\Sigma_k|})^{-1}} \Sigma_k^{-1} \bigg] \\ &= - \frac{\Sigma_k^{-1}}{2}. \end{aligned}

We also have that

$\nabla_{\Sigma_k} (\mathbf x - \mu_k) \Sigma_k^{-1} (\mathbf x - \mu_k)= -\Sigma_k^{-1}(\mathbf x - \mu_k)(\mathbf x - \mu_k)^T \Sigma_k^{-1}.$

Combining the two terms and set the derivative to zero, we arrive at

\begin{aligned} \sum_i \frac{r_{ik} \Sigma_k^{-1}}{2} &= \sum_i \frac{r_{ik}}{2} \Sigma_k^{-1}(\mathbf x_i - \mu_k)(\mathbf x_i - \mu_k)^T \Sigma_k^{-1} \\ \Sigma_k^{-1}\sum_i r_{ik} &= \Sigma_k^{-1} \sum_i r_{ik} (\mathbf x_i - \mu_k)(\mathbf x_i - \mu_k)^T \Sigma_k^{-1}. \end{aligned}

Therefore, we have

$\Sigma_k = \frac{\sum_i r_{ik}(\mathbf x_i - \mu_k)(\mathbf x_i - \mu_k)^T}{\sum_i r_{ik}}.$
Fig. 2: Expectation-Maximization steps for a mixture of Gaussians

K-Means Algorithm: Special Case of GMMs

One can view K-means as a special case of GMMs that greedily chooses the closest centroid for each $\mathbf x_i$ . In this case, we assume that $\pi_k = \frac{1}{K}$ and $\Sigma_k = \sigma^2I_d$ for $\sigma \in \R_+$. The posterior is

$p(z_i = k | \mathbf x_i, \theta) = 1[k = z_i^*],$

where $z^*_i = \argmax_k p(z_i =k| \mathbf x_i, \theta)$. Equivalently, this is similar to

$z_i^* = \argmin_k \| \mathbf x_i - \mu_k \|^2.$

As you can see from the derivation, this is a hard assignment assigning a cluster to the sample. Consider $\mathcal D_k \subset \mathcal D$ be the samples belong to the $k$-th cluster. the maximization step is then just

$\mu_k = \frac{1}{|\mathcal D_k|} \sum_{\mathbf x \in \mathcal D_k} \mathbf x$

Mixture of Binomials

Consider $z \in\{ H, T\}$ and we assume that it comes from a Bernoulli distribution:

\begin{aligned} p(z=H | \theta) &= \pi_H \\ p(z=T | \theta) &= \pi_T, \end{aligned}

where $\pi_H + \pi_T = 1$. Our setting has two Binomial mixture components, namely

\begin{aligned} p(x | z=H, \theta) &= {m \choose x} \mu_H^m (1-\mu_H)^{m-x} \\ p(x | z=T, \theta) &= {m \choose x} \mu_T^m (1-\mu_T)^{m-x}, \end{aligned}

where $x \in [0, m]$. In our setting, we have $\theta := \{ \pi_H, \pi_T, \mu_H, \mu_T\}$. Let $\mathcal{D} = (x_1, \dots, x_n).$ Recall the complete log-likelihood:

\begin{aligned} \mathcal{L}(\theta, \theta^{t}) = \sum_{i=1}^n \sum_{z=H, T} r_{iz} [ \log \pi_z + \log p(x_i|z, \theta_z) ]. \end{aligned}

Expectation Step

We first recall and write

\begin{aligned} r_{iH} &= \frac{\pi_H p(x_i | \theta_H^t)}{\sum_{z=H,T} \pi_z p(x_i | \theta_z^t) } \\ &= \frac{\pi_H \mu_H^m ( 1- \mu_H)^{m-x_i} }{\sum_{z=H,T} \pi_z p(x_i | \theta_z^t) }. \end{aligned}

Similarly, we have

\begin{aligned} r_{iH} &= \frac{\pi_T \mu_T^m ( 1- \mu_T)^{m-x_i} }{\sum_{z=H,T} \pi_z p(x_i | \theta_z^t) }. \end{aligned}

Maximization Step

Recall that $\theta = \{ \pi_H, \pi_T, \mu_H, \mu_T\}$. We compute the derivative of $\mathcal{L}( \theta, \theta^{t})$ w.r.t the parameters and set to zero.

Updating $\pi_H$ and $\pi_T$

\begin{aligned} \frac{\partial \mathcal{L}(\theta, \theta^{t})}{\partial \pi_H } &= \sum_{i=1}^n r_{iH} \frac{\partial}{\partial \pi_H} \bigg [\log \pi_H \bigg] + r_{iT} \frac{\partial}{\partial \pi_H} \bigg [\log 1-\pi_H \bigg] \\ &= \sum_{i=1}^n \frac{r_{iH}}{\pi_H} - \frac{r_{iT}}{1-\pi_H} \\ &\overset{!}{=} 0 \end{aligned}

We get

\begin{aligned} (1-\pi_H)\sum_{i=1}^n r_{iH} &= \lambda\sum_{i=1}^n r_{iT} \\ \sum_{i=1}^n r_{iH} &= \lambda\bigg(\sum_{i=1}^n \underbrace{r_{iT} + r_{iH}}_1\bigg) \\ \pi_H &= \frac 1 n \sum_{i=1}^n r_{iH}, \end{aligned}

and $\pi_T = 1 - \pi_H$.

Updating $\mu_H$ and $\mu_T$

\begin{aligned} \frac{\partial \mathcal{L}(\theta, \theta^{t-1})}{\partial \mu_H} &= \sum_{i=1}^n r_{iH} \frac {\partial } {\partial \mu_H}\bigg( \log {m \choose x_i} + x_i \log \mu_H + (m-x_i)\log(1-\mu_H) \bigg ) \\ &= \sum_{i=1}^n r_{iH} \bigg(\frac{x_i}{\mu_H} - \frac{m-x_i}{(1-\mu_H)} \bigg) \\ &\overset{!}{=} 0 \end{aligned}

Therefore, we have

\begin{aligned} \sum_{i=1}^n \frac{r_{iH}x_i}{\mu_H} &= \sum_{i=1}^n \frac{r_{iH}(m-x_i)}{(1-\mu_H)} \\ (1-\mu_H) \sum_{i=1}^n {r_{iH}x_i} &= \mu_H \sum_{i=1}^n r_{iH}(m-x_i) \\ \mu_H &= \frac{\sum_{i=1}^n {r_{iH}x_i} }{\sum_{i=1}^n {r_{iH}m} }. \end{aligned}

Similarly, we have $\mu_T = \frac{\sum_{i=1}^n {r_{iT}x_i} }{\sum_{i=1}^n {r_{iT}m} }$.

Fig. 3: Expectation-Maximization steps for a mixture of binomials

Conclusions

• Murphy (2012), "Machine Learning: A Probabilistic Perspective" (Section 11.4) for the general idea of the EM algorithm.
• Manfred Opper's Probabilistic Bayesian Modeling course, Lecture Laplace Approximation (TU Berlin, Summer 2020) for Proof of the EM step
• Woijeck Samek's Machine Learning 1 course, Lecture Latent Variable Models (TU Berlin, Winter 2020) for the Mixture of Binomials example

Proof: the EM procedure increases the expected complete log-likelihood

In the following, we will show that $\mathcal{L}(\theta, \theta^{t+1}) \ge \mathcal{L}(\theta, \theta^t)$.

Consider $q(\cdot)$ be any distribution over the latent variable $Z$. We know that $p(\mathcal D, z| \theta) = p(z|\mathcal D, \theta) p(\mathcal D | \theta)$ . We can then compute

\begin{aligned} \text{KL} ( q(Z) \| p(Z |\mathcal D, \theta)) &= \sum_{z} q(x) \log \frac{q(z)}{p(z|\mathcal D, \theta)}\\ &= \sum_{z} q(z) \log \frac{q(z) p(\mathcal D | \theta)}{p(z, \mathcal D| \theta)} \end{aligned}

Because $\text{KL}(\cdot \| \cdot) \ge 0$, we have

\begin{aligned} \underbrace{\sum_z q(z) \log \frac{q(z)}{p(z, \mathcal D | \theta)}}_{=: F(q, \theta) } \ge - \log p(\mathcal D | \theta) \end{aligned}

Now consider $q_t(z) := p(z | \mathcal D, \theta^t)$. We have

\begin{aligned} \mathcal{L}(\theta, \theta^t) &= \sum_z q_t(z) \log p(z, \mathcal D | \theta) \\ &= \sum_z q_t(z) \log p(z, \mathcal D | \theta) \frac{q_t(z) }{q_t(z)} \\ &= - F(q_t, \theta) + \sum_z q_t(z) \log q_t(z) \end{aligned}

Consider the fact that $F(q_t, \theta) \ge - \log p(\mathcal D| \theta)$. We have two cases, namely

1. $F(q_t, \theta) > - \log p(\mathcal D|\theta)$
2. $F(q_t, \theta) = - \log p( \mathcal D | \theta) \implies F(q_t, \theta_t) = - \log p(\mathcal D |\theta_t)$.

Subtract these two conditions lead to

$F(q_t, \theta) - F(q_t, \theta_t) \ge - \log p(\mathcal D | \theta) + \log p(\mathcal D| \theta_t)$

In other words, we have

\begin{aligned} \log p(\mathcal D | \theta) - \log p(\mathcal D | \theta^t) &> - F(q_t, \theta) + F(q_t, \theta^t) \\ &> - ( \underbrace{ F(q_t, \theta) - F(q_t, \theta^t)}_{\overset{(*)}\le 0}) \\ &> 0. \end{aligned}

$(*)$ is due to the fact that we maximize $\mathcal{L}(\theta, \theta^t)$ w.r.t $\theta$, which is equivalent to minimize $F(q_t, \theta)$.

This concludes that the EM procedure always increases the expected complete log likelihood. However, the procedure does not guarantee to reach the global optimum.