Expectation Maximization and Mixture Models

March 13, 2021
Table of Content

In many settings, we have data that is generated by some underlying causes that we are actually interested in. Because we do not observe these causes, we cannot model them directly. Instead, we model them as latent variables and use the data we have to learn them or infer their parameters.

Fig. 1: Graphical representation of latent variable models

Consider a latent variable model: ZXZ \rightarrow X . Here, XX is our observed data, and we assume that it comes from the latent variable ZZ. In particular, the likelihood for XX is

p(Xθ)=zp(X,Z=zθ)=zp(XZ=z,θ)p(Z=zθ),\begin{aligned} p(X|\theta) &= \sum_{z} p(X, Z=z| \theta) \\ &= \sum_{z} {p(X| Z=z, \theta)} p(Z=z| \theta), \end{aligned}

that is we marginalize Z out. Denote p(z):=p(Z=zθ)p(z) := p(Z=z| \theta) and p(xz,θ)=p(X=xZ=z,θ)p(xz,θ)=p(X=xZ=z,θ)p( x|z, \theta) = p(X=x | Z = z, \theta)p(x|z, \theta) = p(X=x | Z = z, \theta) where xXx \in \mathcal X. Consider now a dataset D={xi}i=1n\mathcal D = \{ x_i \}_{i=1}^n. Its log-likelihood is

logp(Dθ)=i=1nlogp(xiθ)=i=1nlog(zp(xz,θ)p(z)).\begin{aligned} \log p(\mathcal D| \theta) &= \sum_{i=1}^n \log p(x_i | \theta) \\ &= \sum_{i=1}^n \log \bigg( \sum_{z} p(x| z,\theta) p(z) \bigg). \end{aligned}

We can see that the equation above is non-linear w.r.t. to the parameters of the model θ\theta; hence, we do not have a closed-form solution.

Expectation Maximization (EM)

To find the model's parameters θ\theta , we instead use an iterative procedure. It consists of two steps, namely expectation and maximization.

First, we consider the complete log likelihood logp(D,zθ)\log p(\mathcal D, z | \theta). We then define the loss function

L(θ,θt):=EZ[logp(D,Zθ)=zp(zD,θt)logp(D,zθ),\begin{aligned} \mathcal{L}(\theta, \theta^t) &:= \mathbb{E}_Z [ \log p(\mathcal D, Z| \theta) \\ &= \sum_{z} p(z| \mathcal D, \theta^t) \log p(\mathcal D, z | \theta), \end{aligned}

where θt\theta_t is the parameters of the previous iteration. We know that

p(D,zθ)=i=1np(xi,zθ)=i=1np(xiz,θ)p(z).\begin{aligned} p(\mathcal D , z | \theta) &= \prod_{i=1}^n p(x_i, z| \theta) \\ &= \prod_{i=1}^n p( x_i|z, \theta) p(z). \end{aligned}

Combining L(θ,θt)\mathcal L (\theta, \theta_t) and p(D,zθ)p(\mathcal D, z| \theta) yields

L(θ,θt)=zp(zD,θt)log(i=1np(xiz,θ)p(z))=xp(zD,θt)i=1n[logp(xiz,θ)+logp(z)].\begin{aligned} \mathcal{L}(\theta, \theta^t) &= \sum_z p(z | \mathcal D, \theta^t) \log \bigg( \prod_{i=1}^n p(x_i|z,\theta) p(z) \bigg) \\ &= \sum_x p(z|\mathcal D, \theta^t) \sum_{i=1}^n \bigg[ \log p(x_i|z, \theta) + \log p(z) \bigg]. \end{aligned}

To maximize L(θ,θt)\mathcal L(\theta, \theta^t), we alternate between

  1. Expectation step computing the posterior p(zD,θt),zZp(z| \mathcal D, \theta^t), \forall z \in \mathcal Z.
  2. Maximization step finding the model's parameters θ\theta that maximize L(θ,θt)\mathcal L (\theta, \theta^t); this can be done by θL(θ,θt)=!0\nabla_\theta \mathcal L(\theta, \theta^t) \overset{!}{=} 0 and solve for each θ\theta.

It can be shown that the loss function increases every step; that is L(θ,θt+1)L(θ,θt)\mathcal{L}(\theta, \theta^{t+1}) \ge \mathcal{L}(\theta, \theta^t).

Mixture Models

Mixture of Gaussians

We assume that our data D={xiRd}i=1n\mathcal D = \{ \mathbf x_i \in \reals^d \}_{i=1}^n is generated from some latent factor z[1,K]z \in [1, K]. In particular, each sample's likelihood is

p(xiθ)=k=1Kp(zi=k)=:πkp(xiθk),p(\mathbf x_i | \theta) = \sum_{k=1}^K \underbrace{p(z_i=k)}_{=:\pi_k} p(\mathbf x_i| \theta _k),

where πk\pi_k's are mixing weights and kπk=1\sum_k \pi_k = 1. Here we assume that each component is a Gaussian distribution with θk={μk,Σk}\theta_k= \{ \mu_k, \Sigma_k \} ; hence, we have

p(xθk)=N(xμk,Σk).p(\mathbf x | \theta_k ) = \mathcal N ({\mathbf x}| \mathbf \mu_k, \Sigma_k).

Consider zi{0,1}K\mathbf z_i \in \{0, 1\}^K be the true mixture assignment of xi\mathbf x_i (i.e., it is a one-hot vector whose is one at the cluster that xi\mathbf x_i belongs to). The complete log likelihood is

lc(θ):=i=1nlogp(xi,ziθ)\begin{aligned} l_c(\theta) &:= \sum_{i=1}^n \log p(\mathbf x_i, \mathbf{z}_i | \theta) \end{aligned}

Let ziz_i be the cluster index for xi\mathbf x_i (the index of the one entry). We compute the expected complete log likelihood w.r.t to the approximate posterior q(zx,θt)q(z|\mathbf x, \theta^{t}):

L(θ,θt)=Eq(zx,θt)[i=1nlogp(xi,Ziθ)]=i=1nEq(zx,θt)[logp(xi,Ziθ)]=i=1nEq(zx,θt)[logk=1K(p(xiθk)πk)1[Zi=k]]=i=1nEq(zx,θt)[k1[zi=k]logp(xiθk)πk]=i=1nkEq(zx,θt)1[Zi=k]logp(xiθk)πk=i=1nkp(zi=kxi,θt)logp(xiθk)πk\begin{aligned} \mathcal{L}(\theta, \theta^{t}) &= \mathbb{E}_{q(z|\mathbf x, \theta^{t})}\bigg[ \sum_{i=1}^n \log p(\mathbf x_i, Z_i|\theta)\bigg] \\ &= \sum_{i=1}^n \mathbb{E}_{q(z|\mathbf x, \theta^{t})} \bigg[ \log p(\mathbf x_i, Z_i | \theta) \bigg] \\ &= \sum_{i=1}^n \mathbb{E}_{q(z|\mathbf x, \theta^{t})} \bigg[ \log \prod_{k=1}^K ( p(\mathbf x_i | \theta_k ) \pi_k )^{1[Z_i = k ]}\bigg] \\ &= \sum_{i=1}^n \mathbb{E}_{q(z|\mathbf x, \theta^{t})} \bigg[ \sum_k {1[z_i = k ]} \log p(\mathbf x_i | \theta_k ) \pi_k \bigg] \\ &= \sum_{i=1}^n \sum_k \mathbb{E}_{q(z|\mathbf x, \theta^{t})} {1[Z_i = k ]} \log p(\mathbf x_i | \theta_k ) \pi_k \\ &= \sum_{i=1}^n \sum_k p(z_i=k| \mathbf x_i, \theta^{t}) \log p(\mathbf x_i | \theta_k ) \pi_k \end{aligned}

Define the responsibility (or posterior) rik:=p(zi=kxi,θt)r_{ik} := p(z_i=k|\mathbf x_i, \theta^{t}). We arrive at

L(θ,θt)=i=1nk=1Krik[logp(xiθk)+logπk]\mathcal{L}(\theta, \theta^{t}) = \sum_{i=1}^n \sum_{k=1}^K r_{ik} [\log p(\mathbf x_i | \theta_k) + \log \pi_k]

Expectation Step

Due to the fact that

p(ZX,θt)=p(X,Zθt)p(Xθt),p(Z|X, \theta^{t}) = \frac{p(X, Z| \theta^{t})}{p(X| \theta^{t})},

we can then write

rik=πkp(xiθkt)kπkp(xiθkt).r_{ik} = \frac{\pi_k p(\mathbf x_i | \theta_k^{t})}{\sum_{k'} \pi_{k'} p(\mathbf x_i | \theta_{k'}^{t})}.

Maximization Step

We compute k,πk,θk:=(μk,Σk)\forall _k, \pi_k, \theta_k := (\mu_k, \Sigma_k) that maximize L(θ,θt)\mathcal{L}(\theta, \theta^{t}).

Updating πk\pi_k

We know that kπk=1\sum_k \pi_k = 1. Using Lagrange's multiplier, we get

L=L(θ,θt)λ(kπk1)L = \mathcal{L} (\theta, \theta^{t}) - \lambda \bigg(\sum_k \pi_k - 1\bigg)

Computing Lπk\frac{\partial L}{\partial \pi_k} and set it to zero yields

irikλ=πk.\sum_i \frac{r_{ik}}\lambda = \pi_k.

Combing this into the constraint, we have

kirik=λikrik=1=λ\begin{aligned} \sum_k \sum_i r_{ik} &= \lambda \\ \sum_i \underbrace{\sum_k r_{ik}}_{=1} &= \lambda \end{aligned}

Hence, λ=N\lambda = N and πk=1Nirik\pi_k = \frac{1}{N} \sum_i r_{ik}.

Updating μk\mu_k

Recall that N(xμk,Σk)=1((2π)dΣk)exp(12(xμk)TΣk1(xμk))\mathcal{N}(\mathbf x| \mu_k, \Sigma_k) = \frac{1}{(\sqrt{(2\pi)^d|\Sigma_k|}) }\exp\bigg(-\frac{1}{2}(\mathbf x - \mu_k)^T\Sigma_k^{-1}(\mathbf x - \mu_k)\bigg); hence, we have

L(θ,θt1)=ikrik[logπk+log(1(2π)dΣk)+12(xiμk)TΣk1(xiμk)].\begin{aligned} \mathcal{L}(\theta, \theta^{t-1}) = \sum_i \sum_k r_{ik}\bigg[ \log \pi_k + \log \bigg( \frac{1}{\sqrt{(2\pi)^d|\Sigma_k|}} \bigg) + \frac{1}{2} (\mathbf x_i - \mu_k)^T\Sigma_k^{-1}(\mathbf x_i - \mu_k)\bigg]. \end{aligned}

Taking the derivative w.r.t μk\mu_k we get

μkL(θ,θt1)=irikΣk1(xiμk).\nabla_{\mu_k} \mathcal L (\theta, \theta^{t-1}) = \sum_i r_{ik} \Sigma_k^{-1} (\mathbf x_i - \mu_k).

Setting it to zero yields

μk=irikxiirik\mu_k = \frac{\sum_i r_{ik} \mathbf x_i}{\sum_i r_{ik}}

Updating Σk\Sigma_k

Taking the derivative yields

Σklog1(2π)dΣk=(2π)dΣk1(2π)dΣkΣk1/2=Σk[12(Σk)1Σk1]=Σk12.\begin{aligned} \nabla_{\Sigma_k} \log \frac{1}{\sqrt{{(2\pi)^d} |\Sigma_k|} } &= \sqrt{\cancel{(2\pi)^d} |\Sigma_k|} \frac{1}{\cancel{\sqrt{(2\pi)^d}}} \nabla_{\Sigma_k} |\Sigma_k|^{-1/2} \\ &= \cancel{\sqrt{|\Sigma_k|}} \bigg[ -\frac{1}{2} \cancel{(\sqrt{|\Sigma_k|})^{-1}} \Sigma_k^{-1} \bigg] \\ &= - \frac{\Sigma_k^{-1}}{2}. \end{aligned}

We also have that

Σk(xμk)Σk1(xμk)=Σk1(xμk)(xμk)TΣk1.\nabla_{\Sigma_k} (\mathbf x - \mu_k) \Sigma_k^{-1} (\mathbf x - \mu_k)= -\Sigma_k^{-1}(\mathbf x - \mu_k)(\mathbf x - \mu_k)^T \Sigma_k^{-1}.

Combining the two terms and set the derivative to zero, we arrive at

irikΣk12=irik2Σk1(xiμk)(xiμk)TΣk1Σk1irik=Σk1irik(xiμk)(xiμk)TΣk1.\begin{aligned} \sum_i \frac{r_{ik} \Sigma_k^{-1}}{2} &= \sum_i \frac{r_{ik}}{2} \Sigma_k^{-1}(\mathbf x_i - \mu_k)(\mathbf x_i - \mu_k)^T \Sigma_k^{-1} \\ \Sigma_k^{-1}\sum_i r_{ik} &= \Sigma_k^{-1} \sum_i r_{ik} (\mathbf x_i - \mu_k)(\mathbf x_i - \mu_k)^T \Sigma_k^{-1}. \end{aligned}

Therefore, we have

Σk=irik(xiμk)(xiμk)Tirik.\Sigma_k = \frac{\sum_i r_{ik}(\mathbf x_i - \mu_k)(\mathbf x_i - \mu_k)^T}{\sum_i r_{ik}}.
Fig. 2: Expectation-Maximization steps for a mixture of Gaussians

K-Means Algorithm: Special Case of GMMs

One can view K-means as a special case of GMMs that greedily chooses the closest centroid for each xi\mathbf x_i . In this case, we assume that πk=1K\pi_k = \frac{1}{K} and Σk=σ2Id\Sigma_k = \sigma^2I_d for σR+\sigma \in \R_+. The posterior is

p(zi=kxi,θ)=1[k=zi],p(z_i = k | \mathbf x_i, \theta) = 1[k = z_i^*],

where zi=arg maxkp(zi=kxi,θ)z^*_i = \argmax_k p(z_i =k| \mathbf x_i, \theta). Equivalently, this is similar to

zi=arg minkxiμk2.z_i^* = \argmin_k \| \mathbf x_i - \mu_k \|^2.

As you can see from the derivation, this is a hard assignment assigning a cluster to the sample. Consider DkD\mathcal D_k \subset \mathcal D be the samples belong to the kk-th cluster. the maximization step is then just

μk=1DkxDkx\mu_k = \frac{1}{|\mathcal D_k|} \sum_{\mathbf x \in \mathcal D_k} \mathbf x

Mixture of Binomials

Consider z{H,T}z \in\{ H, T\} and we assume that it comes from a Bernoulli distribution:

p(z=Hθ)=πHp(z=Tθ)=πT,\begin{aligned} p(z=H | \theta) &= \pi_H \\ p(z=T | \theta) &= \pi_T, \end{aligned}

where πH+πT=1\pi_H + \pi_T = 1. Our setting has two Binomial mixture components, namely

p(xz=H,θ)=(mx)μHm(1μH)mxp(xz=T,θ)=(mx)μTm(1μT)mx,\begin{aligned} p(x | z=H, \theta) &= {m \choose x} \mu_H^m (1-\mu_H)^{m-x} \\ p(x | z=T, \theta) &= {m \choose x} \mu_T^m (1-\mu_T)^{m-x}, \end{aligned}

where x[0,m]x \in [0, m]. In our setting, we have θ:={πH,πT,μH,μT}\theta := \{ \pi_H, \pi_T, \mu_H, \mu_T\}. Let D=(x1,,xn).\mathcal{D} = (x_1, \dots, x_n). Recall the complete log-likelihood:

L(θ,θt)=i=1nz=H,Triz[logπz+logp(xiz,θz)].\begin{aligned} \mathcal{L}(\theta, \theta^{t}) = \sum_{i=1}^n \sum_{z=H, T} r_{iz} [ \log \pi_z + \log p(x_i|z, \theta_z) ]. \end{aligned}

Expectation Step

We first recall and write

riH=πHp(xiθHt)z=H,Tπzp(xiθzt)=πHμHm(1μH)mxiz=H,Tπzp(xiθzt).\begin{aligned} r_{iH} &= \frac{\pi_H p(x_i | \theta_H^t)}{\sum_{z=H,T} \pi_z p(x_i | \theta_z^t) } \\ &= \frac{\pi_H \mu_H^m ( 1- \mu_H)^{m-x_i} }{\sum_{z=H,T} \pi_z p(x_i | \theta_z^t) }. \end{aligned}

Similarly, we have

riH=πTμTm(1μT)mxiz=H,Tπzp(xiθzt).\begin{aligned} r_{iH} &= \frac{\pi_T \mu_T^m ( 1- \mu_T)^{m-x_i} }{\sum_{z=H,T} \pi_z p(x_i | \theta_z^t) }. \end{aligned}

Maximization Step

Recall that θ={πH,πT,μH,μT}\theta = \{ \pi_H, \pi_T, \mu_H, \mu_T\}. We compute the derivative of L(θ,θt)\mathcal{L}( \theta, \theta^{t}) w.r.t the parameters and set to zero.

Updating πH\pi_H and πT\pi_T

L(θ,θt)πH=i=1nriHπH[logπH]+riTπH[log1πH]=i=1nriHπHriT1πH=!0\begin{aligned} \frac{\partial \mathcal{L}(\theta, \theta^{t})}{\partial \pi_H } &= \sum_{i=1}^n r_{iH} \frac{\partial}{\partial \pi_H} \bigg [\log \pi_H \bigg] + r_{iT} \frac{\partial}{\partial \pi_H} \bigg [\log 1-\pi_H \bigg] \\ &= \sum_{i=1}^n \frac{r_{iH}}{\pi_H} - \frac{r_{iT}}{1-\pi_H} \\ &\overset{!}{=} 0 \end{aligned}

We get

(1πH)i=1nriH=λi=1nriTi=1nriH=λ(i=1nriT+riH1)πH=1ni=1nriH,\begin{aligned} (1-\pi_H)\sum_{i=1}^n r_{iH} &= \lambda\sum_{i=1}^n r_{iT} \\ \sum_{i=1}^n r_{iH} &= \lambda\bigg(\sum_{i=1}^n \underbrace{r_{iT} + r_{iH}}_1\bigg) \\ \pi_H &= \frac 1 n \sum_{i=1}^n r_{iH}, \end{aligned}

and πT=1πH\pi_T = 1 - \pi_H.

Updating μH\mu_H and μT\mu_T

L(θ,θt1)μH=i=1nriHμH(log(mxi)+xilogμH+(mxi)log(1μH))=i=1nriH(xiμHmxi(1μH))=!0\begin{aligned} \frac{\partial \mathcal{L}(\theta, \theta^{t-1})}{\partial \mu_H} &= \sum_{i=1}^n r_{iH} \frac {\partial } {\partial \mu_H}\bigg( \log {m \choose x_i} + x_i \log \mu_H + (m-x_i)\log(1-\mu_H) \bigg ) \\ &= \sum_{i=1}^n r_{iH} \bigg(\frac{x_i}{\mu_H} - \frac{m-x_i}{(1-\mu_H)} \bigg) \\ &\overset{!}{=} 0 \end{aligned}

Therefore, we have

i=1nriHxiμH=i=1nriH(mxi)(1μH)(1μH)i=1nriHxi=μHi=1nriH(mxi)μH=i=1nriHxii=1nriHm.\begin{aligned} \sum_{i=1}^n \frac{r_{iH}x_i}{\mu_H} &= \sum_{i=1}^n \frac{r_{iH}(m-x_i)}{(1-\mu_H)} \\ (1-\mu_H) \sum_{i=1}^n {r_{iH}x_i} &= \mu_H \sum_{i=1}^n r_{iH}(m-x_i) \\ \mu_H &= \frac{\sum_{i=1}^n {r_{iH}x_i} }{\sum_{i=1}^n {r_{iH}m} }. \end{aligned}

Similarly, we have μT=i=1nriTxii=1nriTm\mu_T = \frac{\sum_{i=1}^n {r_{iT}x_i} }{\sum_{i=1}^n {r_{iT}m} }.

Fig. 3: Expectation-Maximization steps for a mixture of binomials


These are resources that I consulted while writing this article

  • Murphy (2012), "Machine Learning: A Probabilistic Perspective" (Section 11.4) for the general idea of the EM algorithm.
  • Manfred Opper's Probabilistic Bayesian Modeling course, Lecture Laplace Approximation (TU Berlin, Summer 2020) for Proof of the EM step
  • Woijeck Samek's Machine Learning 1 course, Lecture Latent Variable Models (TU Berlin, Winter 2020) for the Mixture of Binomials example

This articles' figures are made in Google Colab.

Proof: the EM procedure increases the expected complete log-likelihood

In the following, we will show that L(θ,θt+1)L(θ,θt)\mathcal{L}(\theta, \theta^{t+1}) \ge \mathcal{L}(\theta, \theta^t).

Consider q()q(\cdot) be any distribution over the latent variable ZZ. We know that p(D,zθ)=p(zD,θ)p(Dθ)p(\mathcal D, z| \theta) = p(z|\mathcal D, \theta) p(\mathcal D | \theta) . We can then compute

KL(q(Z)p(ZD,θ))=zq(x)logq(z)p(zD,θ)=zq(z)logq(z)p(Dθ)p(z,Dθ)\begin{aligned} \text{KL} ( q(Z) \| p(Z |\mathcal D, \theta)) &= \sum_{z} q(x) \log \frac{q(z)}{p(z|\mathcal D, \theta)}\\ &= \sum_{z} q(z) \log \frac{q(z) p(\mathcal D | \theta)}{p(z, \mathcal D| \theta)} \end{aligned}

Because KL()0\text{KL}(\cdot \| \cdot) \ge 0, we have

zq(z)logq(z)p(z,Dθ)=:F(q,θ)logp(Dθ)\begin{aligned} \underbrace{\sum_z q(z) \log \frac{q(z)}{p(z, \mathcal D | \theta)}}_{=: F(q, \theta) } \ge - \log p(\mathcal D | \theta) \end{aligned}

Now consider qt(z):=p(zD,θt)q_t(z) := p(z | \mathcal D, \theta^t). We have

L(θ,θt)=zqt(z)logp(z,Dθ)=zqt(z)logp(z,Dθ)qt(z)qt(z)=F(qt,θ)+zqt(z)logqt(z)\begin{aligned} \mathcal{L}(\theta, \theta^t) &= \sum_z q_t(z) \log p(z, \mathcal D | \theta) \\ &= \sum_z q_t(z) \log p(z, \mathcal D | \theta) \frac{q_t(z) }{q_t(z)} \\ &= - F(q_t, \theta) + \sum_z q_t(z) \log q_t(z) \end{aligned}

Consider the fact that F(qt,θ)logp(Dθ)F(q_t, \theta) \ge - \log p(\mathcal D| \theta). We have two cases, namely

  1. F(qt,θ)>logp(Dθ)F(q_t, \theta) > - \log p(\mathcal D|\theta)
  2. F(qt,θ)=logp(Dθ)    F(qt,θt)=logp(Dθt)F(q_t, \theta) = - \log p( \mathcal D | \theta) \implies F(q_t, \theta_t) = - \log p(\mathcal D |\theta_t).

Subtract these two conditions lead to

F(qt,θ)F(qt,θt)logp(Dθ)+logp(Dθt)F(q_t, \theta) - F(q_t, \theta_t) \ge - \log p(\mathcal D | \theta) + \log p(\mathcal D| \theta_t)

In other words, we have

logp(Dθ)logp(Dθt)>F(qt,θ)+F(qt,θt)>(F(qt,θ)F(qt,θt)()0)>0.\begin{aligned} \log p(\mathcal D | \theta) - \log p(\mathcal D | \theta^t) &> - F(q_t, \theta) + F(q_t, \theta^t) \\ &> - ( \underbrace{ F(q_t, \theta) - F(q_t, \theta^t)}_{\overset{(*)}\le 0}) \\ &> 0. \end{aligned}

()(*) is due to the fact that we maximize L(θ,θt)\mathcal{L}(\theta, \theta^t) w.r.t θ\theta, which is equivalent to minimize F(qt,θ)F(q_t, \theta).

This concludes that the EM procedure always increases the expected complete log likelihood. However, the procedure does not guarantee to reach the global optimum.