
Probabilistic PCA

August 26, 2021

This article is a personal summary of Bishop (2006)'s Chapter 12.2.

Formulation: Linear Generative Model

Let $\mathbf x \in \reals^d$ be an observed sample. We model it using the following linear generative model

$$\mathbf x = W \mathbf z + \mu + \epsilon,$$

where

  • $W \in \reals^{d \times p}$ is a mixing matrix;
  • $\mu \in \reals^d$ is a mean or bias;
  • $\mathbf z \in \reals^p \sim p_Z := \mathcal N(0, I_p)$ is the latent variable;
  • $\epsilon \in \reals^d \sim p_\epsilon := \mathcal N(0, \sigma^2 I_d)$ is isotropic noise.
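To make the generative process concrete, here is a minimal NumPy sketch of sampling from this model; the dimensions, `W`, `mu`, and `sigma` below are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and parameters (assumed for the sketch).
d, p = 2, 1
W = rng.normal(size=(d, p))      # mixing matrix W in R^{d x p}
mu = np.zeros(d)                 # mean / bias
sigma = 0.1                      # noise standard deviation

def sample_x(n):
    """Draw n samples x = W z + mu + eps with z ~ N(0, I_p) and eps ~ N(0, sigma^2 I_d)."""
    z = rng.normal(size=(n, p))                # latent variables
    eps = sigma * rng.normal(size=(n, d))      # isotropic noise
    return z @ W.T + mu + eps                  # shape (n, d)

X = sample_x(1000)
```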

Example

The example below is a linear generative model where $p = 1$ (i.e., $z \sim \mathcal N(0, 1)$) and $d = 2$. We parameterize $W \in \reals^{2 \times 1}$ using polar coordinates; that is,

$$W = \begin{bmatrix} r \cos(\theta)\\ r \sin(\theta) \end{bmatrix},$$

where $r \in [0.1, 5]$ and $\theta \in [0, 2\pi]$. We assume $\mu = [0, 0]$.

The animation shows the mechanism: for each draw of $z$, both $W\mathbf z$ (the noiseless point on the line spanned by $W$) and $W\mathbf z + \epsilon$ (the observed noisy point) are displayed; a short sketch of this setup follows below.
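A possible sketch of this two-dimensional example, with $r$ and $\theta$ fixed to arbitrary illustrative values (the animation sweeps over them):

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary illustrative values; the animation varies r and theta.
r, theta = 2.0, np.pi / 4
W = np.array([[r * np.cos(theta)],
              [r * np.sin(theta)]])            # W in R^{2 x 1}
sigma = 0.3

z = rng.normal(size=(500, 1))                  # z ~ N(0, 1)
x_clean = z @ W.T                              # W z: points on the line spanned by W
x_noisy = x_clean + sigma * rng.normal(size=x_clean.shape)   # W z + eps
```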

The conditional distribution of $\mathbf x$ given $\mathbf z$ is

$$p_{X|Z} = \mathcal N(W\mathbf z + \mu, \sigma^2 I_d),$$

and the marginal distribution over $\mathbf x$ is

$$\begin{aligned} p_X &= \int p_{XZ\epsilon}(\cdot, \mathbf z, \epsilon)\, d\mathbf z\, d\epsilon \\ &= \int p_{XZ}(\cdot, \mathbf z)\, p_\epsilon(\epsilon)\, d\mathbf z\, d\epsilon \\ &= \int p_{XZ}(\cdot, \mathbf z) \underbrace{\bigg[ \int p_\epsilon(\epsilon)\, d\epsilon \bigg]}_{=1} d\mathbf z \\ &= \int p_{X|Z}(\cdot \mid \mathbf z)\, p_Z(\mathbf z)\, d\mathbf z, \end{aligned}$$

where the second step is due to the independence assumption between $(X, Z)$ and $\epsilon$.

Observe that both $p_{X|Z}$ and $p_Z$ are Gaussian, so the integral has an analytical form. In particular, one can show that marginalizing a Gaussian conditional over a Gaussian prior yields another Gaussian, i.e.,

$$p_X = \mathcal N(\nu, \Sigma).$$

We can derive $\nu \in \reals^d$ and $\Sigma \in \reals^{d\times d}$ as follows:

  • First, we recall that $\nu = \mathbb{E}_{p_X}[X]$. Thus, we have

    $$\begin{aligned} \nu &= \mathbb{E}_{p_X}[W\mathbf z + \mu + \epsilon] \\ &= W \mathbb E_{p_Z}[\mathbf z] + \mu + \mathbb E_{p_\epsilon}[\epsilon] \\ &= \mu, \end{aligned}$$

    where the two expectations are zero by assumption; note that the distributions over which these expectations are taken can be derived by writing $p_X$ as below and exchanging the order of the integrals.

    $$p_X = \int p_{XZ\epsilon}(\cdot, \mathbf z, \epsilon)\, d\mathbf z\, d\epsilon$$
  • Second, we have

    $$\Sigma = \mathbb{E}_{p_X}[(\mathbf x - \mu)(\mathbf x - \mu)^T] = \mathbb{E}_{p_X}[(W\mathbf z + \epsilon)(W\mathbf z + \epsilon)^T].$$

    Expanding the terms yields

    $$\begin{aligned} \Sigma &= \mathbb E_{p_X}\big[ W\mathbf z \mathbf z^T W^T + W\mathbf z \epsilon^T + \epsilon \mathbf z^T W^T + \epsilon \epsilon^T \big] \\ &= W\, \mathbb E_{p_Z}[\mathbf z \mathbf z^T]\, W^T + W\, \mathbb E_{p_{Z\epsilon}}[\mathbf z \epsilon^T] + \big(W\, \mathbb E_{p_{Z\epsilon}}[\mathbf z \epsilon^T]\big)^T + \mathbb E_{p_\epsilon}[\epsilon\epsilon^T] \\ &= W I_p W^T + W\, \mathbb E_{p_{Z\epsilon}}[\mathbf z \epsilon^T] + \big(W\, \mathbb E_{p_{Z\epsilon}}[\mathbf z \epsilon^T]\big)^T + \sigma^2 I_d. \end{aligned}$$

    We know that $Z \perp \epsilon$; hence, we have $p_{Z\epsilon}(\mathbf z, \epsilon) = p_Z(\mathbf z)\, p_\epsilon(\epsilon)$. This leads to

    $$\begin{aligned} \mathbb E_{p_{Z\epsilon}}[\mathbf z \epsilon^T] &= \int p_Z(\mathbf z)\, p_\epsilon(\epsilon)\, \mathbf z \epsilon^T\, d\mathbf z\, d\epsilon \\ &= \int p_Z(\mathbf z)\, \mathbf z \underbrace{\bigg[ \int p_\epsilon(\epsilon)\, \epsilon^T\, d\epsilon \bigg]}_{=\mathbf 0} d\mathbf z \\ &= \mathbf 0. \end{aligned}$$

    Thus, the cross terms vanish and we have $\Sigma = WW^T + \sigma^2 I_d$.

Putting the two steps together yields

$$p_X = \mathcal N(\mu, W W^T + \sigma^2 I_d).$$
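As a quick sanity check of this marginal, one can compare Monte Carlo estimates of the mean and covariance of sampled $\mathbf x$ against $\mu$ and $WW^T + \sigma^2 I_d$. A minimal sketch with assumed parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed parameters, chosen only for illustration.
d, p = 2, 1
W = rng.normal(size=(d, p))
mu = np.array([1.0, -0.5])
sigma = 0.2

n = 200_000
z = rng.normal(size=(n, p))
eps = sigma * rng.normal(size=(n, d))
X = z @ W.T + mu + eps

print(X.mean(axis=0))                            # close to mu
print(np.cov(X, rowvar=False))                   # close to W W^T + sigma^2 I_d
print(W @ W.T + sigma**2 * np.eye(d))            # analytical covariance for comparison
```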

Inferring Parameters

Consider a dataset $\mathcal D = \{\mathbf x_i\}_{i=1}^n$ where the $\mathbf x_i$'s are assumed to be independent and identically distributed according to $p_X$. Let $\Theta := \{\mu, \Sigma\}$ be the parameters of the model. The log-likelihood of the model is

$$\begin{aligned} \log p(\mathcal D \mid \Theta) &= \sum_{i=1}^n \log p_X(\mathbf x_i) \\ &= \sum_{i=1}^n \bigg[ -\frac{d}{2}\log 2\pi - \frac{1}{2}\log |\Sigma| - \frac{1}{2} (\mathbf x_i - \mu)^T \Sigma^{-1} (\mathbf x_i - \mu) \bigg] \\ &= -\frac{nd}{2}\log 2\pi - \frac{n}{2}\log |\Sigma| - \frac{1}{2} \sum_{i=1}^n (\mathbf x_i - \mu)^T \Sigma^{-1} (\mathbf x_i - \mu) \\ &=: \mathcal L(\Theta). \end{aligned}$$
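The expression above can be evaluated directly; a small sketch, where the function name `log_likelihood` is an illustrative choice:

```python
import numpy as np

def log_likelihood(X, mu, Sigma):
    """Evaluate L(Theta) = sum_i log N(x_i | mu, Sigma) for X of shape (n, d)."""
    n, d = X.shape
    diff = X - mu                                # (n, d)
    _, logdet = np.linalg.slogdet(Sigma)         # numerically stable log |Sigma|
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * n * d * np.log(2 * np.pi) - 0.5 * n * logdet - 0.5 * quad.sum()
```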

Finding Mean

Taking the derivative of $\mathcal L(\Theta)$ w.r.t. $\mu$ and setting it to zero yields

$$\begin{aligned} \nabla_\mu \mathcal L(\Theta) &= \sum_{i=1}^n \Sigma^{-1} (\mathbf x_i - \mu) \\ &\overset{!}{=} 0 \\ \implies \mu^\star &= \frac{1}{n} \sum_{i=1}^n \mathbf x_i. \end{aligned}$$
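A quick numerical check of this result on placeholder data: the gradient $\sum_i \Sigma^{-1}(\mathbf x_i - \mu)$ indeed vanishes at the sample mean, for any invertible $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))                    # placeholder data

mu_star = X.mean(axis=0)                         # closed-form ML estimate of the mean

# sum_i (x_i - mu_star) = 0 by construction, so the gradient vanishes for any invertible Sigma.
Sigma = np.cov(X, rowvar=False)
grad = np.linalg.inv(Sigma) @ (X - mu_star).sum(axis=0)
print(np.allclose(grad, 0.0))                    # True (up to floating-point error)
```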

Finding Covariance

Similarly, we have

$$\begin{aligned} \nabla_\Sigma \mathcal L(\Theta) &\overset{!}{=} 0 \\ \implies \Sigma^\star &= \frac{1}{n} \sum_{i=1}^n (\mathbf x_i - \mu)(\mathbf x_i - \mu)^T, \end{aligned}$$

which is the data covariance matrix. To find the parameters $W$ and $\sigma^2$, we first perform the eigenvalue decomposition $\Sigma^\star = U\Lambda U^T$. Without loss of generality, we assume that the eigenvalues (and the corresponding eigenvectors) are sorted in descending order. We then decompose the data covariance matrix into

$$\Sigma^\star = U_W \Lambda_W U_W^T + U_\epsilon \Lambda_\epsilon U_\epsilon^T,$$

where

  • the columns of $U_W \in \reals^{d\times p}$ are the first $p$ columns of $U$, and $\Lambda_W \in \reals^{p \times p}$ contains the corresponding first $p$ eigenvalues;
  • conversely, $U_\epsilon \in \reals^{d \times (d-p)}$ and $\Lambda_\epsilon \in \reals^{(d-p)\times (d-p)}$ are built from the remaining $d - p$ columns of $U$ and eigenvalues of $\Lambda$.

Recall also that $U^T U = U U^T = I_d$. Thus, we have

$$I_d = U_W U_W^T + U_\epsilon U_\epsilon^T,$$

and

$$U_W \Lambda_W U_W^T + U_\epsilon \Lambda_\epsilon U_\epsilon^T = W W^T + \sigma^2 \big( U_W U_W^T + U_\epsilon U_\epsilon^T \big).$$
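The orthogonality identity and the split of $U$ are easy to verify numerically; a small sketch with a random covariance standing in for $\Sigma^\star$:

```python
import numpy as np

rng = np.random.default_rng(4)

# A random symmetric positive-definite matrix standing in for Sigma_star (illustrative).
d, p = 5, 2
A = rng.normal(size=(d, d))
Sigma_star = A @ A.T / d

# Eigendecomposition, with eigenvalues sorted in descending order.
eigvals, eigvecs = np.linalg.eigh(Sigma_star)    # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], eigvecs[:, order]

U_W, U_eps = U[:, :p], U[:, p:]                  # first p vs. remaining d - p eigenvectors

print(np.allclose(U_W @ U_W.T + U_eps @ U_eps.T, np.eye(d)))   # True: I_d = U_W U_W^T + U_eps U_eps^T
```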

To find $W$, we assume that the variances of the first $p$ dimensions (i.e., those spanned by $U_W$) are captured by it. Hence, collecting the terms associated with these dimensions yields

$$\begin{aligned} WW^T &= U_W \Lambda_W U_W^T - \sigma^2 U_W U_W^T \\ &= U_W \underbrace{(\Lambda_W - \sigma^2 I_p)}_{=:\Gamma} U_W^T \\ &= U_W \Gamma^{1/2} \Gamma^{1/2} U_W^T \\ &= U_W \Gamma^{1/2} (U_W \Gamma^{1/2})^T \\ \implies W &= U_W \Gamma^{1/2}. \end{aligned}$$

To find $\sigma^2$, we then assume it to be the average of the variances spanned by $U_\epsilon$, i.e.,

$$\sigma^2 = \frac{1}{d-p} \sum_{k=p+1}^d \lambda_k,$$

where $\lambda_k := \Lambda_{kk} \in \reals_+$.

To summarize, with the assumption that $p_Z$ and $p_\epsilon$ are Gaussian, we have

$$\begin{aligned} p_{X|Z} &= \mathcal N(W\mathbf z + \mu, \sigma^2 I_d) \\ p_X &= \mathcal N(\mu, WW^T + \sigma^2 I_d), \end{aligned}$$

where the parameters of the model are

$$\begin{aligned} \mu &= \frac{1}{n} \sum_{i=1}^n \mathbf x_i \\ \Sigma &= \frac{1}{n} \sum_{i=1}^n (\mathbf x_i - \mu)(\mathbf x_i - \mu)^T \\ \sigma^2 &= \frac{1}{d-p} \sum_{k=p+1}^d \lambda_k \\ W &= U_W (\Lambda_W - \sigma^2 I_p)^{1/2}. \end{aligned}$$
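Putting the recipe together, below is a minimal NumPy sketch of these closed-form estimates; the function name `fit_ppca` and the synthetic data are illustrative assumptions, not from the text.

```python
import numpy as np

def fit_ppca(X, p):
    """Closed-form PPCA estimates (mu, W, sigma^2) for data X of shape (n, d)."""
    n, d = X.shape
    mu = X.mean(axis=0)

    diff = X - mu
    Sigma = diff.T @ diff / n                    # data covariance (1/n convention)

    # Eigendecomposition with eigenvalues sorted in descending order.
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    sigma2 = eigvals[p:].mean()                  # average of the trailing d - p eigenvalues
    U_W, Lambda_W = eigvecs[:, :p], eigvals[:p]
    W = U_W * np.sqrt(Lambda_W - sigma2)         # U_W (Lambda_W - sigma^2 I_p)^{1/2}
    return mu, W, sigma2

# Illustrative usage on synthetic data drawn from the generative model itself.
rng = np.random.default_rng(5)
d, p = 5, 2
W_true = rng.normal(size=(d, p))
mu_true = rng.normal(size=d)
sigma_true = 0.3

n = 10_000
X = rng.normal(size=(n, p)) @ W_true.T + mu_true + sigma_true * rng.normal(size=(n, d))

mu_hat, W_hat, sigma2_hat = fit_ppca(X, p)
print(sigma2_hat, sigma_true**2)                 # estimated vs. true noise variance
print(np.abs(W_hat @ W_hat.T - W_true @ W_true.T).max())  # small; W itself is identified only up to a rotation
```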

Example: Inferring Parameters of Linear Generative Model

Inferring parameters of a linear generative model using training sets of various sizes.

Observe that both the generated data and the inferred distribution align with the shape of a Gaussian distribution.

Conclusion

Linear generative models are simple generative models. They are the foundation of Probabilistic PCA [Tipping and Bishop (1999)]. Through this lens, PPCA offers a modeling perspective on PCA, complementing the two other traditional views: finding a subspace that (1) maximizes variance or (2) minimizes reconstruction error.

Acknowledgement

Apart from Bishop (2006)'s Chapter 12.2, I also gained some intuition from watching Erik Bekkers's video.

The last figure is from a notebook on Google Colab.