This article is a personal summary of Chapter 12.2 of Bishop (2006).
Let $\mathbf x \in \reals^d$ be an observed sample. We model it using the following linear generative model:

$$
\mathbf x = W \mathbf z + \mu + \epsilon,
$$

where

- $W \in \reals^{d \times p}$ is a mixing matrix;
- $\mu \in \reals^d$ is a mean or bias;
- $\mathbf z \in \reals^p \sim p_Z := \mathcal N(0, I_p)$ is the latent variable;
- $\epsilon \in \reals^d \sim p_\epsilon := \mathcal N(0, \sigma^2 I_d)$ is isotropic noise.

Example
The example below is a linear generative model with $p = 1$ (i.e., $z \sim \mathcal N(0, 1)$) and $d = 2$. We parameterize $W \in \reals^{2 \times 1}$ using polar coordinates; that is,

$$
W = \begin{bmatrix} r \cos(\theta) \\ r \sin(\theta) \end{bmatrix},
$$

where $r \in [0.1, 5]$ and $\theta \in [0, 2\pi]$. We assume $\mu = [0, 0]^T$.
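Although the animation itself cannot be reproduced here, a minimal NumPy sketch of this sampling mechanism might look as follows; the specific values of $r$, $\theta$, $\sigma$, and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameters (arbitrary choices for illustration).
r, theta = 2.0, np.pi / 4
W = np.array([[r * np.cos(theta)],
              [r * np.sin(theta)]])        # shape (d, p) = (2, 1)
mu = np.zeros(2)
sigma = 0.3

# Draw n samples from the generative model x = W z + mu + eps.
n = 500
z = rng.normal(size=(n, 1))                # z ~ N(0, I_p)
eps = sigma * rng.normal(size=(n, 2))      # eps ~ N(0, sigma^2 I_d)
x_clean = z @ W.T + mu                     # W z + mu: points on the line spanned by W
x = x_clean + eps                          # observed samples scatter around that line
print(x[:3])
```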
The animation shows the mechanism as follows:

- $W\mathbf z$ is displayed as ●;
- $W\mathbf z + \epsilon$ is displayed as ●.

The conditional distribution of $\mathbf x$ given $\mathbf z$ is
$$
p_{X|Z} = \mathcal N(W\mathbf z + \mu, \sigma^2 I_d),
$$

and the marginal distribution over $\mathbf x$ is
$$
\begin{aligned}
p_X &= \int p_{XZ\epsilon}(\cdot, \mathbf z, \epsilon) \, d\mathbf z \, d\epsilon \\
&= \int p_{XZ}(\cdot, \mathbf z) \, d\mathbf z \\
&= \int p_{X|Z}(\cdot \mid \mathbf z) \, p_Z(\mathbf z) \, d\mathbf z,
\end{aligned}
$$

where the second step marginalizes $\epsilon$ out of the joint and the last step applies the product rule.
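This marginalization can also be approximated numerically: drawing $\mathbf z_s \sim p_Z$ and averaging the conditional densities $p_{X|Z}(\mathbf x \mid \mathbf z_s)$ gives a Monte Carlo estimate of $p_X(\mathbf x)$. The sketch below is my own illustration (reusing `rng`, `W`, `mu`, and `sigma` from the snippet above) and compares this estimate with the closed-form marginal derived later in the article.

```python
from scipy.stats import multivariate_normal

x_query = np.array([1.0, 1.0])

# Monte Carlo estimate: p_X(x) ≈ (1/S) * sum_s N(x; W z_s + mu, sigma^2 I_d).
S = 100_000
z_s = rng.normal(size=(S, 1))
means = z_s @ W.T + mu                                   # W z_s + mu, shape (S, 2)
sq_dist = np.sum((x_query - means) ** 2, axis=1)
cond_pdf = np.exp(-sq_dist / (2 * sigma**2)) / (2 * np.pi * sigma**2)  # Gaussian pdf, d = 2
print("Monte Carlo estimate:", cond_pdf.mean())

# Closed-form marginal N(mu, W W^T + sigma^2 I_d), derived below.
print("Closed form        :", multivariate_normal(mu, W @ W.T + sigma**2 * np.eye(2)).pdf(x_query))
```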
Observe that both $p_{X|Z}$ and $p_Z$ are Gaussian, so the integral has an analytical form. In particular, one can show that marginalizing a Gaussian conditional over a Gaussian prior yields another Gaussian, i.e.,
$$
p_X = \mathcal N(\nu, \Sigma).
$$

We can derive $\nu \in \reals^d$ and $\Sigma \in \reals^{d \times d}$ as follows:
First, we recall that $\nu = \mathbb{E}_{p_X}[X]$. Thus, we have
$$
\begin{aligned}
\nu &= \mathbb{E}[ W\mathbf z + \mu + \epsilon ] \\
&= W \, \mathbb E_{p_Z}[\mathbf z] + \mu + \mathbb E_{p_\epsilon}[\epsilon] \\
&= \mu,
\end{aligned}
$$

where the two expectations are zero by assumption; the distributions over which these expectations are taken can be derived by writing $p_X$ as below and exchanging the order of the integrals.
$$
p_X = \int p_{XZ\epsilon}(\cdot, \mathbf z, \epsilon) \, d\mathbf z \, d\epsilon.
$$

Second, we have
$$
\Sigma = \mathbb{E}_{p_X}[ (\mathbf x - \mu)(\mathbf x - \mu)^T] = \mathbb{E}[ (W\mathbf z + \epsilon)(W\mathbf z + \epsilon)^T].
$$

Expanding the terms yields
$$
\begin{aligned}
\Sigma &= \mathbb E[ W\mathbf z \mathbf z^T W^T + W\mathbf z \epsilon^T + \epsilon \mathbf z^T W^T + \epsilon \epsilon^T] \\
&= W \, \mathbb E_{p_Z}[ \mathbf z \mathbf z^T] \, W^T + W \, \mathbb E_{p_{Z\epsilon}}[ \mathbf z \epsilon^T] + \mathbb E_{p_{Z\epsilon}}[ \epsilon \mathbf z^T] \, W^T + \mathbb E_{p_\epsilon}[ \epsilon\epsilon^T] \\
&= W I_p W^T + W \, \mathbb{E}_{p_{Z\epsilon}}[\mathbf z \epsilon^T] + \big(W \, \mathbb{E}_{p_{Z\epsilon}}[\mathbf z \epsilon^T]\big)^T + \sigma^2 I_d.
\end{aligned}
$$

We know that $Z \perp \epsilon$; hence, $p_{Z\epsilon}(\mathbf z, \epsilon) = p_Z(\mathbf z) \, p_\epsilon(\epsilon)$. This leads to
$$
\begin{aligned}
\mathbb E_{p_{Z\epsilon}}[\mathbf z \epsilon^T] &= \int p_Z(\mathbf z) \, p_\epsilon(\epsilon) \, \mathbf z \epsilon^T \, d\mathbf z \, d\epsilon \\
&= \int p_Z(\mathbf z) \, \mathbf z \underbrace{\bigg[ \int p_\epsilon(\epsilon) \, \epsilon^T \, d\epsilon \bigg]}_{=\mathbf 0} d\mathbf z \\
&= \mathbf 0.
\end{aligned}
$$

Thus, we have $\Sigma = W W^T + \sigma^2 I_d$.
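As a quick numerical sanity check of $\nu = \mu$ and $\Sigma = W W^T + \sigma^2 I_d$, one can compare the sample mean and covariance of many draws with the closed form. The snippet below is only an illustrative check, again reusing the toy example's variables.

```python
# Draw many samples from the toy model and compare moments with the closed form.
m = 200_000
z = rng.normal(size=(m, 1))
eps = sigma * rng.normal(size=(m, 2))
x = z @ W.T + mu + eps

print("sample mean        :", x.mean(axis=0))               # ~ mu = [0, 0]
print("sample covariance  :\n", np.cov(x, rowvar=False))     # ~ W W^T + sigma^2 I_2
print("W W^T + sigma^2 I_2:\n", W @ W.T + sigma**2 * np.eye(2))
```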
Putting the two steps together yields
$$
p_X = \mathcal N(\mu, W W^T + \sigma^2 I_d).
$$

Inferring Parameters

Consider a dataset $\mathcal D = \{\mathbf x_i\}_{i=1}^n$ where the $\mathbf x_i$'s are assumed to be independent and identically distributed according to $p_X$. Define $\Theta := \{ \mu, \Sigma\}$ to be the parameters of the model. The log-likelihood of the model is
$$
\begin{aligned}
\log p(\mathcal D \mid \Theta) &= \sum_{i=1}^n \log p_X(\mathbf x_i) \\
&= \sum_{i=1}^n \bigg[ - \frac{d}{2}\log 2\pi - \frac{1}{2}\log |\Sigma| - \frac{1}{2} (\mathbf x_i - \mu)^T \Sigma^{-1} (\mathbf x_i - \mu) \bigg] \\
&= - \frac{nd}{2}\log 2\pi - \frac{n}{2}\log |\Sigma| - \frac{1}{2} \sum_{i=1}^n (\mathbf x_i - \mu)^T \Sigma^{-1} (\mathbf x_i - \mu) \\
&=: \mathcal L(\Theta).
\end{aligned}
$$

Finding Mean

Taking the derivative of $\mathcal L(\Theta)$ with respect to $\mu$ and setting it to zero yields
$$
\begin{aligned}
\nabla_\mu \mathcal L(\Theta) &= \sum_{i=1}^n \Sigma^{-1} (\mathbf x_i - \mu) \\
&\overset{!}{=} 0 \\
\implies \mu^\star &= \frac{1}{n} \sum_{i=1}^n \mathbf x_i.
\end{aligned}
$$

Finding Covariance

Similarly, we have
$$
\begin{aligned}
\nabla_\Sigma \mathcal L(\Theta) &\overset{!}{=} 0 \\
\implies \Sigma^\star &= \frac{1}{n} \sum_{i=1}^n (\mathbf x_i - \mu)(\mathbf x_i - \mu)^T,
\end{aligned}
$$

which is the data covariance matrix. To find the parameters $W$ and $\sigma^2$, we first perform the eigenvalue decomposition $\Sigma^\star = U \Lambda U^T$. Without loss of generality, we assume that the eigenvalues (and their corresponding eigenvectors) are sorted in descending order. We then decompose the data covariance matrix into
$$
\Sigma^\star = U_W \Lambda_W U_W^T + U_\epsilon \Lambda_\epsilon U_\epsilon^T,
$$

where
- the columns of $U_W \in \reals^{d \times p}$ are the first $p$ columns of $U$, and $\Lambda_W \in \reals^{p \times p}$ contains the corresponding first $p$ eigenvalues;
- conversely, the columns of $U_\epsilon \in \reals^{d \times (d-p)}$ and the entries of $\Lambda_\epsilon \in \reals^{(d-p) \times (d-p)}$ are the remaining columns of $U$ and eigenvalues of $\Lambda$, respectively.

Recall also that $U^T U = U U^T = I_d$. Thus, we have
$$
I_d = U_W U_W^T + U_\epsilon U_\epsilon^T,
$$
and
$$
\begin{aligned}
U_W \Lambda_W U_W^T + U_\epsilon \Lambda_\epsilon U_\epsilon^T &= W W^T + \sigma^2 \big( U_W U_W^T + U_\epsilon U_\epsilon^T \big).
\end{aligned}
$$

To find $W$, we assume that the variances along the first $p$ dimensions (i.e., those spanned by $U_W$) are captured by it. Hence, collecting the terms associated with these dimensions yields
$$
\begin{aligned}
W W^T &= U_W \Lambda_W U_W^T - \sigma^2 U_W U_W^T \\
&= U_W \underbrace{(\Lambda_W - \sigma^2 I_p)}_{=:\Gamma} U_W^T \\
&= U_W \Gamma^{1/2} \Gamma^{1/2} U_W^T \\
&= U_W \Gamma^{1/2} (U_W \Gamma^{1/2})^T \\
\implies W &= U_W \Gamma^{1/2}.
\end{aligned}
$$

To find $\sigma^2$, we then assume it to be the average of the variances spanned by $U_\epsilon$, i.e.,
$$
\sigma^2 = \frac{1}{d-p} \sum_{k=p+1}^d \lambda_k,
$$

where $\lambda_k := \Lambda_{kk} \in \reals_+$.
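Collecting the estimators above, a minimal NumPy sketch of this closed-form fit might look as follows. The helper name `fit_linear_gaussian` is hypothetical, and the recovered $W$ is only determined up to a sign flip (more generally, an orthogonal transformation of the latent space).

```python
import numpy as np

def fit_linear_gaussian(X, p):
    """Closed-form fit of (mu, sigma^2, W) from data X with shape (n, d)."""
    n, d = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / n                      # maximum-likelihood covariance (1/n)

    # Eigendecomposition with eigenvalues/eigenvectors sorted in descending order.
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    sigma2 = eigvals[p:].mean()                  # average of the trailing d - p eigenvalues
    U_W, Lambda_W = eigvecs[:, :p], eigvals[:p]
    # W = U_W (Lambda_W - sigma^2 I_p)^{1/2}; clip at zero for numerical safety.
    W = U_W * np.sqrt(np.maximum(Lambda_W - sigma2, 0.0))
    return mu, sigma2, W
```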
To summarize, under the assumption that $p_Z$ and $p_\epsilon$ are Gaussian, we have
$$
\begin{aligned}
p_{X|Z} &= \mathcal N(W\mathbf z + \mu, \sigma^2 I_d) \\
p_X &= \mathcal N(\mu, W W^T + \sigma^2 I_d),
\end{aligned}
$$

where the parameters of the model are
$$
\begin{aligned}
\mu &= \frac{1}{n} \sum_{i=1}^n \mathbf x_i \\
\Sigma &= \frac{1}{n} \sum_{i=1}^n (\mathbf x_i - \mu)(\mathbf x_i - \mu)^T \\
\sigma^2 &= \frac{1}{d-p} \sum_{k=p+1}^d \lambda_k \\
W &= U_W (\Lambda_W - \sigma^2 I_p)^{1/2}.
\end{aligned}
$$

Example: Inferring Parameters of a Linear Generative Model

Inferring parameters of a linear generative model using training sets of various sizes.
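As a rough stand-in for the figure above (not the original Colab notebook mentioned in the acknowledgement), one could regenerate data from the toy model and inspect the recovered parameters for increasing training-set sizes, using the `fit_linear_gaussian` sketch from earlier:

```python
rng = np.random.default_rng(0)
r, theta, sigma = 2.0, np.pi / 4, 0.3
W_true = np.array([[r * np.cos(theta)],
                   [r * np.sin(theta)]])
mu_true = np.zeros(2)

for n in [10, 100, 1_000, 10_000]:
    z = rng.normal(size=(n, 1))
    eps = sigma * rng.normal(size=(n, 2))
    X = z @ W_true.T + mu_true + eps
    mu_hat, sigma2_hat, W_hat = fit_linear_gaussian(X, p=1)
    # W_hat may differ from W_true by a sign flip.
    print(n, mu_hat.round(3), round(sigma2_hat, 4), W_hat.ravel().round(3))
```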
Observe that both the generated data and the inferred distribution align with the shape of a Gaussian distribution.
Conclusion

Linear generative models are simple generative models. They are the foundation of Probabilistic PCA [Tipping and Bishop (1999)].
Through this lens, the linear generative model provides the modeling perspective of PCA, complementing the two traditional views of PCA: finding a subspace that (1) maximizes variance or (2) minimizes reconstruction error.
Acknowledgement
Apart from Chapter 12.2 of Bishop (2006), I also gained some intuition from watching Erik Bekkers's video.
The last figure is from a notebook on Google Colab.