Consider a dataset $\mathcal{D} = \{ \mathbf x_i \in \reals^d\}_{i=1}^n$, and let $\mathbf X \in \reals^{d \times n}$ denote the data matrix whose columns are the $\mathbf x_i$. Without loss of generality, we assume that the data has zero mean; thus the covariance matrix is $\Sigma = \frac{1}{n} \mathbf X \mathbf X^T$.

Define $\psi(\cdot)$ to be the whitening operator parameterized by $W$, which is commonly referred to as the whitening matrix; that is, $\psi(\mathbf X) = W \mathbf X$. Let $\hat{\mathbf X} := \psi(\mathbf X)$. The goal of whitening is to decorrelate the data; that is,

$$\hat \Sigma = \frac{1}{n} \hat{\mathbf X} \hat{\mathbf X}^T = W \Sigma W^T = I.$$
Recall that $\Sigma$ is symmetric and positive definite; it thus can be decomposed into

$$\Sigma = U \Lambda U^T,$$
where $U$'s columns are $\Sigma$'s eigenvectors and $\Lambda$ is a diagonal matrix containing the real, positive eigenvalues $\sigma^2_i\ \forall i \in [1,d]$.

Note that $U$ is an orthogonal matrix, i.e., $\mathbf u_i^T \mathbf u_j = 0$ for all $i, j \in [1, d]$ with $i \ne j$, and $\mathbf u_i^T \mathbf u_i = 1$. Considering a single eigenvector $\mathbf u_i$, the decomposition above shows that

$$\Sigma \mathbf u_i = \sigma^2_i \mathbf u_i.$$
In other words, we would like to find $W$ that satisfies $W^T W = \Sigma^{-1}$. We can see that this condition yields the identity (hence diagonal) covariance condition:

$$\hat \Sigma = W \Sigma W^T = W \left( W^T W \right)^{-1} W^T = W W^{-1} (W^T)^{-1} W^T = I.$$
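As a quick numerical sanity check (my own NumPy sketch on a toy Gaussian dataset, not from the original derivation), any $W$ satisfying $W^T W = \Sigma^{-1}$ whitens the data; here we build one such $W$ from the eigendecomposition composed with an arbitrary rotation:

```python
import numpy as np

# Toy centered dataset, assumed for illustration only.
rng = np.random.default_rng(42)
d, n = 3, 10_000
X = rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)           # center the data

Sigma = X @ X.T / n                              # empirical covariance, d x d
eigvals, U = np.linalg.eigh(Sigma)               # Sigma = U diag(eigvals) U^T

# Any W = Q Lambda^{-1/2} U^T with Q orthogonal satisfies W^T W = Sigma^{-1}.
Q, _ = np.linalg.qr(rng.standard_normal((d, d))) # random orthogonal matrix
W = Q @ np.diag(eigvals ** -0.5) @ U.T

print(np.allclose(W.T @ W, np.linalg.inv(Sigma)))        # True
print(np.allclose((W @ X) @ (W @ X).T / n, np.eye(d)))   # True: whitened
```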
## Approach 1: PCA-Whitening

With this fact, the natural choice of the whitening operator is

$$W_\text{PCA} = \Lambda^{-1/2} U^T.$$
Thus, we can see that

$$W_\text{PCA}^T W_\text{PCA} = U \Lambda^{-1/2} \Lambda^{-1/2} U^T = U \Lambda^{-1} U^T = \Sigma^{-1}.$$
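PCA-whitening is a few lines of NumPy. Below is a minimal sketch (toy data and variable names are my own, not from the post) that builds $W_\text{PCA} = \Lambda^{-1/2} U^T$ and checks that the whitened covariance is the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 10_000
X = rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)           # center the data

Sigma = X @ X.T / n                              # covariance, d x d
eigvals, U = np.linalg.eigh(Sigma)               # Sigma = U diag(eigvals) U^T

W_pca = np.diag(eigvals ** -0.5) @ U.T           # W_PCA = Lambda^{-1/2} U^T
X_white = W_pca @ X

Sigma_white = X_white @ X_white.T / n
print(np.allclose(Sigma_white, np.eye(d)))       # True: data is decorrelated
```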
## Approach 2: ZCA-Whitening

However, whitening is not unique: whitened data remains whitened under any orthogonal transform, since if $W$ satisfies $W^T W = \Sigma^{-1}$, then so does $RW$ for any orthogonal matrix $R$. We can therefore impose an additional constraint on $\psi$. In particular, we can rotate the PCA-whitened data back to be close to the original data. This is called Zero-phase Component Analysis (ZCA):

$$W_\text{ZCA} = U W_\text{PCA} = U \Lambda^{-1/2} U^T.$$
In fact, one can see that $W_\text{ZCA} = \Sigma^{-1/2}$, the symmetric inverse square root of $\Sigma$, since $W_\text{ZCA} W_\text{ZCA} = U \Lambda^{-1} U^T = \Sigma^{-1}$.
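The ZCA whitener differs from the PCA one only by the extra rotation $U$. A small NumPy sketch (again on assumed toy data), which also verifies that $W_\text{ZCA}$ squares to $\Sigma^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 10_000
X = rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)

Sigma = X @ X.T / n
eigvals, U = np.linalg.eigh(Sigma)

W_zca = U @ np.diag(eigvals ** -0.5) @ U.T       # rotate PCA-whitened data back by U
X_white = W_zca @ X

# W_ZCA is symmetric and is the inverse square root of Sigma:
print(np.allclose(W_zca, W_zca.T))                         # True
print(np.allclose(W_zca @ W_zca, np.linalg.inv(Sigma)))    # True
print(np.allclose(X_white @ X_white.T / n, np.eye(d)))     # True
```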
## Approach 3: Cholesky Decomposition

For a positive definite matrix $\Sigma$, we know that we can decompose it into

$$\Sigma = L L^T,$$
where $L$ is a lower triangular matrix. Inverting the equation above gives us

$$\Sigma^{-1} = (L L^T)^{-1} = (L^T)^{-1} L^{-1} \overset{(1)}{=} (L^{-1})^T L^{-1},$$
where (1) uses the fact that matrix inverse and transpose commute. Here, it is obvious that we can take

$$W_\text{Chol} = L^{-1}.$$
Because $L$ is lower triangular, so is $L^{-1}$, and applying it, i.e., computing $\hat{\mathbf X} = L^{-1} \mathbf X$, can be done efficiently using forward substitution rather than an explicit matrix inverse.
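A sketch of Cholesky whitening (toy data assumed; SciPy's `solve_triangular` performs the forward substitution, so $L^{-1}$ is never formed explicitly):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(2)
d, n = 3, 10_000
X = rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)

Sigma = X @ X.T / n
L = cholesky(Sigma, lower=True)                  # Sigma = L L^T

# Solve L X_white = X by forward substitution, i.e. X_white = L^{-1} X.
X_white = solve_triangular(L, X, lower=True)

print(np.allclose(X_white @ X_white.T / n, np.eye(d)))   # True: whitened
```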

## Appendix

These are resources that I consulted while writing this blog: