Laplace Approximation

March 06, 2021

Background : Taylor Expansion

Consider a function f:XRf: \mathcal X \rightarrow \reals and xX.\mathbf x \in \mathcal X. Let's say we would to approximate f(x)f(x) around a local maximum value of ff. We start by finding the optimal value using calculus, i.e. f(x0)=0f'(x_0) = 0; note also that at x0x_0 we have f(x0)<0f''(x_0) < 0. We then do the Taylor expansion to the 2nd order around x0x_0:

f(x)f(x0)+f(x0)(xx0)+12f(x0)(xx0)2=f(x0)+12f(x0)(xx0)2.\begin{aligned} f(x) &\approx f(x_0) + f'(x_0) (x - x_0) + \frac{1}{2} f''(x_0) ( x- x_0)^2 \\ &= f(x_0) + \frac{1}{2} f''(x_0) (x- x_0)^2. \end{aligned}

Let's try to approximate f(x)=esinxxf(x) = e^{\frac{\sin x}{x}} . For this function, x0=1x_0 = 1 and f(x0)=1/3f''(x_0) = -1/3. From derivation, we have

f(x)f(x0)+12f(x0)(xx0)2=116(xx0)2\begin{aligned} f(x) &\approx f(x_0) + \frac{1}{2} f''(x_0) (x- x_0)^2 \\ &= 1 - \frac{1}{6} ( x- x_0)^2 \end{aligned}
Fig 1: Approximating a function with Taylor expansion.

Laplace Approximation

In brief, Laplace (sometimes also called Gaussian) approximation applies the Taylor expansion onf(w)f(w) to compute

p(w)=ef(w)Zp(w) = \frac{e^{f(w)}}{Z}

such that p(w)dw=1\int p(w) dw = 1. In other words, Z=ef(w)dwZ= \int_{-\infty}^\infty e^{f(w)} dw and ef(w)e^{f(w)} is a likelihood.

Let's assume that we only care about the behaviour of p(w)p(w) around the mode wMAPw_\text{MAP}. Define γ:=f(x0)\gamma := - f''(x_0). For the previous section, we know that

ef(w)ef(wMAP)e12γ(wwMAP)2\begin{aligned} e^{f(w)} &\approx e^{f(w_\text{MAP})} e^{-\frac{1}{2}\gamma (w - w_\text{MAP})^2 } \end{aligned}

Thus, we get

Z=ef(wMAP)2πγ.Z = e^{f(w_\text{MAP})} \sqrt{\frac{2\pi}\gamma}.

In other word, p(w)p^(w)=N(wMAP,σ2=(1/γ2))p(w) \approx \hat p(w)= \mathcal N(w_\text{MAP}, \sigma^2= (1/\gamma^2)).

Application 1: Approximating Posterior Distributions

Consider a dataset D={(xi,yi)}i=1n\mathcal D = \{ ( \mathbf x_i, y_i)\}_{i=1}^n and xi,wRd\mathbf x_i, \mathbf w \in \reals^d. We want to estimate estimate the posterior

p(wD)=p(Dw)p(w)p(D).p(\mathbf w | \mathcal D) = \frac{ p(\mathcal D | \mathbf w) p(\mathbf w)}{p(\mathcal D)} .

Using the identity eloga=ae^{\log a} =a, we can write

p(wD)=1Zef(w),p(\mathbf w | \mathcal D) = \frac{1}{Z} e^{- f(\mathbf w)},

where f(w)=(logp(Dw)+logp(w))f(\mathbf w) =- (\log p(\mathcal D | \mathbf w) + \log p(\mathbf w) ) and Z=p(D)Z = p(\mathcal D). For the sake of completeness, we repeat the same derivation again but for the dd dimensional case. We perform the Taylor explanation on f(w)f (\mathbf w) at the mode wMAP\mathbf w_\text{MAP} , yielding

f(w)f(wMAP)+f(wMAP)T(wwMAP)+12(wwMAP)THwMAP(wwMAP),\begin{aligned} f(\mathbf w) \approx f (\mathbf w_\text{MAP}) + \cancel{\nabla f(\mathbf w_\text{MAP})^T ( \mathbf w - \mathbf w_\text{MAP})} + \frac{1}{2} (\mathbf w - \mathbf w_\text{MAP})^T \mathbf H_{\mathbf w_\text{MAP}} (\mathbf w - \mathbf w_\text{MAP}), \end{aligned}

where the first order is zero by construction and HwMAP:=2f(wMAP)\mathbf H_{\mathbf w_\text{MAP}} := \nabla^2 f(\mathbf w_\text{MAP}). Therefore, the posterior is

p^(wD)1Zef(wMAP)e12(wwMAP)THwMAP(wwMAP),\hat p(\mathbf w | \mathcal D) \approx \frac{1}{Z} e^{f(\mathbf w_\text{MAP})} e^{ -\frac{1}{2} (\mathbf w - \mathbf w_\text{MAP})^T \mathbf H_{\mathbf w_\text{MAP}}(\mathbf w - \mathbf w_\text{MAP}) },

which is in the form of a Gaussian distribution. If we set

Z=p(D)=ef(wMAP)(2π)d/2HwMAP1/2,Z = p(\mathcal D) = e^{-f(\mathbf w_\text{MAP})} (2\pi)^{d/2}| \mathbf H_{\mathbf w_\text{MAP}} |^{-1/2},

we get

p^(wD)=N(wMAP,HwMAP1).\hat p(\mathbf w | \mathcal D) = \mathcal N( \mathbf w_\text{MAP}, \mathbf H_{\mathbf w_\text{MAP}}^{-1}).
Fig. 2: Two posterior approximations using Laplace approximation.

Proton Counter Problem (Mackay (2013), Ex. 27.1)

Define rNr \in \mathbb N the number of protons measured and we want to infer their arrival rate λR+\lambda \in \reals_+. We assume that rr follows a Poisson distribution, hence

p(rλ)=eλλrr!.p(r|\lambda) = e^{-\lambda} \frac{\lambda^r}{r!}.

We further assume that we have an improper prior, p(λ)=1/λ.p(\lambda) = 1/\lambda. Define

f(λ):=logp(rλ)p(λ)=λlogλr+logr!logλ1.=λlogλr1+logr!.\begin{aligned} f(\lambda) &: = -\log p(r|\lambda)p(\lambda) \\ &= \lambda - \log \lambda^r + \log r! - \log \lambda^{-1}. \\ &= \lambda - \log \lambda^{r-1} + \log r!. \end{aligned}

Taking derivate f(λ)=!0f'(\lambda) \overset{!}{=} 0 yields

λMAP=r1.\lambda_\text{MAP} = r-1.

We also have f(λ)=r1λ2f''(\lambda) = \frac{r-1}{\lambda^2}, thus

f(λMAP)=1r1=1λMAP.f''(\lambda_\text{MAP}) = \frac{1}{r-1} = \frac{1}{\lambda_\text{MAP}}.

Therefore, the approximated posterior is

p^(λr)=N(μ,σ2),\hat p(\lambda | r) = \mathcal N(\mu, \sigma^2),

where μ:=λMAP=r1\mu := \lambda_\text{MAP} = r-1 and σ2:=1/f(λMAP)=λMAP2\sigma^2 := 1/f''(\lambda_\text{MAP}) = \lambda_\text{MAP}^2.

Application 2: Stirling’s formula

Consider the Euler Gamma function

Γ(t)=0xt1exdx,\Gamma(t) = \int_{0}^{\infty} x^{t-1} e^{-x} dx,

t>0\forall t > 0. We can see that Γ(t+1)=tΓ(t)\Gamma(t+1) = t\Gamma(t) and Γ(1)=1\Gamma(1) = 1; in other words, this function is the factorial function

t!=Γ(t+1).t!= \Gamma(t+1).

Although we know the formula for the factorial, our goal is to approximate the function with a closed form without explicitly computing the factorial. We start by defining a new variable

y:=lnx,\begin{aligned} y := -\ln x, \end{aligned}

thus x=eyx = e^y , dx=eydydx = e^y dy, and yRy \in \reals. Using integration by substitution yields

0xt1exdx=e(t1)yeeyeydy=eg(y)dy,\begin{aligned} \int_0^\infty x^{t-1} e^{-x} dx &= \int_{-\infty}^{\infty} e^{(t-1)y} e^{-e^y} e^y dy \\ &= \int_{-\infty}^{\infty} e^{g(y)}dy, \end{aligned}

where g(y)=(t1)y+yey=tyeyg(y) = (t-1)y + y - e^y = ty - e^y. Computing the first and second derivatives, we get

  • g(y)=teyg'(y) = t - e^y; setting it to zero yields the maximum value y0:=lnty_0 := \ln t.
  • g(y)=eyg''(y) = e^y ; therefore, the second derivative at y0y_0 is g(y0)=tg''(y_0) = t.

Therefore, using the Laplace approximation, we arrive at

eg(y)dyeg(y0)2πt=e(tlntt)2πt=2πettt12.\begin{aligned} \int_{-\infty}^{\infty} e^{g(y)} dy & \approx e^{g(y_0)} \sqrt {\frac{2\pi}{t} } \\ &= e^{(t \ln t - t)} \sqrt {\frac{2\pi}{t} }\\ &= \sqrt{2\pi} e^{-t} t^{t-\frac{1}{2}}. \end{aligned}

With this approximation, we can compute

t!=tΓ(t)t(2πettt12)=2πt(te)t.\begin{aligned} t! &= t \Gamma(t) \\ &\approx t\bigg(\sqrt{2\pi} e^{-t} t^{t-\frac{1}{2}}\bigg) \\ &= \sqrt{2\pi t} \bigg( \frac{t}{e}\bigg)^t. \end{aligned}
Fig. 3: Relative error of t! approximation


One obvious limitation of the Laplace approximation is inherit from the Taylor expansion that it captures only the local behaviour of the function near the root point: in the case of approximating the posterior distribution, this root point is the mode of the distribution.

I consulted the following materials while writing this article:

Figures in this article are made using Google Colab.