Background: Taylor Expansion

Consider a function $f: \mathcal X \rightarrow \mathbb R$ and $x \in \mathcal X$. Let's say we would like to approximate $f(x)$ around a local maximum of $f$. We start by finding the optimum using calculus, i.e. solving $f'(x_0) = 0$; note also that at $x_0$ we have $f''(x_0) < 0$. We then take the Taylor expansion to second order around $x_0$:
$$\begin{aligned} f(x) &\approx f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2 \\ &= f(x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2. \end{aligned}$$

Let's try to approximate $f(x) = e^{\frac{\sin x}{x}}$. Writing $f = e^g$ with $g(x) = \frac{\sin x}{x}$, it suffices to expand the exponent $g$ around its maximum $x_0 = 0$, where $g(x_0) = 1$ (taking the limit) and $g''(x_0) = -1/3$. From the derivation above, we have
$$\begin{aligned} g(x) &\approx g(x_0) + \frac{1}{2} g''(x_0)(x - x_0)^2 \\ &= 1 - \frac{1}{6} x^2, \end{aligned}$$

and therefore $f(x) \approx e^{1 - x^2/6}$.

Fig 1: Approximating a function with Taylor expansion.
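As a quick sanity check, here is a minimal Python sketch (all names are my own) comparing the exponent and the full function against their quadratic approximations; the error grows as we move away from $x_0 = 0$:

```python
import numpy as np

# Minimal numerical check: compare the exponent g(x) = sin(x)/x with its
# quadratic Taylor expansion, and f = e^g with e^{1 - x^2/6}.
def g(x):
    return np.sinc(x / np.pi)  # np.sinc(t) = sin(pi*t)/(pi*t), so this is sin(x)/x

def g_taylor(x):
    return 1.0 - x**2 / 6.0

for x in [0.1, 0.5, 1.0, 2.0]:
    err_g = abs(g(x) - g_taylor(x))
    err_f = abs(np.exp(g(x)) - np.exp(g_taylor(x)))
    print(f"x={x:3.1f}  |g - taylor|={err_g:.2e}  |f - approx|={err_f:.2e}")
```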
Laplace Approximation

In brief, the Laplace (sometimes also called Gaussian) approximation applies the Taylor expansion to $f(w)$ to compute
$$p(w) = \frac{e^{f(w)}}{Z}$$

such that $\int p(w)\, dw = 1$. In other words, $Z = \int_{-\infty}^{\infty} e^{f(w)}\, dw$ is the normalizing constant and $e^{f(w)}$ is the unnormalized density.
Let's assume that we only care about the behaviour of $p(w)$ around the mode $w_\text{MAP}$. Define $\gamma := -f''(w_\text{MAP}) > 0$. From the previous section, we know that
$$e^{f(w)} \approx e^{f(w_\text{MAP})}\, e^{-\frac{1}{2}\gamma (w - w_\text{MAP})^2}.$$

Integrating this Gaussian factor over $w$, we get
$$Z = e^{f(w_\text{MAP})} \sqrt{\frac{2\pi}{\gamma}}.$$

In other words, $p(w) \approx \hat p(w) = \mathcal N(w_\text{MAP}, \sigma^2 = 1/\gamma)$.
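The following sketch checks this recipe numerically on a toy exponent of my own choosing, $f(w) = -\cosh(w)$, which has its mode at $w_\text{MAP} = 0$ with $\gamma = 1$:

```python
import numpy as np
from scipy.integrate import quad

# Toy exponent (my choice, not from the post): f(w) = -cosh(w),
# mode at w_MAP = 0, gamma = -f''(0) = 1.
f = lambda w: -np.cosh(w)
w_map, gamma = 0.0, 1.0

Z_laplace = np.exp(f(w_map)) * np.sqrt(2 * np.pi / gamma)
Z_exact, _ = quad(lambda w: np.exp(f(w)), -np.inf, np.inf)
print(f"Laplace Z = {Z_laplace:.5f}, numerical Z = {Z_exact:.5f}")
# Laplace Z = 0.92214, numerical Z = 0.84250 -- close, but only locally exact.
```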
Application 1: Approximating Posterior Distributions

Consider a dataset $\mathcal D = \{(\mathbf x_i, y_i)\}_{i=1}^n$ with $\mathbf x_i, \mathbf w \in \mathbb R^d$. We want to estimate the posterior
$$p(\mathbf w | \mathcal D) = \frac{p(\mathcal D | \mathbf w)\, p(\mathbf w)}{p(\mathcal D)}.$$

Using the identity $e^{\log a} = a$, we can write
$$p(\mathbf w | \mathcal D) = \frac{1}{Z} e^{-f(\mathbf w)},$$

where $f(\mathbf w) = -(\log p(\mathcal D | \mathbf w) + \log p(\mathbf w))$ and $Z = p(\mathcal D)$. For the sake of completeness, we repeat the same derivation, but now for the $d$-dimensional case. We perform the Taylor expansion of $f(\mathbf w)$ at the mode $\mathbf w_\text{MAP}$, yielding
$$f(\mathbf w) \approx f(\mathbf w_\text{MAP}) + \cancel{\nabla f(\mathbf w_\text{MAP})^T (\mathbf w - \mathbf w_\text{MAP})} + \frac{1}{2} (\mathbf w - \mathbf w_\text{MAP})^T \mathbf H_{\mathbf w_\text{MAP}} (\mathbf w - \mathbf w_\text{MAP}),$$

where the first-order term is zero by construction ($\mathbf w_\text{MAP}$ minimizes $f$) and $\mathbf H_{\mathbf w_\text{MAP}} := \nabla^2 f(\mathbf w_\text{MAP})$ is the Hessian. Therefore, the posterior is
$$p(\mathbf w | \mathcal D) \approx \frac{1}{Z} e^{-f(\mathbf w_\text{MAP})}\, e^{-\frac{1}{2} (\mathbf w - \mathbf w_\text{MAP})^T \mathbf H_{\mathbf w_\text{MAP}} (\mathbf w - \mathbf w_\text{MAP})},$$

which has the form of a Gaussian distribution. If we set
$$Z = p(\mathcal D) = e^{-f(\mathbf w_\text{MAP})} (2\pi)^{d/2} |\mathbf H_{\mathbf w_\text{MAP}}|^{-1/2},$$

we get
$$\hat p(\mathbf w | \mathcal D) = \mathcal N(\mathbf w_\text{MAP}, \mathbf H_{\mathbf w_\text{MAP}}^{-1}).$$

Fig. 2: Two posterior approximations using Laplace approximation.
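To make the $d$-dimensional recipe concrete, here is a minimal sketch on a toy Bayesian logistic regression (the data, the standard normal prior, and all names are my own illustrative assumptions): find $\mathbf w_\text{MAP}$ by numerical optimization, form the Hessian of $f$ at the mode, and invert it to obtain the Laplace covariance.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data for a 2-D Bayesian logistic regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X @ np.array([1.5, -2.0]) + rng.normal(size=50) > 0).astype(float)

def f(w):
    # Negative log joint: -(log p(D|w) + log p(w)), standard normal prior.
    logits = X @ w
    nll = np.sum(np.logaddexp(0.0, logits) - y * logits)
    return nll + 0.5 * w @ w

w_map = minimize(f, x0=np.zeros(2)).x              # mode of the posterior

p = 1.0 / (1.0 + np.exp(-X @ w_map))               # predicted probabilities
H = X.T @ (X * (p * (1 - p))[:, None]) + np.eye(2) # Hessian of f at w_MAP
Sigma = np.linalg.inv(H)                           # Laplace posterior covariance
print("w_MAP =", w_map, "\nSigma =\n", Sigma)
```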
Photon Counter Problem (MacKay (2003), Ex. 27.1)

Let $r \in \mathbb N$ be the number of photons counted; we want to infer their arrival rate $\lambda \in \mathbb R_+$. We assume that $r$ follows a Poisson distribution, hence
$$p(r|\lambda) = e^{-\lambda} \frac{\lambda^r}{r!}.$$

We further assume an improper prior, $p(\lambda) = 1/\lambda$. Define
$$\begin{aligned} f(\lambda) &:= -\log\big( p(r|\lambda)\, p(\lambda) \big) \\ &= \lambda - \log \lambda^r + \log r! - \log \lambda^{-1} \\ &= \lambda - \log \lambda^{r-1} + \log r!. \end{aligned}$$

Setting the derivative to zero, $f'(\lambda) = 1 - \frac{r-1}{\lambda} \overset{!}{=} 0$, yields
$$\lambda_\text{MAP} = r - 1.$$

We also have $f''(\lambda) = \frac{r-1}{\lambda^2}$, thus
$$f''(\lambda_\text{MAP}) = \frac{1}{r-1} = \frac{1}{\lambda_\text{MAP}}.$$

Therefore, the approximate posterior is
$$\hat p(\lambda | r) = \mathcal N(\mu, \sigma^2),$$

where $\mu := \lambda_\text{MAP} = r - 1$ and $\sigma^2 := 1/f''(\lambda_\text{MAP}) = \lambda_\text{MAP}$.
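Since the exact posterior here is available in closed form, $p(\lambda | r) \propto e^{-\lambda} \lambda^{r-1}$, i.e. a $\text{Gamma}(r, 1)$ density, we can compare it directly with the Laplace approximation; below is a small sketch (the value $r = 10$ is an arbitrary choice of mine).

```python
import numpy as np
from scipy import stats

r = 10
# Exact: p(lambda | r) ∝ e^{-lambda} lambda^{r-1}, i.e. Gamma(shape=r, scale=1).
exact = stats.gamma(a=r, scale=1.0)
# Laplace: N(mu = r - 1, sigma^2 = r - 1).
laplace = stats.norm(loc=r - 1, scale=np.sqrt(r - 1))

for lam in [5.0, 9.0, 15.0]:
    print(f"lambda={lam:5.1f}  exact={exact.pdf(lam):.4f}  "
          f"laplace={laplace.pdf(lam):.4f}")
```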
Application 2: Stirling's Approximation

Consider the Euler Gamma function
$$\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\, dx, \quad \forall t > 0.$$

We can see that $\Gamma(t+1) = t\,\Gamma(t)$ (via integration by parts) and $\Gamma(1) = 1$; in other words, this function extends the factorial:
$$t! = \Gamma(t+1).$$

Although we know the formula for the factorial, our goal is to approximate the function in closed form without explicitly computing the factorial. We start by defining a new variable
$$y := \ln x,$$

thus $x = e^y$, $dx = e^y\, dy$, and $y \in \mathbb R$. Integration by substitution yields
$$\begin{aligned} \int_0^\infty x^{t-1} e^{-x}\, dx &= \int_{-\infty}^{\infty} e^{(t-1)y}\, e^{-e^y}\, e^y\, dy \\ &= \int_{-\infty}^{\infty} e^{g(y)}\, dy, \end{aligned}$$

where $g(y) = (t-1)y + y - e^y = ty - e^y$. Computing the first and second derivatives, we get:
- $g'(y) = t - e^y$; setting it to zero yields the maximizer $y_0 := \ln t$.
- $g''(y) = -e^y$; therefore, the second derivative at $y_0$ is $g''(y_0) = -t$, i.e. $\gamma = t$.

Therefore, using the Laplace approximation, we arrive at
$$\begin{aligned} \int_{-\infty}^{\infty} e^{g(y)}\, dy &\approx e^{g(y_0)} \sqrt{\frac{2\pi}{t}} \\ &= e^{t \ln t - t} \sqrt{\frac{2\pi}{t}} \\ &= \sqrt{2\pi}\, e^{-t}\, t^{t - \frac{1}{2}}. \end{aligned}$$

With this approximation, we can compute
$$\begin{aligned} t! &= t\, \Gamma(t) \\ &\approx t \left( \sqrt{2\pi}\, e^{-t}\, t^{t-\frac{1}{2}} \right) \\ &= \sqrt{2\pi t} \left( \frac{t}{e} \right)^t. \end{aligned}$$

Fig. 3: Relative error of the $t!$ approximation.
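Here is a small sketch (the values of $t$ are my arbitrary choices) verifying both the substitution step and the quality of the final approximation numerically:

```python
import math
from scipy.integrate import quad

# Verify: (i) the substituted integral of e^{g(y)} equals Gamma(t) = (t-1)!,
# (ii) the relative error of the closed-form approximation shrinks with t.
def stirling(t):
    return math.sqrt(2 * math.pi * t) * (t / math.e) ** t

t = 5
val, _ = quad(lambda y: math.exp(t * y - math.exp(y)), -20.0, 20.0)
print(f"integral = {val:.4f}, (t-1)! = {math.factorial(t - 1)}")  # both ~24

for t in [1, 2, 5, 10, 50, 100]:
    rel_err = abs(stirling(t) - math.factorial(t)) / math.factorial(t)
    print(f"t={t:4d}  relative error = {rel_err:.4%}")
```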
Conclusion

One obvious limitation of the Laplace approximation, inherited from the Taylor expansion, is that it captures only the local behaviour of the function near the expansion point; in the case of approximating a posterior distribution, this point is the mode of the distribution.
I consulted the following materials while writing this article:

- MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
Figures in this article are made using Google Colab.