Modeling Uncertainty in Neural Networks

This article is a personal summary of Y. Gal's MLSS 2019 Moscow, Bayesian Deep Learning (Slide Deck 2).


Consider a dataset D={(xi,yi)}i=1n\mathcal D = \{ (\mathbf x_i, y_i)\}_{i=1}^n where xXRd\mathbf x \in \mathcal X \subseteq \reals^d and yYy \in \mathcal Y. Consider the underlying data generation process f:XYf : \mathcal X \rightarrow \mathcal Y has the following form

f(x)=wTϕ(x),f (\mathbf x) = \mathbf w^T \phi(\mathbf x),

where ϕ:XRh\phi: \mathcal X \rightarrow \reals^h is a feature extractor that is assumed to be given. Consider regression problems, i.e. YR\mathcal Y \subseteq \Reals, we assume that the observed target is corrupted by noise ϵ\epsilon . That is

y=f(x)+ϵ,y = f(\mathbf x) + \epsilon,

where ϵN(0,σ2)\epsilon \sim \mathcal N(0, \sigma^2). In other words, p(yx,w)=N(wTϕ(x),σ2)p(y|\mathbf x, \mathbf w) = \mathcal {N} (\mathbf w^T\phi(\mathbf x), \sigma^2). Let define X=(x1,,xn)T\mathbf X = (\mathbf x_1, \dots, \mathbf x_n)^T and y=(y1,,yn)T\mathbf y = (y_1, \dots, y_n)^T. Using Bayes' rule, we have

p(wD)p(yX,w)p(w)(i=1np(yixi,w)p(w).\begin{aligned} p(\mathbf w | \mathcal D ) & \propto p( \mathbf y | \mathbf X, \mathbf w) p(\mathbf w) \\ & \propto \bigg(\prod_{i=1}^n p(y_i| \mathbf x_i, \mathbf w \bigg) p(\mathbf w). \end{aligned}

If we choose p(w)=N(0,σw2Ih)p(\mathbf w) = \mathcal N(0, \sigma_w^2I_h). This setup leads to ridge regression, and the predictive distribution has a Gaussian form.

Let's consider a KK-category classification problem, i.e. Y={0,1}K\mathcal Y = \{0, 1\}^K and assume the underlying data generation process is

fW(x)=softmax(Wϕ(x)),f_W(\mathbf x) = \text{softmax}(W \phi(\mathbf x)),

where WRK×hW \in \reals^{K \times h} and softmax:RK[0,1]K\text{softmax}: \reals^K \rightarrow [0, 1]^K whose exact form is

(softmax(a))k=exp(fWk(x))k=1Kexp(fWk(x)).(\text{softmax}(\mathbf a))_k = \frac{\exp( f_W^{k}(\mathbf x))}{\sum_{k'=1}^K \exp(f_W^{k'}(\mathbf x))}.

Define Y=(y1,,yn)T\mathbf Y = (\mathbf y_1, \dots, \mathbf y_n)^T. Therefore, we can write the likelihood as

p(YX,w)=i=1nyiTfW(x).p(\mathbf Y| \mathbf X,\mathbf w) = \prod_{i=1}^n \mathbf y_i^T f_W(\mathbf x).

However, in this classification setting and the assumption of the prior remains Gaussian, the posterior cannot be found because the evidence becomes intractable:

p(YX)=p(YX,w)p(w)dw=(i=1nyiTfW(xi))p(w)dw.\begin{aligned} p(\mathbf Y| \mathbf X) &= \int p(\mathbf Y| \mathbf X,\mathbf w) p(\mathbf w) d\mathbf w \\ &= \int \bigg( \prod_{i=1}^n \mathbf y_i^T f_W(\mathbf x_i) \bigg) p(\mathbf w) d\mathbf w. \end{aligned}

In other words, the category distribution (the output of the softmax function) is not conjugate with the Gaussian prior.

Approximating the Posterior : Variational Inference

Instead of computing p(WD)p(W|\mathcal D) directly, we approxiate it using a variational distribution qθ(W)q_\theta(W). Typically, we use a simple distribution like Gaussian for qθq_\theta; in this case, θ={μVI,ΣVI}\theta = \{ \mu_\text{VI}, \Sigma_\text{VI}\}. We can then find θ\theta by minimizing the reverse Kullback-Leibler divergence

minθKL(qθ(W)p(WX,Y)).\min_\theta \text{KL}(q_\theta(W) \| p(W|\mathbf X, \mathbf Y)).

Expanding the KL, we get

KL(qθ(W)p(WX,Y))=qθ(W)logqθ(W)p(WX,Y)dW=qθ(W)logqθ(W)p(YX)p(YX,W)p(W)dW=logp(YX)+KL(qθ(W)p(W))Eqθ(W)[logp(YX,W)].\begin{aligned} \text{KL}(q_\theta(W) \| p(W|\mathbf X, \mathbf Y)) & = \int q_\theta(W) \log \frac{q_\theta(W)}{p(W | \mathbf X, \mathbf Y)} d W \\ & = \int q_\theta(W) \log \frac{q_\theta(W)p(\mathbf Y | \mathbf X)}{p(\mathbf Y| \mathbf X, W)p(W)} d W \\ &= \log p(\mathbf Y | \mathbf X) + \text{KL}(q_\theta(W) \| p(W)) - \mathbb{E}_{q_\theta(W)}\bigg[ \log p(\mathbf Y | \mathbf X, W) \bigg]. \end{aligned}

Thus, we have

logp(YX)=Eqθ(W)[logp(YX,W)]KL(qθ(W)p(W))+KL(qθ(W)p(WX,Y))Eqθ(W)[logp(YX,W)]KL(qθ(W)p(W)),LELBO\begin{aligned} \log p(\mathbf Y | \mathbf X) &= \mathbb{E}_{q_\theta(W)}\bigg[ \log p(\mathbf Y | \mathbf X, W) \bigg] - \text{KL}(q_\theta(W) \| p(W)) + \text{KL}(q_\theta(W) \| p(W|\mathbf X, \mathbf Y)) \\ &\ge \underbrace{ \mathbb{E}_{q_\theta(W)}\bigg[ \log p(\mathbf Y | \mathbf X, W) \bigg] - \text{KL}(q_\theta(W) \| p(W)),}_{\mathcal L_\text{ELBO}} \end{aligned}

where the first term corresponds to how we predict the data, and the second term is how well our approximated distribution aligns with the prior.

Stochastic Approximate Inference

Let's consider the individual likelihood in classification settings; the log likelihood is

logp(y=ckx,W)=fWk(x)log(kexp(fWk(x))).\log p(y=c_k| \mathbf x, W) = f_W^{k}(\mathbf x) - \log \bigg(\sum_{k'} \exp(f_W^{k'}(\mathbf x))\bigg).

Then, we have

LELBO=(1ni=1n[fWk(xi)yilog(kexp(fWk(xi)))]qθ(W)dW)KL(qθ(W)p(W)),\begin{aligned} \mathcal L_\text{ELBO} = \bigg(\frac{1}{n} \sum_{i=1}^n \int \bigg[ f_W^{k}(\mathbf x_i)_{y_i} - \log \bigg(\sum_{k'} \exp(f_W^{k'}(\mathbf x_i))\bigg)\bigg] q_\theta(W) dW \bigg) - \text{KL}(q_\theta(W) \| p(W)), \end{aligned}

which is not integratable.

Detour: Monte Carlo Estimation and Re-parameterization Trick

Consider a density model p(x)p(x) and a function of interest f:XRf: \mathcal X \rightarrow \reals. We assume that xp(x)x \sim p(x) can be done easily and assume that E[f(X)]\mathbb E[ f(X)] is difficult to compute (no analytical solution). Instead, one can sample xip(x)x_i \sim p(x) for i=[1,n]i = [1, n] and compute

E^[f(X)]=1ni=1nf(xi).\mathbb{\hat{E}} [ f(X)] = \frac{1}{n} \sum_{i=1}^n f(x_i).

It can be shown that E^[f(X)]\mathbb{\hat{E}} [ f(X)] is an unbiased estimator, i.e.

limnE^[f(X)]=E[f(X)].\lim_{n \rightarrow \infty} \mathbb{\hat{E}} [ f(X)] = \mathbb{{E}} [ f(X)].

Let's assume wN(μ,σ2)w \sim \mathcal N(\mu, \sigma^2) and consider the following function

L(μ,σ)=(w+w2)1σ2πexp((wμ)22σ2)dw,\mathcal L(\mu, \sigma) = \int (w+w^2) \frac{1}{\sigma \sqrt{2 \pi } } \exp\bigg( - \frac{(w - \mu)^2}{2\sigma^2}\bigg) dw,

which has a closed-form solution (Link to derivation):

L(μ,σ)=σ2+μ+μ2.\mathcal L(\mu, \sigma) = \sigma^2 + \mu + \mu^2.

We can see that Lμ=1+2μ\frac{\partial \mathcal L }{\partial \mu } = 1 + 2 \mu and Lσ=2σ\frac{\partial \mathcal L }{\partial \sigma } = 2 \sigma.

However, if we take w^N(μ,σ2)\hat w \sim \mathcal N(\mu, \sigma^2) and compute the derivatives, we get μ(w^+w^2)=σ(w^+w^2)=0\frac{ \partial }{\partial \mu } (\hat w + \hat w^2) = \frac{ \partial }{\partial \sigma } (\hat w + \hat w^2) = 0. This is because the dependency of μ\mu and σ\sigma is via ww.

Instead, we can re-parameterize w^\hat w using

w^=μ+σϵ,\hat w = \mu + \sigma \epsilon,

where ϵN(0,1)\epsilon \sim \mathcal N(0, 1). Thus, we have

L^(μ,σ)=w^+w^2=(μ+σϵ)+(μ+σϵ)2.\begin{aligned} \mathcal{ \hat L}(\mu, \sigma) &= \hat w + \hat w^2 \\ &= (\mu + \sigma \epsilon) + (\mu + \sigma \epsilon)^2. \end{aligned}

We can see that

Ep(ϵ)[L^μ]=1+2(μ+σEp(ϵ)[ϵ])=1+2μEp(ϵ)[L^σ]=Ep(ϵ)[ϵ+2(μ+σϵ)(ϵ)]=2μEp(ϵ)[ϵ]+2σEp(ϵ)[ϵ2]=1=2σ,\begin{aligned} \mathbb{E}_{p(\epsilon)}\bigg [ \frac{\partial \hat \mathcal L }{\partial \mu } \bigg] &= 1 + 2(\mu + \cancel{\sigma\mathbb{E}_{p(\epsilon)} [\epsilon])} \\ &= 1 + 2\mu \\ \mathbb{E}_{p(\epsilon)}\bigg [ \frac{\partial \hat \mathcal L }{\partial \sigma } \bigg] &= \mathbb{E}_{p(\epsilon)} [\epsilon + 2 (\mu + \sigma \epsilon) (\epsilon) ] \\ &= \cancel{2\mu \mathbb{E}_{p(\epsilon)} [\epsilon ] }+ 2 \sigma \underbrace{\mathbb{E}_{p(\epsilon)} [\epsilon^2 ]}_{=1} \\ &= 2 \sigma, \end{aligned}

which are unbiased estimators of what we have derived previously. This trick is also known as pathwise derivative estimator, infinitesimal perturbation analysis, and stochastic backpropagation.

Approximating the Posterior with Stochastic Variational Inference

With the re-parameterization trick, we can then construct an objective function to learn qθ(W)q_\theta(W) with θ={μVI,ΣVI}\theta = \{ \mu_\text{VI}, \Sigma_\text{VI} \}. We assume ϵN(0,IKh)\epsilon \sim \mathcal {N}(0, I_{Kh}) and qθ=N(μVI,ΣVI)q_\theta = \mathcal N(\mu_\text{VI}, \Sigma_\text{VI}) where

  • μRKh\mu \in \reals^{Kh} and
  • ΣVI=diag(σVI12,,σVIKh2)RKh×Kh\Sigma_\text{VI} = \text{diag}(\sigma_{\text{VI}_1}^2, \dots, \sigma_{\text{VI}_{Kh}}^2) \in \reals^{Kh\times Kh}.

In other words, we have W^=μVI+ΣVI1/2ϵ\hat W = \mu_\text{VI} + \Sigma_{\text{VI}}^{1/2} \epsilon. Therefore, our learning objective is

L^ELBO=(1ni=1n[fWk,ϵ(xi)yilog(kexp(fWk,ϵ(xi)))]qθ(W)dW)KL(qθ(W)p(W)),\begin{aligned} \mathcal{\hat{L}}_\text{ELBO} = \bigg(\frac{1}{n} \sum_{i=1}^n \int \bigg[ f_W^{k, \epsilon}(\mathbf x_i)_{y_i} - \log \bigg(\sum_{k'} \exp(f_W^{k',\epsilon}(\mathbf x_i))\bigg)\bigg] q_\theta(W) dW \bigg) - \text{KL}(q_\theta(W) \| p(W)), \end{aligned}

whose Ep(ϵ)[L^ELBO]=LELBO\mathbb {E}_{p(\epsilon)}[ \mathcal{\hat{L}}_\text{ELBO} ] = \mathcal{{L}}_\text{ELBO} and Ep(ϵ)[θL^ELBO]=θLELBO\mathbb {E}_{p(\epsilon)}[ \nabla_\theta \mathcal{\hat{L}}_\text{ELBO} ] = \nabla_\theta \mathcal{{L}}_\text{ELBO} .

Uncertainty in Classification

For a multinomial distribution with KK classes whose mass is pckp_{c_k} for k[1,K]k \in [1, K], the uncertainty of the distribution is indicated by the entropy HH

H({pck}k)=k=1Kpcklogpck.H(\{p_{c_k}\}_k) = - \sum_{k=1}^K p_{c_k} \log p_{c_k}.

One observation is that the entropy is highest when pc=pcc,c[1,K]p_c = p_{c'} \forall c, c' \in [1, K]. In other words, this is the case when we have a uniform distribution over KK classes, indicating absolute ambiguity in the prediction. Noting that, here we use the natural logarithm; hence, HH is measured the unit of nats.

Because the output of neural networks for classification is a parameterized multinomial distribution, we can use the entropy as a measure of uncertainty, this is known as predictive entropy

Hp(yx,D)[Y]=k=1Kp(y=ckx,D)logp(y=ckx,D),H_{p(y_*|\mathbf x_*, \mathcal D)}[Y_*] = -\sum_{k=1}^K p(y_*=c_k|\mathbf x_*, \mathcal D) \log p(y_*=c_k|\mathbf x_*, \mathcal D),

where p(y=ckx,D)p(y_*=c_k|\mathbf x_*, \mathcal D) can be approximated using Monte Carlo. More precisely, let W^tqθ(W)\hat W_t \sim q_\theta(W) for t=[1,T]t=[1, T], we have

p(y=ckx,D)1Tt=1Tsoftmax(fW^t(x))k.p(y_*=c_k|\mathbf x_*, \mathcal D) \approx \frac{1}{T} \sum_{t=1}^T \text{softmax}(f_{\hat W_t}(\mathbf x_*)) _k.

Because the ambiguity comes from two sources: 1) noisy measurements and 2) model parameters, the predictive entropy captures both aleatoric and epistemic uncertainties. To compute the epistemic uncertainty, we can look at the mutual information between the predicted label and the weights

I[Y;W]=Hp(yx,D)[Y]Ep(WD)[Hp(yx,W)[Y]],I[Y_* ; W] = H_{p(y_*| \mathbf x_*, \mathcal D)}[Y_*] - { \color{blue} \mathbb E_{p(W|\mathcal D)}\big[ H_{p(y_*| \mathbf x_*, W)}[Y_*]\big] } ,

where the first term is the predictive entropy and the second term is the average uncertainty of the prediction w.r.t the posterior. In other words, the second term captures the epistemic uncertainty, and it can be computed by

Ep(WD)[Hp(yx,W)[Y]]=p(WD)Hp(yx,W)[Y]dW1Tt=1THp(yx,W^t)[Y]1Tt=1Tk=1Ksoftmax(fW^t(x))klogsoftmax(fW^t(x))k\begin{aligned} \mathbb E_{p(W|\mathcal D)}\big[ H_{p(y_*| \mathbf x_*, W)}[Y_*]\big] &= \int p(W|\mathcal D) H_{p(y_*| \mathbf x_*, W)}[Y_*] dW \\ &\approx \frac{1}{T} \sum_{t=1}^T H_{p(y_*| \mathbf x_*, \hat W_t)}[Y_*] \\ &\approx - \frac{1}{T} \sum_{t=1}^T \sum_{k=1}^K \text{softmax}(f_{\hat W_t}(\mathbf x_*)) _k \log \text{softmax}(f_{\hat W_t}(\mathbf x_*)) _k \\ \end{aligned}

where W^tqθ(W)\hat W_t \sim q_\theta(W). Moreover, because entropy is non-negative, we have the following condition

0I[Y;W]H[Y].0 \le I[Y_*; W] \le H[Y_*].

Stochastic Approximate Inference in Neural Networks

From the previous example, we only consider the posterior for the last layer; however, we can use the variational approach to relax this constraint by considering the following model

f(x)=W2σ(W1x+b¹)+b2,f(\mathbf x) = W^2\sigma(W^1 \mathbf x + b¹) + b^2,

where W1Rh×d,W2RK×hW^1 \in \reals^{h \times d}, W^2 \in \reals^{K \times h}, b1Rh,b2RKb^1 \in \reals^h, b^2 \in \reals^K . We then assume that wij1,wij2N(0,1)w_{ij}^1, w_{ij}^2 \sim \mathcal N(0, 1). Define W={W1,W2}\mathbf W = \{ W^{1}, W^{2}\}. We thus have

KL(qθp)=qθ(W)logqθ(W)p(W)dW=(1)qθW1(W1)qθW2(W2)logqθW1(W1)qθW2(W2)p(W1)p(W2)dW1dW2=qθW1(W1)(qθW2(W2)[logqθW1(W1)p(W1)+logqθW2(W2)p(W2)]dW2)dW1=qθW1(W1)[logqθW1(W1)p(W1)+KL(qθW2(W2)p(W2))]dW1=KL(qθW1(W1)p(W1))+KL(qθW2(W2)p(W2)),\begin{aligned} \text{KL}(q_\theta \| p) &= \int q_\theta(\mathbf W) \log \frac{q_\theta(\mathbf W)}{p(\mathbf W)} d\mathbf W \\ &\overset{(1)}{=} \int q_{\theta_{W^1}}(W^1) q_{\theta_{W^2}}(W^2) \log \frac{q_{\theta_{W^1}}(W^1) q_{\theta_{W^2}}(W^2)}{p(W^1) p(W^2)} d W^1 d W^2 \\ &= \int q_{\theta_{W^1}}(W^1) \bigg( \int q_{\theta_{W^2}}(W^2) \bigg[ \log \frac{q_{\theta_{W^1}}(W^1) }{p(W^1)} + \log \frac{q_{\theta_{W^2}}(W^2) }{p(W^2)} \bigg] d W^2 \bigg)dW^1 \\ &= \int q_{\theta_{W^1}}(W^1) \bigg[\log \frac{q_{\theta_{W^1}}(W^1) }{p(W^1)} +\text{KL}( q_{\theta_{W^2}}(W^2) \| p(W^2)) \bigg] d W^1 \\ &= \text{KL}( q_{\theta_{W^1}}(W^1) \| p(W^1)) +\text{KL}( q_{\theta_{W^2}}(W^2) \| p(W^2)), \end{aligned}

where (1) we assume that W1W^1 and W2W^2 are independent; this is known as mean-field variational inference. However, we see that finding qθ(W)q_\theta(\mathbf W) requires a doubled number of parameters (each ww's distribution is mean and variance); hence, the approach is not suitable for big models.

Fig. 1: Uncertainty from Bayesian Logistic Regression and (1 hidden layer) Neural Network; PyMC3's ADVI is used for the inference.

Dropout as Approximation Inference

Dropout is one of the regularization technique used in deep learning. In standard settings, during training, we randomly set activations in the network with probability ρ\rho, i.e. the dropout activity is Ber(ρ)\text{Ber}(\rho). Then, during test time, we turn off the dropout activity and multiply each neuron with 11ρ\frac{1}{1 -\rho} to compensate the dropout activity. These two steps are called stochastic and deterministic forward passes respectively.

Let's take a closer look into this using the previous model. Define ϵij1Ber(1ρ1),ϵij2Ber(1ρ2)\epsilon^1_{ij} \sim \text{Ber}(1-\rho^1), \epsilon^2_{ij} \sim \text{Ber}(1-\rho^2) and the parameters are M1Rh×d,M2RK×h,b1Rd,b2RKM^1 \in \reals^{h \times d}, M^2 \in \reals^{K \times h}, b^1 \in \reals^d, b^2 \in \reals^K, We have

f(x)=M2[ϵ2σ(M1(ϵ1x)+b¹)]+b2.f(\mathbf x) = M^2 \bigg[{ \epsilon^2}\sigma(M^1 (\epsilon^1 \mathbf x ) + b¹) \bigg] + b^2.

We can see that we can write W^1=M1ϵ1\hat W^{1} = M^1 \epsilon^1 and W^2=M2ϵ2\hat W^{2} = M^2 \epsilon^2. Gal (2016) shows that the KL term can be approximated

KL(qθp)1ρ12σw2M122+hd(ρ1)+1ρ22σw2M222+Kh(ρ2)+C,\text{KL}(q_\theta \| p) \approx \frac{1-\rho^1}{2\sigma_w^2}\| M^1 \|_2^2 + hd(\rho^1) + \frac{1-\rho^2}{2\sigma_w^2}\| M^2 \|_2^2 + Kh(\rho^2) + C,

where CC is a constant. Define ω^ϵ={W^1,W^2}\hat \omega_\epsilon = \{\hat W^1, \hat W^2\}. Therefore, the loss function is

LELBO-Dropout=1nilogp(yixi,ω^ϵ)1ρ12σw2M1221ρ22σw2M222.\mathcal L_\text{ELBO-Dropout} {=} \frac{1}{n} \sum_{i} \log p(\mathbf y_i| \mathbf x_i, \hat \omega_\epsilon) - \frac{1-\rho^1}{2\sigma_w^2}\| M^1 \|_2^2 - \frac{1-\rho^2}{2\sigma_w^2}\| M^2 \|_2^2.

Here, it should be noted that ρ1\rho^1 and ρ2\rho^2 also have to chosen properly; for example, one might consider using cross-validation.


Prediction is only part of the whole story. In the real world, we also need to know when our predictive models are uncertain; this is critical in high stake applications, such as automous vehicles and heathcare. Recent development in variational inference and sampling methods allow us to approximate the posterior, hence enabling extracting uncertainty from the model.

Figures are maded from Google Colab.