Consider a dataset $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^n$ where $x_i\in\mathcal{X}\subseteq\mathbb{R}^d$ and $y_i\in\mathcal{Y}$. Assume the underlying data-generating process $f:\mathcal{X}\to\mathcal{Y}$ has the form
$$f(x)=w^T\phi(x),$$
where $\phi:\mathcal{X}\to\mathbb{R}^h$ is a feature extractor that is assumed to be given. For regression problems, i.e. $\mathcal{Y}\subseteq\mathbb{R}$, we assume that the observed target is corrupted by noise $\epsilon$. That is,
y=f(x)+ϵ,
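where $\epsilon\sim\mathcal{N}(0,\sigma^2)$. In other words, $p(y\mid x,w)=\mathcal{N}(w^T\phi(x),\sigma^2)$. Let us define $X=(x_1,\dots,x_n)^T$ and $y=(y_1,\dots,y_n)^T$. Using Bayes' rule, we have
$$p(w\mid\mathcal{D})\propto p(y\mid X,w)\,p(w)=\Big(\prod_{i=1}^n p(y_i\mid x_i,w)\Big)\,p(w).$$
If we choose $p(w)=\mathcal{N}(0,\sigma_w^2 I_h)$, the posterior is Gaussian, its mode coincides with the ridge regression solution, and the predictive distribution is also Gaussian.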
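The Gaussian case can be checked numerically. The following is a minimal NumPy sketch, assuming a known noise variance and synthetic data; the helper names `blr_posterior` and `blr_predict` and all numeric values are illustrative and not part of the original setup.

```python
import numpy as np

def blr_posterior(Phi, y, sigma2, sigma_w2):
    """Posterior N(mean, cov) over w for y = Phi @ w + Gaussian noise."""
    h = Phi.shape[1]
    # Posterior precision: Phi^T Phi / sigma^2 + I / sigma_w^2
    precision = Phi.T @ Phi / sigma2 + np.eye(h) / sigma_w2
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ y / sigma2
    return mean, cov

def blr_predict(phi_star, mean, cov, sigma2):
    """Mean and variance of the Gaussian predictive at one feature vector."""
    pred_mean = phi_star @ mean
    # Parameter uncertainty plus observation noise
    pred_var = phi_star @ cov @ phi_star + sigma2
    return pred_mean, pred_var

# Synthetic data (assumed values, for illustration only)
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma2, sigma_w2 = 0.25, 4.0
y = Phi @ w_true + rng.normal(scale=np.sqrt(sigma2), size=50)

mean, cov = blr_posterior(Phi, y, sigma2, sigma_w2)

# Ridge regression with penalty lambda = sigma^2 / sigma_w^2
lam = sigma2 / sigma_w2
ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ y)

pred_mean, pred_var = blr_predict(Phi[0], mean, cov, sigma2)
```

Under these assumptions the posterior mean and the ridge solution agree, and the predictive variance is never smaller than the noise variance.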
Now consider a $K$-category classification problem, i.e. $y\in\mathcal{Y}=\{0,1\}^K$ is a one-hot vector, and assume the underlying data-generating process is
$$f_W(x)=\mathrm{softmax}(W\phi(x)),$$
where $W\in\mathbb{R}^{K\times h}$ and $\mathrm{softmax}:\mathbb{R}^K\to[0,1]^K$ is defined by
$$(\mathrm{softmax}(a))_k=\frac{\exp(a_k)}{\sum_{k'=1}^K\exp(a_{k'})}.$$
Define $Y=(y_1,\dots,y_n)^T$. We can then write the likelihood as
$$p(Y\mid X,W)=\prod_{i=1}^n y_i^T f_W(x_i).$$
However, in this classification setting with a Gaussian prior on $W$, the posterior cannot be computed in closed form because the evidence becomes intractable:
$$p(Y\mid X)=\int p(Y\mid X,W)\,p(W)\,dW.$$
In other words, the categorical likelihood (the output of the softmax function) is not conjugate to the Gaussian prior.
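Although the integral over the weights is intractable, the softmax likelihood is cheap to evaluate for any fixed weight matrix. Here is a small NumPy sketch under assumed synthetic data; `softmax` and `log_likelihood` are illustrative helper names.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax along the last axis."""
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def log_likelihood(W, Phi, Y):
    """log p(Y | X, W) for one-hot labels Y and features Phi."""
    probs = softmax(Phi @ W.T)               # shape (n, K)
    # y_i^T f_W(x_i) picks the probability of the observed class
    return np.sum(np.log(np.sum(Y * probs, axis=1)))

# Synthetic features and labels (assumed, for illustration only)
rng = np.random.default_rng(0)
n, h, K = 10, 3, 4
Phi = rng.normal(size=(n, h))
Y = np.eye(K)[rng.integers(0, K, size=n)]    # one-hot rows

# With W = 0 every class has probability 1/K
ll_zero = log_likelihood(np.zeros((K, h)), Phi, Y)
```

With all weights at zero the model predicts a uniform distribution, so the log-likelihood equals $-n\log K$.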
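Approximating the Posterior: Variational Inference
Instead of computing $p(W\mid\mathcal{D})$ directly, we approximate it using a variational distribution $q_\theta(W)$. Typically, we use a simple distribution such as a Gaussian for $q_\theta$; in this case, $\theta=\{\mu_{VI},\Sigma_{VI}\}$. We can then find $\theta$ by minimizing the reverse Kullback-Leibler divergence
$$\mathrm{KL}\big(q_\theta(W)\,\|\,p(W\mid\mathcal{D})\big)=\mathbb{E}_{q_\theta(W)}\!\left[\log\frac{q_\theta(W)}{p(W\mid\mathcal{D})}\right].$$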
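The reverse KL divergence between $q_\theta(W)$ and the posterior cannot be evaluated directly because it contains the unknown posterior, but it can be rearranged into the evidence lower bound (ELBO). The derivation below is the standard one, written in this post's notation; the symbol $L_{ELBO}$ is the name used later in the text.

```latex
\begin{aligned}
\mathrm{KL}\big(q_\theta(W)\,\|\,p(W\mid\mathcal{D})\big)
&= \mathbb{E}_{q_\theta}\big[\log q_\theta(W) - \log p(W\mid\mathcal{D})\big] \\
&= \mathbb{E}_{q_\theta}\big[\log q_\theta(W) - \log p(Y\mid X,W) - \log p(W)\big] + \log p(Y\mid X) \\
&= -\underbrace{\Big(\mathbb{E}_{q_\theta}\big[\log p(Y\mid X,W)\big]
   - \mathrm{KL}\big(q_\theta(W)\,\|\,p(W)\big)\Big)}_{L_{ELBO}(\theta)} + \log p(Y\mid X).
\end{aligned}
```

Since $\log p(Y\mid X)$ does not depend on $\theta$, minimizing the reverse KL divergence is equivalent to maximizing $L_{ELBO}(\theta)$, which involves only the likelihood, the prior, and $q_\theta$.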
Detour: Monte Carlo Estimation and Re-parameterization Trick
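Consider a density model $p(x)$ and a function of interest $f:\mathcal{X}\to\mathbb{R}$. We assume that sampling $x\sim p(x)$ is easy but that $\mathbb{E}[f(X)]$ is difficult to compute (it has no analytical solution). Instead, one can sample $x_i\sim p(x)$ for $i=1,\dots,n$ and compute
$$\hat{\mathbb{E}}[f(X)]=\frac{1}{n}\sum_{i=1}^n f(x_i).$$
It can be shown that $\hat{\mathbb{E}}[f(X)]$ is an unbiased estimator, i.e. $\mathbb{E}\big[\hat{\mathbb{E}}[f(X)]\big]=\mathbb{E}[f(X)]$, and that it is consistent by the law of large numbers:
$$\lim_{n\to\infty}\hat{\mathbb{E}}[f(X)]=\mathbb{E}[f(X)].$$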
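As a quick illustration of the estimator, the sketch below estimates $\mathbb{E}[X^2]$ for $X\sim\mathcal{N}(0,1)$, whose exact value is 1. The choice of $f$ and of $p$, and the helper name `mc_expectation`, are assumptions made for the example.

```python
import numpy as np

def mc_expectation(f, sampler, n, rng):
    """Monte Carlo estimate of E[f(X)] from n samples drawn by `sampler`."""
    x = sampler(rng, n)
    return np.mean(f(x))

rng = np.random.default_rng(0)

# Assumed example: f(x) = x^2 with X ~ N(0, 1), so E[f(X)] = 1
estimate = mc_expectation(
    f=lambda x: x**2,
    sampler=lambda r, n: r.normal(size=n),
    n=100_000,
    rng=rng,
)
```

The error of the estimate shrinks at rate $1/\sqrt{n}$, so the result here should be close to, but not exactly, 1.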
Let's assume $w\sim\mathcal{N}(\mu,\sigma^2)$ and consider the following function
which are unbiased estimators of what we have derived previously. This trick is also known as the pathwise derivative estimator, infinitesimal perturbation analysis, and stochastic backpropagation.
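To make the trick concrete, here is a minimal NumPy sketch for $w\sim\mathcal{N}(\mu,\sigma^2)$ with a hypothetical test function $f(w)=w^2$, chosen only because its gradients are known in closed form: $\mathbb{E}[w^2]=\mu^2+\sigma^2$, so the exact gradients are $2\mu$ and $2\sigma$. The function name `reparam_grads` is illustrative.

```python
import numpy as np

def reparam_grads(mu, sigma, n, rng):
    """Pathwise gradient estimates of E[f(w)] for f(w) = w^2, w ~ N(mu, sigma^2)."""
    eps = rng.normal(size=n)          # noise that does not depend on (mu, sigma)
    w = mu + sigma * eps              # re-parameterized sample of w
    df_dw = 2.0 * w                   # derivative of the assumed f(w) = w^2
    grad_mu = np.mean(df_dw)          # chain rule with dw/dmu = 1
    grad_sigma = np.mean(df_dw * eps) # chain rule with dw/dsigma = eps
    return grad_mu, grad_sigma

rng = np.random.default_rng(0)
g_mu, g_sigma = reparam_grads(mu=1.5, sigma=0.7, n=200_000, rng=rng)
```

Because the randomness is moved into $\epsilon$, the gradient passes through the sample $w$ itself, which is what allows automatic differentiation to be used in practice.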
Approximating the Posterior with Stochastic Variational Inference
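With the re-parameterization trick, we can then construct an objective function to learn $q_\theta(W)$ with $\theta=\{\mu_{VI},\Sigma_{VI}\}$. We assume $\epsilon\sim\mathcal{N}(0,I_{Kh})$ and $q_\theta=\mathcal{N}(\mu_{VI},\Sigma_{VI})$, where
$\mu_{VI}\in\mathbb{R}^{Kh}$ and
$$\Sigma_{VI}=\mathrm{diag}(\sigma_{VI,1}^2,\dots,\sigma_{VI,Kh}^2)\in\mathbb{R}^{Kh\times Kh}.$$
In other words, we have $\hat{W}=\mu_{VI}+\Sigma_{VI}^{1/2}\epsilon$. Therefore, our learning objective is the single-sample estimate of the evidence lower bound (ELBO)
$$\hat{L}_{ELBO}(\theta)=\log p(Y\mid X,\hat{W})-\mathrm{KL}\big(q_\theta(W)\,\|\,p(W)\big),$$
which satisfies $\mathbb{E}_{p(\epsilon)}[\hat{L}_{ELBO}]=L_{ELBO}$ and $\mathbb{E}_{p(\epsilon)}[\nabla_\theta\hat{L}_{ELBO}]=\nabla_\theta L_{ELBO}$.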
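A stochastic objective of this kind can be sketched in NumPy as follows. This is an illustration rather than a definitive implementation: it assumes a softmax likelihood on given features, a prior $\mathcal{N}(0,\sigma_w^2 I)$, a log-variance parameterization of the diagonal covariance, and synthetic data. All function names are hypothetical.

```python
import numpy as np

def log_softmax(a):
    """Numerically stable log-softmax along the last axis."""
    a = a - a.max(axis=-1, keepdims=True)
    return a - np.log(np.exp(a).sum(axis=-1, keepdims=True))

def kl_diag_gaussian_to_prior(mu, log_var, prior_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, prior_var * I) ) in closed form."""
    var = np.exp(log_var)
    return 0.5 * np.sum(
        var / prior_var + mu**2 / prior_var - 1.0 + np.log(prior_var) - log_var
    )

def elbo_estimate(mu, log_var, Phi, Y, prior_var, rng):
    """Single-sample ELBO estimate using W_hat = mu + sigma * eps."""
    K, h = Y.shape[1], Phi.shape[1]
    eps = rng.normal(size=mu.shape)
    W_hat = (mu + np.exp(0.5 * log_var) * eps).reshape(K, h)
    # Log-likelihood of the one-hot labels under the sampled weights
    log_lik = np.sum(Y * log_softmax(Phi @ W_hat.T))
    return log_lik - kl_diag_gaussian_to_prior(mu, log_var, prior_var)

# Synthetic problem (assumed sizes and values)
rng = np.random.default_rng(0)
n, h, K = 20, 3, 4
Phi = rng.normal(size=(n, h))
Y = np.eye(K)[rng.integers(0, K, size=n)]    # one-hot labels
mu = np.zeros(K * h)
log_var = np.full(K * h, -2.0)
estimate = elbo_estimate(mu, log_var, Phi, Y, prior_var=1.0, rng=rng)
```

In practice one would maximize this estimate with an automatic differentiation library; the sketch only evaluates it once.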
Uncertainty in Classification
For a categorical distribution over $K$ classes with probability mass $p_{c_k}$ for $k=1,\dots,K$, the uncertainty of the distribution is measured by the entropy $H$:
$$H(\{p_{c_k}\}_k)=-\sum_{k=1}^K p_{c_k}\log p_{c_k}.$$
One observation is that the entropy is highest when $p_c=p_{c'}$ for all classes $c,c'$. In other words, this is the case when we have a uniform distribution over the $K$ classes, indicating complete ambiguity in the prediction. Note that we use the natural logarithm here; hence, $H$ is measured in nats.
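The entropy is straightforward to compute. A minimal NumPy sketch follows, using the convention $0\log 0=0$; the helper name `entropy` and the example distributions are chosen for illustration.

```python
import numpy as np

def entropy(p):
    """Entropy in nats of a categorical distribution (last axis), with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    safe = np.where(p > 0, p, 1.0)    # log(1) = 0 where p is zero
    return -np.sum(p * np.log(safe), axis=-1)

K = 4
h_uniform = entropy(np.full(K, 1.0 / K))        # maximal: log K
h_peaked = entropy([0.97, 0.01, 0.01, 0.01])    # confident prediction
h_onehot = entropy([1.0, 0.0, 0.0, 0.0])        # no uncertainty
```

The uniform distribution attains the maximum $\log K$, while a one-hot distribution has zero entropy.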
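Because the output of a neural network for classification is a parameterized categorical distribution, we can use its entropy as a measure of uncertainty. This is known as the predictive entropy
$$H[Y^*\mid x^*,\mathcal{D}]=-\sum_{k=1}^K p(y^*=c_k\mid x^*,\mathcal{D})\log p(y^*=c_k\mid x^*,\mathcal{D}),$$
where $p(y^*=c_k\mid x^*,\mathcal{D})$ can be approximated using Monte Carlo. More precisely, let $\hat{W}_t\sim q_\theta(W)$ for $t=1,\dots,T$; we have
$$p(y^*=c_k\mid x^*,\mathcal{D})\approx\frac{1}{T}\sum_{t=1}^T\big(f_{\hat{W}_t}(x^*)\big)_k=\frac{1}{T}\sum_{t=1}^T\mathrm{softmax}\big(\hat{W}_t\phi(x^*)\big)_k.$$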
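Because the ambiguity comes from two sources, 1) noisy measurements and 2) model parameters, the predictive entropy captures both aleatoric and epistemic uncertainty. To isolate the epistemic uncertainty, we can look at the mutual information between the predicted label and the weights
$$I[Y^*;W\mid x^*,\mathcal{D}]=H[Y^*\mid x^*,\mathcal{D}]-\mathbb{E}_{p(W\mid\mathcal{D})}\big[H[Y^*\mid x^*,W]\big],$$
where the first term is the predictive entropy and the second term is the average uncertainty of the prediction under the posterior. In other words, the second term captures the aleatoric uncertainty that remains once the weights are known, so the difference is the epistemic part. The second term can be computed by
$$\mathbb{E}_{p(W\mid\mathcal{D})}\big[H[Y^*\mid x^*,W]\big]\approx-\frac{1}{T}\sum_{t=1}^T\sum_{k=1}^K\big(f_{\hat{W}_t}(x^*)\big)_k\log\big(f_{\hat{W}_t}(x^*)\big)_k,$$
where $\hat{W}_t\sim q_\theta(W)$. Moreover, because entropy is non-negative, we have the following condition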
0≤I[Y∗;W]≤H[Y∗].
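Given $T$ sampled probability vectors, both quantities reduce to a few array operations. The sketch below assumes the samples are already stacked in a `(T, K)` array; `uncertainty_decomposition` and the two toy inputs are illustrative.

```python
import numpy as np

def entropy(p):
    """Entropy in nats along the last axis, with 0 log 0 = 0."""
    safe = np.where(p > 0, p, 1.0)
    return -np.sum(p * np.log(safe), axis=-1)

def uncertainty_decomposition(probs):
    """probs has shape (T, K): row t holds class probabilities under sample W_t."""
    mean_probs = probs.mean(axis=0)
    predictive_entropy = entropy(mean_probs)     # entropy of the averaged prediction
    expected_entropy = entropy(probs).mean()     # average entropy per weight sample
    mutual_information = predictive_entropy - expected_entropy
    return predictive_entropy, expected_entropy, mutual_information

# Every sampled model predicts the same 50/50 split: no disagreement
agree = np.tile([0.5, 0.5], (10, 1))
# Sampled models are individually certain but contradict each other
disagree = np.tile([[1.0, 0.0], [0.0, 1.0]], (5, 1))

h_a, e_a, mi_a = uncertainty_decomposition(agree)
h_d, e_d, mi_d = uncertainty_decomposition(disagree)
```

Both toy inputs have the same predictive entropy, $\log 2$, but only the second has non-zero mutual information, because there the uncertainty comes entirely from disagreement between weight samples.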
Stochastic Approximate Inference in Neural Networks
In the previous example, we only considered the posterior over the last layer's weights; however, we can use the variational approach to relax this constraint by considering the following model
$$f(x)=W^2\sigma(W^1x+b^1)+b^2,$$
where $W^1\in\mathbb{R}^{h\times d}$, $W^2\in\mathbb{R}^{K\times h}$, $b^1\in\mathbb{R}^h$, $b^2\in\mathbb{R}^K$. We then assume that $w^1_{ij},w^2_{ij}\sim\mathcal{N}(0,1)$. Define $W=\{W^1,W^2\}$. We thus have
$$q_\theta(W)\overset{(1)}{=}q_\theta(W^1)\,q_\theta(W^2),$$
where in (1) we assume that $W^1$ and $W^2$ are independent; this is known as mean-field variational inference. However, finding $q_\theta(W)$ requires twice as many parameters as the original network (a mean and a variance for each weight); hence, the approach becomes costly for large models.
Fig. 1: Uncertainty from Bayesian Logistic Regression and (1 hidden layer) Neural Network; PyMC3's ADVI is used for the inference.
Dropout as Approximate Inference
Dropout is one of the regularization techniques used in deep learning. In the standard setting, during training, we randomly set activations in the network to zero with probability $\rho$, i.e. the indicator that a unit is dropped is $\mathrm{Ber}(\rho)$. Then, at test time, we turn dropout off and compensate for it by rescaling: standard dropout multiplies each activation by $1-\rho$ at test time, whereas inverted dropout instead multiplies the kept activations by $\frac{1}{1-\rho}$ during training. The training and test passes are called the stochastic and deterministic forward passes, respectively.
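Let's take a closer look at this using the previous model. Define dropout masks $\epsilon^1\in\{0,1\}^d$ and $\epsilon^2\in\{0,1\}^h$ with independent entries $\epsilon^1_i\sim\mathrm{Ber}(1-\rho_1)$ and $\epsilon^2_j\sim\mathrm{Ber}(1-\rho_2)$, and let the parameters be $M^1\in\mathbb{R}^{h\times d}$, $M^2\in\mathbb{R}^{K\times h}$, $b^1\in\mathbb{R}^h$, $b^2\in\mathbb{R}^K$. We have
$$f(x)=M^2\big[\epsilon^2\odot\sigma\big(M^1(\epsilon^1\odot x)+b^1\big)\big]+b^2.$$
We can see that this is the previous model with random weights $\hat{W}^1=M^1\mathrm{diag}(\epsilon^1)$ and $\hat{W}^2=M^2\mathrm{diag}(\epsilon^2)$. Gal (2016) shows that the KL term can be approximated, up to constants, by an L2 penalty on $M^1$ and $M^2$.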
Here, it should be noted that $\rho_1$ and $\rho_2$ also have to be chosen properly; for example, one might consider using cross-validation.
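Keeping dropout active at test time and averaging several stochastic forward passes gives a Monte Carlo estimate of the predictive distribution. The NumPy sketch below illustrates this for the two-layer model; it assumes $\sigma=\tanh$ (the text does not fix the nonlinearity), random untrained weights, and a hypothetical helper `mc_dropout_predict`.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax along the last axis."""
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_predict(x, M1, b1, M2, b2, rho1, rho2, T, rng):
    """Average T stochastic forward passes with Bernoulli keep masks."""
    probs = np.empty((T, M2.shape[0]))
    for t in range(T):
        eps1 = rng.binomial(1, 1.0 - rho1, size=x.shape)        # keep mask on inputs
        hidden = np.tanh(M1 @ (eps1 * x) + b1)                  # assumed sigma = tanh
        eps2 = rng.binomial(1, 1.0 - rho2, size=hidden.shape)   # keep mask on hidden units
        probs[t] = softmax(M2 @ (eps2 * hidden) + b2)
    return probs.mean(axis=0), probs

# Random, untrained parameters (assumed, for illustration only)
rng = np.random.default_rng(0)
d, h, K = 5, 16, 3
M1, b1 = rng.normal(size=(h, d)), np.zeros(h)
M2, b2 = rng.normal(size=(K, h)), np.zeros(K)
x = rng.normal(size=d)

mean_probs, probs = mc_dropout_predict(
    x, M1, b1, M2, b2, rho1=0.2, rho2=0.5, T=100, rng=rng
)
```

The rows of `probs` play the role of the per-sample predictions $f_{\hat{W}_t}(x^*)$, so they can be passed to an entropy or mutual-information computation to obtain uncertainty estimates.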
Conclusion
Prediction is only part of the whole story. In the real world, we also need to know when our predictive models are uncertain; this is critical in high-stakes applications, such as autonomous vehicles and healthcare. Recent developments in variational inference and sampling methods allow us to approximate the posterior, hence enabling us to extract uncertainty from the model.