This article is a personal summary of Y. Gal's MLSS 2019 Moscow lecture on Bayesian Deep Learning (Slide Deck 2).
Consider a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$ where $\mathbf{x}_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$. Assume the underlying data generation process has the following form
$$f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}),$$
where $\phi(\cdot)$ is a feature extractor that is assumed to be given. Consider regression problems, i.e. $y \in \mathbb{R}$; we assume that the observed target is corrupted by noise $\epsilon$. That is,
$$y = \mathbf{w}^\top \phi(\mathbf{x}) + \epsilon,$$
where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. In other words, $p(y \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}(y; \mathbf{w}^\top \phi(\mathbf{x}), \sigma^2)$. Let us define $\mathbf{X} = \{\mathbf{x}_i\}_{i=1}^N$ and $\mathbf{Y} = \{y_i\}_{i=1}^N$. Using Bayes' rule, we have
$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{Y}) = \frac{p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{Y} \mid \mathbf{X})}.$$
If we choose the prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, s^2 \mathbf{I})$, this setup leads to ridge regression (the MAP estimate coincides with the $\ell_2$-regularized solution), and the predictive distribution has a Gaussian form.
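As a concrete illustration, here is a minimal NumPy sketch of this closed-form Gaussian posterior and predictive distribution; the quadratic feature map and the particular values of the prior variance $s^2$ and noise variance $\sigma^2$ are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression; phi is an assumed polynomial feature extractor.
def phi(x):
    return np.stack([np.ones_like(x), x, x**2], axis=-1)  # N x Q

N, sigma2, s2 = 20, 0.1, 1.0  # noise variance sigma^2, prior variance s^2 (assumed)
x = rng.uniform(-1, 1, N)
w_true = np.array([0.5, -1.0, 2.0])
y = phi(x) @ w_true + rng.normal(0, np.sqrt(sigma2), N)

Phi = phi(x)  # design matrix
# Posterior p(w | X, Y) = N(mu_post, S_post) in closed form:
S_inv = Phi.T @ Phi / sigma2 + np.eye(3) / s2
S_post = np.linalg.inv(S_inv)
mu_post = S_post @ Phi.T @ y / sigma2  # also the ridge-regression solution

# Gaussian predictive distribution at a new input x*:
x_star = np.array([0.3])
phi_star = phi(x_star)
pred_mean = phi_star @ mu_post
pred_var = sigma2 + phi_star @ S_post @ phi_star.T  # noise + parameter uncertainty
```

Note that the predictive variance decomposes into the observation noise $\sigma^2$ plus a term from the remaining uncertainty about $\mathbf{w}$.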
Let's consider a $K$-category classification problem, i.e. $y \in \{1, \dots, K\}$, and assume the underlying data generation process is
$$\mathbf{f} = \mathbf{W}^\top \phi(\mathbf{x}), \qquad p(y \mid \mathbf{x}, \mathbf{W}) = \mathrm{softmax}(\mathbf{f})_y,$$
where $\mathbf{W}$ is a weight matrix and $\mathrm{softmax}(\cdot)$ is the softmax function, whose exact form is
$$\mathrm{softmax}(\mathbf{f})_k = \frac{\exp(f_k)}{\sum_{k'=1}^{K} \exp(f_{k'})}.$$
Define $\hat{\mathbf{p}} = \mathrm{softmax}(\mathbf{W}^\top \phi(\mathbf{x}))$. Therefore, we can write the likelihood as
$$p(y = k \mid \mathbf{x}, \mathbf{W}) = \hat{p}_k.$$
However, in this classification setting, even though the prior remains Gaussian, the posterior cannot be found in closed form because the evidence becomes intractable:
$$p(\mathbf{Y} \mid \mathbf{X}) = \int p(\mathbf{Y} \mid \mathbf{X}, \mathbf{W})\, p(\mathbf{W})\, \mathrm{d}\mathbf{W}.$$
In other words, the categorical distribution (the output of the softmax function) is not conjugate to the Gaussian prior.
Approximating the Posterior: Variational Inference
Instead of computing $p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y})$ directly, we approximate it using a variational distribution $q_\theta(\mathbf{W})$. Typically, we use a simple distribution such as a Gaussian for $q_\theta$; in this case, $\theta = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}\}$. We can then find $q_\theta$ by minimizing the reverse Kullback-Leibler divergence
$$\mathrm{KL}\left(q_\theta(\mathbf{W}) \,\big\|\, p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y})\right).$$
Expanding the KL, we get
$$\mathrm{KL}\left(q_\theta(\mathbf{W}) \,\big\|\, p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y})\right) = \log p(\mathbf{Y} \mid \mathbf{X}) - \mathbb{E}_{q_\theta(\mathbf{W})}\left[\log p(\mathbf{Y} \mid \mathbf{X}, \mathbf{W})\right] + \mathrm{KL}\left(q_\theta(\mathbf{W}) \,\big\|\, p(\mathbf{W})\right).$$
Thus, since $\log p(\mathbf{Y} \mid \mathbf{X})$ does not depend on $\theta$, minimizing the KL is equivalent to maximizing the evidence lower bound (ELBO)
$$\mathcal{L}(\theta) = \mathbb{E}_{q_\theta(\mathbf{W})}\left[\log p(\mathbf{Y} \mid \mathbf{X}, \mathbf{W})\right] - \mathrm{KL}\left(q_\theta(\mathbf{W}) \,\big\|\, p(\mathbf{W})\right),$$
where the first term corresponds to how well we predict the data, and the second term is how well our approximate distribution aligns with the prior.
Stochastic Approximate Inference
Let's consider the individual likelihood in the classification setting; the log likelihood is
$$\log p(y_i \mid \mathbf{x}_i, \mathbf{W}) = \log \mathrm{softmax}\left(\mathbf{W}^\top \phi(\mathbf{x}_i)\right)_{y_i}.$$
Then, we have
$$\mathbb{E}_{q_\theta(\mathbf{W})}\left[\log p(y_i \mid \mathbf{x}_i, \mathbf{W})\right] = \int q_\theta(\mathbf{W}) \log \mathrm{softmax}\left(\mathbf{W}^\top \phi(\mathbf{x}_i)\right)_{y_i} \mathrm{d}\mathbf{W},$$
which has no analytical solution.
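Although this expectation is analytically intractable, it can be estimated by sampling, as the detour below formalizes. A minimal NumPy sketch, where the dimensions and the diagonal Gaussian $q_\theta$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_softmax(f):
    # numerically stable log-softmax along the last axis
    f = f - f.max(-1, keepdims=True)
    return f - np.log(np.exp(f).sum(-1, keepdims=True))

# Assumed toy setup: Q=4 features, K=3 classes, diagonal Gaussian q(W).
Q, K, T = 4, 3, 10_000
mu, sigma = rng.normal(size=(Q, K)), 0.3 * np.ones((Q, K))
phi_x = rng.normal(size=Q)  # phi(x) for one input
y = 1                       # observed class

# E_q[log softmax(W^T phi(x))_y] has no closed form; estimate it by sampling W:
W = mu + sigma * rng.normal(size=(T, Q, K))  # T draws from q(W)
logits = np.einsum('q,tqk->tk', phi_x, W)
estimate = log_softmax(logits)[:, y].mean()
```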
Detour: Monte Carlo Estimation and Re-parameterization Trick
Consider a density model $p_\theta(x)$ and a function of interest $f(x)$. We assume that sampling $x \sim p_\theta(x)$ can be done easily but that $I = \mathbb{E}_{p_\theta(x)}\left[f(x)\right]$ is difficult to compute (no analytical solution). Instead, one can sample $\hat{x}_t \sim p_\theta(x)$ for $t = 1, \dots, T$ and compute
$$\hat{I} = \frac{1}{T} \sum_{t=1}^{T} f(\hat{x}_t).$$
It can be shown that $\hat{I}$ is an unbiased estimator, i.e.
$$\mathbb{E}\left[\hat{I}\right] = \mathbb{E}_{p_\theta(x)}\left[f(x)\right] = I.$$
Let's assume $p_\theta(x) = \mathcal{N}(x; \mu, \sigma^2)$ with $\theta = \{\mu, \sigma\}$ and consider the following function
$$f(x) = x^2,$$
which has a closed-form solution (Link to derivation):
$$I = \mathbb{E}_{p_\theta(x)}\left[x^2\right] = \mu^2 + \sigma^2.$$
We can see that $\frac{\partial I}{\partial \mu} = 2\mu$ and $\frac{\partial I}{\partial \sigma} = 2\sigma$.
However, if we take the Monte Carlo estimate $\hat{I} = \frac{1}{T}\sum_{t=1}^{T} \hat{x}_t^2$ and compute the derivatives with respect to $\mu$ and $\sigma$, we get $\frac{\partial \hat{I}}{\partial \mu} = \frac{\partial \hat{I}}{\partial \sigma} = 0$. This is because the dependency on $\mu$ and $\sigma$ is via the density we sample from, not via the samples $\hat{x}_t$ themselves.
Instead, we can re-parameterize $x$ using
$$x = g(\theta, \epsilon) = \mu + \sigma \epsilon,$$
where $\epsilon \sim \mathcal{N}(0, 1)$. Thus, we have
$$\hat{I} = \frac{1}{T} \sum_{t=1}^{T} \left(\mu + \sigma \hat{\epsilon}_t\right)^2, \qquad \hat{\epsilon}_t \sim \mathcal{N}(0, 1).$$
We can see that
$$\frac{\partial \hat{I}}{\partial \mu} = \frac{1}{T} \sum_{t=1}^{T} 2\left(\mu + \sigma \hat{\epsilon}_t\right), \qquad \frac{\partial \hat{I}}{\partial \sigma} = \frac{1}{T} \sum_{t=1}^{T} 2\left(\mu + \sigma \hat{\epsilon}_t\right)\hat{\epsilon}_t,$$
which are unbiased estimators of the derivatives we derived previously. This trick is also known as the pathwise derivative estimator, infinitesimal perturbation analysis, and stochastic backpropagation.
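A quick numerical check of the pathwise estimator, taking $f(x) = x^2$ as the function of interest (an assumption for this sketch) with arbitrary values of $\mu$ and $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, T = 1.5, 0.8, 200_000  # arbitrary illustrative values

# Re-parameterize: x = mu + sigma * eps with eps ~ N(0, 1)
eps = rng.normal(size=T)
x = mu + sigma * eps

# Pathwise (re-parameterized) gradient estimates of I = E[x^2]:
dI_dmu = (2 * x).mean()           # should be close to 2*mu
dI_dsigma = (2 * x * eps).mean()  # should be close to 2*sigma
```

With $T = 200{,}000$ samples, both estimates land within Monte Carlo error of the analytical gradients $2\mu$ and $2\sigma$.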
Approximating the Posterior with Stochastic Variational Inference
With the re-parameterization trick, we can then construct an objective function to learn $\theta$ with stochastic gradients. We assume $q_\theta(\mathbf{W}) = \mathcal{N}\left(\mathbf{W}; \mathbf{M}, \mathrm{diag}(\boldsymbol{\sigma}^2)\right)$ and $\mathbf{W} = g(\theta, \boldsymbol{\epsilon}) = \mathbf{M} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
In other words, we have $\theta = \{\mathbf{M}, \boldsymbol{\sigma}\}$. Therefore, our learning objective is
$$\hat{\mathcal{L}}(\theta) = \sum_{i=1}^{N} \log p\left(y_i \mid \mathbf{x}_i, g(\theta, \hat{\boldsymbol{\epsilon}}_i)\right) - \mathrm{KL}\left(q_\theta(\mathbf{W}) \,\big\|\, p(\mathbf{W})\right), \qquad \hat{\boldsymbol{\epsilon}}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$
whose gradients $\frac{\partial \hat{\mathcal{L}}}{\partial \mathbf{M}}$ and $\frac{\partial \hat{\mathcal{L}}}{\partial \boldsymbol{\sigma}}$ are unbiased estimates of the ELBO's gradients and can be computed with backpropagation.
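This procedure can be sketched end-to-end in NumPy for softmax regression. The toy data, the diagonal Gaussian $q_\theta$, the standard-normal prior, and the hand-derived gradients are all illustrative assumptions for this sketch, not the lecture's implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(f):
    e = np.exp(f - f.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Toy 2-class data with an assumed identity feature extractor phi(x) = x.
N, Q, K = 200, 2, 2
X = rng.normal(size=(N, Q))
Y = (X[:, 0] + X[:, 1] > 0).astype(int)
onehot = np.eye(K)[Y]

# Variational parameters: q(W) = N(M, diag(s^2)); prior p(W) = N(0, I).
M = np.zeros((Q, K))
log_s = np.full((Q, K), -1.0)
lr = 0.05

for step in range(500):
    s = np.exp(log_s)
    eps = rng.normal(size=(Q, K))
    W = M + s * eps  # re-parameterized sample from q(W)
    p = softmax(X @ W)
    # Gradient of the log-likelihood w.r.t. the sampled W:
    g_W = X.T @ (onehot - p)
    # Pathwise gradients of the ELBO; KL(q || p) has a closed form,
    # giving dKL/dM = M and dKL/dlog_s = s^2 - 1 elementwise.
    g_M = g_W - M
    g_log_s = g_W * eps * s - (s**2 - 1)
    M += lr * g_M / N        # gradient ascent on ELBO / N
    log_s += lr * g_log_s / N

# Predict with the variational mean as a quick sanity check:
acc = (softmax(X @ M).argmax(-1) == Y).mean()
```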
Uncertainty in Classification
For a multinomial distribution with $K$ classes whose mass is $p_k$ for $k = 1, \dots, K$, the uncertainty of the distribution is indicated by the entropy
$$H[\mathbf{p}] = -\sum_{k=1}^{K} p_k \log p_k.$$
One observation is that the entropy is highest when $p_k = 1/K$ for all $k$. In other words, this is the case when we have a uniform distribution over classes, indicating absolute ambiguity in the prediction. Note that we use the natural logarithm here; hence, $H$ is measured in nats.
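A tiny sketch confirming that the uniform distribution attains the maximum entropy $\log K$ (here $K = 3$):

```python
import numpy as np

def entropy(p):
    # entropy in nats, with the convention 0 * log 0 = 0
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p, where=p > 0, out=np.zeros_like(p)))

confident = entropy([0.98, 0.01, 0.01])  # low uncertainty
uniform = entropy([1/3, 1/3, 1/3])       # maximal uncertainty: log(3) nats
```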
Because the output of a neural network for classification is a parameterized multinomial distribution, we can use the entropy as a measure of uncertainty; this is known as the predictive entropy
$$H\left[p(y \mid \mathbf{x}, \mathcal{D})\right] = -\sum_{k=1}^{K} p(y = k \mid \mathbf{x}, \mathcal{D}) \log p(y = k \mid \mathbf{x}, \mathcal{D}),$$
where $p(y = k \mid \mathbf{x}, \mathcal{D}) = \int p(y = k \mid \mathbf{x}, \mathbf{W})\, p(\mathbf{W} \mid \mathcal{D})\, \mathrm{d}\mathbf{W}$, with $\mathcal{D} = (\mathbf{X}, \mathbf{Y})$, can be approximated using Monte Carlo. More precisely, let $\hat{\mathbf{W}}_t \sim q_\theta(\mathbf{W})$ for $t = 1, \dots, T$; we have
$$p(y = k \mid \mathbf{x}, \mathcal{D}) \approx \frac{1}{T} \sum_{t=1}^{T} p\left(y = k \mid \mathbf{x}, \hat{\mathbf{W}}_t\right).$$
Because the ambiguity comes from two sources, 1) noisy measurements and 2) model parameters, the predictive entropy captures both aleatoric and epistemic uncertainty. To isolate the epistemic part, we can look at the mutual information between the predicted label and the weights
$$\mathbb{I}\left[y, \mathbf{W} \mid \mathbf{x}, \mathcal{D}\right] = H\left[p(y \mid \mathbf{x}, \mathcal{D})\right] - \mathbb{E}_{p(\mathbf{W} \mid \mathcal{D})}\Big[H\left[p(y \mid \mathbf{x}, \mathbf{W})\right]\Big],$$
where the first term is the predictive entropy and the second term is the average uncertainty of the prediction w.r.t. the posterior. The second term captures the aleatoric uncertainty, so their difference, the mutual information, captures the epistemic uncertainty. The second term can be computed by
$$\mathbb{E}_{p(\mathbf{W} \mid \mathcal{D})}\Big[H\left[p(y \mid \mathbf{x}, \mathbf{W})\right]\Big] \approx \frac{1}{T} \sum_{t=1}^{T} H\left[p(y \mid \mathbf{x}, \hat{\mathbf{W}}_t)\right],$$
where $\hat{\mathbf{W}}_t \sim q_\theta(\mathbf{W})$. Moreover, because the expected entropy is non-negative and entropy is concave (Jensen's inequality), we have the following condition:
$$0 \le \mathbb{I}\left[y, \mathbf{W} \mid \mathbf{x}, \mathcal{D}\right] \le H\left[p(y \mid \mathbf{x}, \mathcal{D})\right].$$
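The decomposition above can be sketched in NumPy. Here Dirichlet draws stand in for the per-sample softmax outputs $p(y \mid \mathbf{x}, \hat{\mathbf{W}}_t)$, which is purely an illustrative assumption; in practice each row would come from one stochastic forward pass.

```python
import numpy as np

def entropy(p, axis=-1):
    # entropy in nats; clip guards against log(0)
    return -np.sum(p * np.log(np.clip(p, 1e-12, None)), axis=axis)

rng = np.random.default_rng(4)

# Stand-in for T posterior samples of class probabilities for one input x:
probs = rng.dirichlet([2.0, 2.0, 2.0], size=1000)  # shape (T, K)

p_mean = probs.mean(axis=0)                          # MC estimate of p(y | x, D)
predictive_entropy = entropy(p_mean)                 # total uncertainty
expected_entropy = entropy(probs).mean()             # aleatoric part
mutual_info = predictive_entropy - expected_entropy  # epistemic part (BALD)
```

If all samples agreed on the same distribution, the mutual information would be zero: disagreement between posterior samples is exactly what this quantity measures.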
Stochastic Approximate Inference in Neural Networks
In the previous example, we only considered the posterior over the last layer; however, we can use the variational approach to relax this constraint by considering the following model
$$f(\mathbf{x}) = \mathbf{W}_2^\top a\left(\mathbf{W}_1^\top \mathbf{x}\right),$$
where $\mathbf{W}_1$ and $\mathbf{W}_2$ are weight matrices and $a(\cdot)$ is an activation function. We then assume Gaussian priors over $\mathbf{W}_1$ and $\mathbf{W}_2$. Define $\boldsymbol{\omega} = \{\mathbf{W}_1, \mathbf{W}_2\}$. We thus have
$$q_\theta(\boldsymbol{\omega}) \overset{(1)}{=} q_{\theta_1}(\mathbf{W}_1)\, q_{\theta_2}(\mathbf{W}_2),$$
where in (1) we assume that $\mathbf{W}_1$ and $\mathbf{W}_2$ are independent; this is known as mean-field variational inference. However, finding $q_\theta(\boldsymbol{\omega})$ requires twice as many parameters as the original network (each weight's distribution has its own mean and variance); hence, the approach is not suitable for big models.
Fig. 1: Uncertainty from Bayesian Logistic Regression and (1 hidden layer) Neural Network; PyMC3's ADVI is used for the inference.
Dropout as Approximate Inference
Dropout is one of the regularization techniques used in deep learning. In standard settings, during training, we randomly set activations in the network to zero with probability $p$, i.e. each unit's dropout mask is $\hat{z} \sim \mathrm{Bernoulli}(1 - p)$. Then, at test time, we turn off the dropout and multiply each activation by $1 - p$ to compensate for the dropped units. These two modes are called stochastic and deterministic forward passes, respectively.
Let's take a closer look at this using the previous model. Define element-wise dropout masks $\hat{\mathbf{z}}_1, \hat{\mathbf{z}}_2 \sim \mathrm{Bernoulli}(1 - p)$, and let the parameters be $\theta = \{\mathbf{M}_1, \mathbf{M}_2\}$. We have
$$\hat{\mathbf{W}}_1 = \mathrm{diag}(\hat{\mathbf{z}}_1)\, \mathbf{M}_1, \qquad \hat{\mathbf{W}}_2 = \mathrm{diag}(\hat{\mathbf{z}}_2)\, \mathbf{M}_2.$$
We can see that we can write $\hat{\boldsymbol{\omega}} = \{\hat{\mathbf{W}}_1, \hat{\mathbf{W}}_2\} = g(\theta, \hat{\mathbf{z}}_1, \hat{\mathbf{z}}_2)$ and view $\hat{\boldsymbol{\omega}}$ as a draw from a variational distribution $q_\theta(\boldsymbol{\omega})$. Gal (2016) shows that the KL term can be approximated as
$$\mathrm{KL}\left(q_\theta(\boldsymbol{\omega}) \,\big\|\, p(\boldsymbol{\omega})\right) \approx \lambda_1 \left\|\mathbf{M}_1\right\|^2 + \lambda_2 \left\|\mathbf{M}_2\right\|^2 + C,$$
where $C$ is a constant. Define $\hat{\boldsymbol{\omega}}_i = g(\theta, \hat{\mathbf{z}}_1^{(i)}, \hat{\mathbf{z}}_2^{(i)})$, i.e. a fresh dropout mask per data point. Therefore, the loss function is
$$\hat{\mathcal{L}}(\theta) = -\sum_{i=1}^{N} \log p\left(y_i \mid \mathbf{x}_i, \hat{\boldsymbol{\omega}}_i\right) + \lambda_1 \left\|\mathbf{M}_1\right\|^2 + \lambda_2 \left\|\mathbf{M}_2\right\|^2,$$
which is exactly the standard dropout training objective with $\ell_2$ weight decay.
Here, it should be noted that $\lambda_1$ and $\lambda_2$ also have to be chosen properly; for example, one might consider using cross-validation.
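Since a stochastic forward pass corresponds to a draw $\hat{\boldsymbol{\omega}} \sim q_\theta(\boldsymbol{\omega})$, the Monte Carlo predictive approximation from earlier can be computed by simply keeping dropout on at test time and averaging. A minimal sketch; the toy network, ReLU activation, and untrained random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax(f):
    e = np.exp(f - f.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Assumed toy network: one hidden layer with fixed (pretend-trained) weights.
D, Q, K, p_drop, T = 3, 16, 2, 0.5, 200
M1 = rng.normal(size=(D, Q))
M2 = rng.normal(size=(Q, K))
x = rng.normal(size=D)

# T stochastic forward passes: each dropout mask is one draw from q(omega).
preds = []
for _ in range(T):
    z = rng.binomial(1, 1 - p_drop, size=Q)  # Bernoulli mask on hidden units
    h = np.maximum(x @ M1, 0) * z            # ReLU then dropout
    preds.append(softmax(h @ M2))
preds = np.array(preds)

p_mean = preds.mean(axis=0)  # MC-dropout predictive distribution
p_std = preds.std(axis=0)    # spread across passes signals model uncertainty
```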
Prediction is only part of the whole story. In the real world, we also need to know when our predictive models are uncertain; this is critical in high-stakes applications, such as autonomous vehicles and healthcare. Recent developments in variational inference and sampling methods allow us to approximate the posterior, hence enabling us to extract uncertainty from the model.
Figures are made with Google Colab.