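Consider a dataset $D=\{(x_i,y_i)\}_{i=1}^{n}$. Let $X\in\mathbb{R}^{n\times d}$ be the matrix whose rows are the $x_i^\top$, and let $y\in\mathbb{R}^{n}$ collect the targets. Our goal is to learn $w\in\mathbb{R}^{d}$ such that
$$\hat{w}\leftarrow\arg\min_{w}\frac{1}{n}\sum_{i=1}^{n}(x_i^\top w-y_i)^2=\arg\min_{w}\frac{1}{n}(Xw-y)^\top(Xw-y).$$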
This mean-squared error objective follows from assuming
$$y=x^\top w^*+\epsilon,$$
where $\epsilon\sim\mathcal{N}(0,\sigma^2)$ and $w^*$ is the oracle weight. With this assumption, we can see that
$$y\sim\mathcal{N}(x^\top w,\sigma^2).$$
Thus, we have
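$$p(y\mid x,w)=\frac{1}{Z_x}\exp\left(-\frac{(y-w^\top x)^2}{2\sigma^2}\right),$$
where $Z_x$ is the normalizer of the Gaussian distribution. If we assume that the $(x_i,y_i)$ are independent and identically distributed, we have
$$p(D\mid w)=\prod_{i=1}^{n}p(y_i\mid x_i,w).$$
Taking the logarithm and negating gives $-\log p(D\mid w)=\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-w^\top x_i)^2+n\log Z_x$, which, up to a positive scale and a constant that does not depend on $w$, is the objective we have just mentioned. The approach based on maximizing $p(D\mid w)$ is called maximum likelihood estimation, and it yields a point estimate $\hat{w}$.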
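To make the link between least squares and maximum likelihood concrete, here is a minimal NumPy sketch. The synthetic data, the seed, and the names `w_star`, `neg_log_likelihood`, and `w_hat` are assumptions made for illustration only; they are not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = x^T w* + eps, eps ~ N(0, sigma^2)
n, d, sigma = 200, 3, 0.1
w_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_star + sigma * rng.normal(size=n)


def neg_log_likelihood(w, X, y, sigma):
    """Return -log p(D | w) under the Gaussian noise model."""
    resid = y - X @ w
    return 0.5 * resid @ resid / sigma**2 + 0.5 * len(y) * np.log(2 * np.pi * sigma**2)


# Least-squares solution, i.e. the minimizer of the mean-squared error
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient of the negative log-likelihood, evaluated at w_hat
grad = -X.T @ (y - X @ w_hat) / sigma**2
```

The gradient of the negative log-likelihood is numerically zero at `w_hat`, which is what we expect if the least-squares solution is also the maximum likelihood estimate.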
Posterior Distribution
However, in many applications we want not only $\hat{w}$ but also the distribution of $w$ given the data, so that we can estimate the (epistemic) uncertainty of a prediction. This distribution is called the posterior distribution, and it can be computed via
$$p(w\mid D)=\frac{p(D\mid w)\,p(w)}{p(D)}.$$
This is Bayes' rule. The evidence $p(D)=\int p(D\mid w)\,p(w)\,dw$ can be hard to compute; however, it can be computed analytically under suitable assumptions. In particular, we might choose a prior $p(w)$ that is structurally compatible with $p(D\mid w)$; the technical term is conjugate. For example, if $p(y\mid x,w)$ is Gaussian, we might assume that $p(w)=\mathcal{N}(0,\Sigma_w)$, where $\Sigma_w=\sigma_w^2 I_d$. With this choice of prior, we have
$$\log p(w\mid D)=-\frac{1}{2}w^\top\left(\frac{1}{\sigma^2}X^\top X+\Sigma_w^{-1}\right)w+\frac{1}{\sigma^2}w^\top X^\top y+C,$$
where $C$ collects the terms that do not depend on $w$, including $-\log p(D)$, which comes from marginalizing out $w$. From this, the posterior distribution is a multivariate Gaussian whose covariance is
$$\Sigma'=\left(\frac{1}{\sigma^2}X^\top X+\Sigma_w^{-1}\right)^{-1}.$$
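Expanding the density of the multivariate Gaussian $\mathcal{N}(\mu',\Sigma')$, we find that
$$\begin{aligned}\Sigma'^{-1}\mu'&=\frac{X^\top y}{\sigma^2}\\ \mu'&=\frac{\Sigma' X^\top y}{\sigma^2}.\end{aligned}$$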
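For readers who want the intermediate step, one way to see this is by completing the square. This is a sketch of the standard argument, not necessarily the exact route taken in the original derivation. The exponent of $\mathcal{N}(\mu',\Sigma')$ expands as

$$-\frac{1}{2}(w-\mu')^\top\Sigma'^{-1}(w-\mu')=-\frac{1}{2}w^\top\Sigma'^{-1}w+w^\top\Sigma'^{-1}\mu'-\frac{1}{2}\mu'^\top\Sigma'^{-1}\mu'.$$

The log posterior, obtained by adding the Gaussian log-likelihood and the Gaussian log-prior, has the linear term $\frac{1}{\sigma^2}w^\top X^\top y$. Matching it with $w^\top\Sigma'^{-1}\mu'$ gives $\Sigma'^{-1}\mu'=X^\top y/\sigma^2$, and multiplying both sides by $\Sigma'$ gives $\mu'$.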
Therefore, we can conclude that
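$$p(w\mid D)=\mathcal{N}\Bigg(\underbrace{\frac{\Sigma' X^\top y}{\sigma^2}}_{\mu'},\ \underbrace{\left(\frac{1}{\sigma^2}X^\top X+\Sigma_w^{-1}\right)^{-1}}_{\Sigma'}\Bigg).$$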
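The closed-form posterior translates into a few lines of NumPy. The sketch below assumes the isotropic prior $\Sigma_w=\sigma_w^2 I_d$ from the text; the function name `posterior`, the synthetic data, and the hyperparameter values are hypothetical. As a sanity check, it also computes the ridge solution with regularization strength $\lambda=\sigma^2/\sigma_w^2$, which should coincide with the posterior mean.

```python
import numpy as np


def posterior(X, y, sigma, sigma_w):
    """Mean and covariance of p(w | D) under a N(0, sigma_w^2 I) prior."""
    d = X.shape[1]
    # Posterior precision, i.e. the inverse of Sigma'
    precision = X.T @ X / sigma**2 + np.eye(d) / sigma_w**2
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / sigma**2
    return mean, cov


# Synthetic data (assumed for illustration)
rng = np.random.default_rng(0)
n, d, sigma, sigma_w = 50, 2, 0.2, 1.0
X = rng.normal(size=(n, d))
y = X @ np.array([0.5, -1.0]) + sigma * rng.normal(size=n)

mu_post, cov_post = posterior(X, y, sigma, sigma_w)

# Ridge regression with lambda = sigma^2 / sigma_w^2 gives the same point estimate
lam = sigma**2 / sigma_w**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

For larger $d$, solving the linear system with `np.linalg.solve` is usually preferable to forming the inverse explicitly; the inverse is kept here because the covariance itself is needed for the predictive variance.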
Predictive Mean and Variance
Consider a new sample $x_*$. We would like to know its prediction under the posterior (i.e., averaged over all possible $w$) and the variance of that prediction, which quantifies its uncertainty. In particular, we know that
$$\begin{aligned}p(y_*\mid x_*,D)&=\int p(y_*\mid x_*,w,D)\,p(w\mid x_*,D)\,dw\\&\overset{(\dagger)}{=}\int p(y_*\mid x_*,w)\,p(w\mid D)\,dw,\end{aligned}$$
where $(\dagger)$ is based on the assumptions that (1) the new prediction and the data $D$ are conditionally independent given the model's parameters and (2) the model's parameters are independent of the new sample $x_*$ given the data $D$. Because both terms in the integral are Gaussian, the distribution of the prediction is also Gaussian. Let us assume $p(y_*\mid x_*,D)=\mathcal{N}(\mu_*,\sigma_*^2)$. Combining the two terms, we get
$$\mu_*=x_*^\top\mu',\qquad\sigma_*^2=\sigma^2+x_*^\top\Sigma' x_*.$$
These are the predictive mean and predictive variance for linear regression with a Gaussian prior, i.e., ridge regression.
We can look closer at $\sigma_*^2$. The first term, $\sigma^2$, is a constant that we assume; more precisely, it captures aleatoric uncertainty, which is the uncertainty due to measurement noise. The second term, $x_*^\top\Sigma' x_*$, is the one we care about when making predictions. It captures epistemic uncertainty, which reflects how much we do not know about the problem or the model. Therefore, if one is interested in the uncertainty of a model $f$'s prediction, one can quantify it by
$$\mathrm{Var}(f)=x_*^\top\Sigma' x_*.$$
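As a small worked example, the sketch below evaluates the predictive mean and splits the predictive variance into its aleatoric and epistemic parts. The posterior mean and covariance used here are made-up numbers standing in for $\mu'$ and $\Sigma'$, and the function name `predict` is hypothetical.

```python
import numpy as np


def predict(x_star, mu_post, cov_post, sigma):
    """Predictive mean, epistemic variance, and total variance at x_star."""
    mean = x_star @ mu_post
    epistemic = x_star @ cov_post @ x_star  # x*^T Sigma' x*
    total = sigma**2 + epistemic            # aleatoric + epistemic
    return mean, epistemic, total


# Hypothetical posterior, standing in for mu' and Sigma'
mu_post = np.array([1.0, 2.0])
cov_post = np.array([[0.04, 0.0],
                     [0.0, 0.01]])
sigma = 0.1

x_star = np.array([1.0, 1.5])
mean, epistemic, total = predict(x_star, mu_post, cov_post, sigma)
```

With these numbers the epistemic variance is $0.04\cdot 1^2+0.01\cdot 1.5^2=0.0625$, and adding the aleatoric term $\sigma^2=0.01$ gives a total predictive variance of $0.0725$.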
Example
Now it is time to put things together. We take a dataset and train a linear regression model on four different subsets. All training samples lie in the range $[-1,1]$, while test samples lie in $[-1.5,1.5]$.
Fig. 1: Ridge regression trained on datasets of different sizes; the more training data, the more certain the prediction, especially in the extrapolation regime.
From Fig. 1, we see the effect of extrapolation in the range that our training data does not cover. Without the posterior distribution, we would not know how uncertain a prediction is if we relied only on a point estimate of the solution (i.e., the maximum likelihood or MAP solution). Nevertheless, as we add training samples, the model does well not only in the interpolation regime (covered by the training data) but also in the extrapolation regime.
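The qualitative behaviour described here can be reproduced with a short sketch. It is not the code behind Fig. 1: the feature map $[1, x]$, the subset sizes, the noise level, and the prior scale are all assumptions chosen for illustration. Note that the epistemic variance depends only on the training inputs, so no targets are needed to compute it.

```python
import numpy as np


def features(x):
    """Map scalar inputs to [1, x] so the model has an intercept (an assumption)."""
    x = np.atleast_1d(x)
    return np.column_stack([np.ones_like(x), x])


def epistemic_std(x_train, x_test, sigma, sigma_w):
    """sqrt(x*^T Sigma' x*) for each test input."""
    Phi = features(x_train)
    cov = np.linalg.inv(Phi.T @ Phi / sigma**2 + np.eye(2) / sigma_w**2)
    Phi_test = features(x_test)
    # Row-wise quadratic form phi_i^T cov phi_i
    return np.sqrt(np.einsum("ij,jk,ik->i", Phi_test, cov, Phi_test))


rng = np.random.default_rng(0)
sigma, sigma_w = 0.3, 1.0                  # hypothetical noise and prior scales
x_all = rng.uniform(-1.0, 1.0, size=200)   # training inputs in [-1, 1]
x_test = np.array([0.0, 1.5])              # interpolation vs. extrapolation

sizes = [5, 20, 80, 200]                   # hypothetical nested subset sizes
stds = np.array([epistemic_std(x_all[:m], x_test, sigma, sigma_w) for m in sizes])

for m, (s_in, s_out) in zip(sizes, stds):
    print(f"n={m:3d}  std at x=0.0: {s_in:.3f}  std at x=1.5: {s_out:.3f}")
```

Because the subsets are nested, each larger subset only adds information to the posterior precision, so the epistemic standard deviation cannot increase at any test point as more data is used.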
Conclusion and References
Perhaps, in the next step, we shall look at classification tasks and how to estimate uncertainty of the prediction in such situations.