Bayesian Linear Regression

February 12, 2021

Linear Regression

Consider a dataset D={(xi,yi)}i=1n\mathcal{D} = \{ ( \mathbf x_i, y_i ) \}_{i=1}^n. Let XRn×d\mathbf X \in \reals^{n \times d } and yRn\mathbf y \in \reals^n. Our goal is to learn wRd\mathbf w \in \reals^d s.t.

w^arg minw1ni=1n(Xwy)T(Xwy).\mathbf{\hat{w}} \leftarrow \argmin_{\mathbf w} \frac{1}{n} \sum_{i=1}^n (\mathbf X \mathbf w - \mathbf y)^T (\mathbf X \mathbf w - \mathbf y).

In fact, this mean-squared error comes from the fact that we assume

y=xTw+ϵ,y = \mathbf x^T \mathbf w^* + \epsilon,

where ϵN(0,σ2)\epsilon \sim \mathcal{N}(0, \sigma^2) and w\mathbf w^* is the oracle weight. With this assumption, we can see that

yN(xTw,σ2).y \sim \mathcal{N}(\mathbf x^T \mathbf w, \sigma^2).

Thus, we have

p(yx,w)=1Zxexp((ywTx)22σ2),p(y|\mathbf x, \mathbf w) = \frac{1}{Z_x} \exp\bigg( - \frac{ (y - \mathbf w^T \mathbf x )^2}{2\sigma^2}\bigg),

where ZxZ_x is the normalizer of the Gaussian distribution. If we assume that (xi,yi)(\mathbf x_i, y_i)'s are independent and identically distributed, we have

p(Dw)=i=1np(yixiw).p(\mathcal D | \mathbf w) = \prod_{i=1}^n p(y_i | \mathbf x_i \mathbf w).

Taking the log and negate the term yields the objective we have just mentioned. The approach based on optimizing p(Dw)p(\mathcal D | \mathbf w) is called maximum likelihood estimation, which yields a point estimate w^\mathbf{\hat{w}}.

Posterior Distribution

However, in many applications, we just do not want only w^\mathbf{\hat{w}} but rather the distribution of w\mathbf w given the data to estimate (epstemic) uncertainty of the prediction. This distribution is called the posterior distribution, and it can be computed via

p(wD)=p(Dw)p(w)p(D).p(\mathbf w | \mathcal D ) = \frac{ p(\mathcal D | \mathbf w) p(\mathbf w)}{p(\mathcal D)}.

This is essentially the Bayes' rule. One can see that p(D)=p(Dw)p(w)dwp(\mathcal D ) = \int p(\mathcal D| \mathbf w) p(\mathbf w) d\mathbf w can be hard to compute; however, this can be done (analytically) with proper assumptions. In particular, we might choose p(w)p(\mathbf w) that is structurally comparable with p(Dw)p(\mathcal D | \mathbf w); noting that the technical term is conjugate. For example, if our p(yx,w)p(y|\mathbf x ,\mathbf w) is a Gaussian, we might assume that p(w)=N(0,Σw),p(\mathbf w ) = \mathcal{N} (0, \Sigma_w), where Σw=σw2Id\Sigma_w = \sigma_w^2I_d. With this choice of prior, we have

p(Dw)p(w)=i=1np(yixi,w)p(w)=[i=1n1Zxexp((yiwTxi)22σ2)]1Zwexp(wTΣw1w2)wexp(12σ2i=1n(wTxixiTw2yiwTxi+yi2)wTΣw1w2)wexp(12(wT(1σ2i=1nxixiT)w+wTΣw1w2σ2wTXTy)+C)wexp(12(wT(1σ2XTX+Σw1)w2σ2wTXTy)),\begin{aligned} p(\mathcal D | \mathbf w) p(\mathbf w) &= \prod_{i=1}^n p(y_i | \mathbf x_i, \mathbf w) p(\mathbf w) \\ &=\bigg [ \prod_{i=1}^n \frac{1}{Z_x} \exp\bigg( - \frac{(y_i - \mathbf w^T \mathbf x_i)^2}{2\sigma^2} \bigg) \bigg] \frac{1}{Z_w} \exp \bigg(\frac{ - \mathbf w^T \Sigma_w^{-1} \mathbf w }{2}\bigg) \\ &\propto_{\mathbf w}\exp\bigg( - \frac{1}{2\sigma^2} \sum_{i=1}^n (\mathbf w^T \mathbf x_i \mathbf x_i^T \mathbf w - 2y_i\mathbf w^T \mathbf x_i + y_i^2) - \frac{ \mathbf w^T \Sigma_w^{-1} \mathbf w }{2} \bigg) \\ &\propto_{\mathbf w} \exp\bigg( - \frac{1}{2} \bigg( \mathbf w^T \bigg( \frac{1}{\sigma^2} \sum_{i=1}^n \mathbf x_i \mathbf x_i^T \bigg) \mathbf w + \mathbf w^T \Sigma_w^{-1} \mathbf w - \frac{2}{\sigma^2} \mathbf w^T \mathbf X^T \mathbf y \bigg) + C\bigg) \\ &\propto_{\mathbf w} \exp\bigg( - \frac{1}{2} \bigg( \mathbf w^T \bigg( \frac{1}{\sigma^2} \mathbf X^T \mathbf X + \Sigma_w^{-1} \bigg) \mathbf w - \frac{2}{\sigma^2} \mathbf w^T \mathbf X^T \mathbf y \bigg) \bigg), \end{aligned}

where CC is a constant when marginalizing out w\mathbf w . From this, the posterior distribution is in a multivariate Gaussian whose covariance is

Σ=(1σ2XTX+Σw1)1.\Sigma' = \bigg( \frac{1}{\sigma^2} \mathbf X^T \mathbf X + \Sigma^{-1}_w \bigg)^{-1}.

When expanding the density of multivariate Gaussian N(μ,Σ)\mathcal{N}(\mu', \Sigma'), we know that

Σ1μ=XTyσ2μ=ΣXTyσ2.\begin{aligned} \Sigma'^{-1} \mu' &= \frac{\mathbf X^T \mathbf y}{\sigma^2} \\ \mu' &= \frac{\Sigma' \mathbf X^T y }{\sigma^2}. \end{aligned}

Therefore, we can conclude that

p(wD)=N(ΣXTyσ2μ,(1σ2XTX+Σw1)1Σ).p(\mathbf w | \mathcal D) = \mathcal{N}\bigg( \underbrace{ \frac{\Sigma' \mathbf X^T y }{\sigma^2}}_{\mu'}, \underbrace{\bigg( \frac{1}{\sigma^2} \mathbf X^T \mathbf X + \Sigma^{-1}_w \bigg)^{-1} }_{\Sigma'}\bigg).

Predictive Mean and Variance

Consider a new sample x\mathbf x^*. We would like to know its prediction according to the posterior (i.e. averaging across all possible w)\mathbf w ) and its variance which is the uncertainty of the prediction. In particular, we know that

p(yx,D)=p(y,wx,D)dw=p(yw,x,D)p(wx,D)dw=()p(yw,x)p(wD)dw,\begin{aligned} p(y | \mathbf x^*, \mathcal D) &= \int p(y, \mathbf w | \mathbf x^*, \mathcal D) d\mathbf w \\ &= \int p(y| \mathbf w , \mathbf x^*, \mathcal D) p(\mathbf w| \mathbf x^*, \mathcal D) d\mathbf w \\ &\overset{(\dagger)}{=} \int p(y| \mathbf w , \mathbf x^*) p(\mathbf w| \mathcal D) d\mathbf w, \end{aligned}

where ()(\dagger) is based on the assumptions that (1) the new prediction and the data D\mathcal D given the model's parameters and (2) our model's parameters are independent of the new sample x\mathbf x^* given the data D\mathcal D. Because both terms in the integral are Gaussian, this distribution of the prediction is also a Gaussian. Let assume p(yx,D)=N(μ,σ2)p(y_*|\mathbf x_*, \mathcal D) = \mathcal N (\mu_*, \sigma_*^2). Writing the two term together, we get

p(yx,D)=p(yw,x)p(wD)dwyexp((ywTx)22σ2)exp((wμ)TΣ1(wμ)2)dwyexp((ywTx)22σ2(wμ)TΣ1(wμ)2)dw.\begin{aligned} p(y_*|\mathbf x_*, \mathcal D) &= \int p(y|\mathbf w, \mathbf x_*) p(\mathbf w | \mathcal D) d\mathbf w \\ &\propto_y \int \exp \bigg( - \frac{(y_*-\mathbf w^T \mathbf x_*)^2}{2\sigma^2}\bigg) \exp\bigg( -\frac{(\mathbf w - \mu')^T\Sigma'^{-1}(\mathbf w - \mu')}{2} \bigg) d \mathbf w \\ &\propto_y \int \exp \bigg( - \frac{(y_* - \mathbf w^T \mathbf x_*)^2}{2\sigma^2} -\frac{(\mathbf w - \mu')^T\Sigma'^{-1}(\mathbf w - \mu')}{2} \bigg) d \mathbf w. \end{aligned}

Expanding the term inside the exponent yields

(ywTx)2=(y)22wTxy+(wTx)2(wμ)Σ1(wμ)=wTΣ1w2wTΣ1μ+μTΣ1μ.\begin{aligned} (y_* - \mathbf w^T \mathbf x)^2 &= (y_*)^2 - 2\mathbf w^T \mathbf x_* y_* + (\mathbf w^T \mathbf x_*)^2 \\ (\mathbf w - \mu')\Sigma'^{-1} (\mathbf w - \mu')&= \mathbf w^T \Sigma'^{-1} \mathbf w - 2\mathbf w^T \Sigma'^{-1} \mu' + \mu'^T \Sigma^{-1}\mu'. \end{aligned}

Thus, we have

p(yx,D)yexp(y22σ2)exp(12[wT(Σ1+xxTσ2=:L)w2wT(xyσ2+Σ1μ)])dw.\begin{aligned} p(y_* | \mathbf x_*, \mathcal D) \propto_y \exp\bigg( - \frac{y_*^2}{2\sigma^2} \bigg) \int \exp \bigg ( - \frac{1}{2} \bigg[ \mathbf w^T \bigg(\underbrace{ \Sigma'^{-1} + \frac{\mathbf x_* \mathbf x_*^T}{\sigma^2}}_{=:L} \bigg)\mathbf w - 2\mathbf w^T \bigg( \frac{\mathbf x_* y_*}{\sigma^2} +\Sigma'^{-1}\mu' \bigg) \bigg]\bigg) d\mathbf w. \end{aligned}

We can see that the exponent term inside the integrate is the form of Gaussian; in particular, this is

N(ξ:=L1(xyσ2+Σ1μ),L1),\mathcal{N}\bigg(\xi:=L^{-1}\bigg(\frac{\mathbf x_* y_*}{\sigma^2} + \Sigma'^{-1}\mu' \bigg), L^{-1}\bigg),

with an assumption that L1L^{-1} exists. Because of the Gaussian form, we thus have

p(yw,x)yexp(y22σ2+ξTLξ2)exp((wξ)TL(wξ)2)dw1yexp(y22σ2+ξTLξ2)yexp(12[y2σ2ξTLξ]).\begin{aligned} p(y_* | \mathbf w, \mathbf x_*) &\propto_y \exp \bigg(- \frac{y_*^2}{2\sigma^2 } + \frac{\xi^T L \xi }{2} \bigg) \underbrace{\int \exp\bigg( -\frac{(\mathbf w - \xi)^T L (\mathbf w - \xi) }{2} \bigg) d \mathbf w}_{1} \\ &\propto_y \exp \bigg(- \frac{y_*^2}{2\sigma^2 } + \frac{\xi^T L \xi }{2} \bigg) \\ &\propto_y \exp \bigg(- \frac{1}{2} \bigg [ \frac{y_*^2}{\sigma^2 } - \xi^T L \xi \bigg] \bigg) . \end{aligned}

Because L1L^{-1} is symmetric, i.e. L1=(L1)TL^{-1} = (L^{-1})^T, we can simplify ξTLξ\xi ^T L \xi , yielding

ξTLξ=[L1(xyσ2+Σ1μ)]TL[L1(xyσ2+Σ1μ)]=[L1(xyσ2+Σ1μ)]T(xyσ2+Σ1μ)=((L1x)Txσ4)y2+2((L1x)TΣ1μσ2)y+C=(xTL1xσ4)y2+2(xTL1Σ1μσ2)y+C.\begin{aligned} \xi^T L \xi &= \bigg[L^{-1}\bigg(\frac{\mathbf x_* y_*}{\sigma^2} + \Sigma'^{-1}\mu'\bigg) \bigg]^T L \bigg[L^{-1}\bigg(\frac{\mathbf x_* y_*}{\sigma^2} + \Sigma'^{-1}\mu'\bigg) \bigg] \\ &= \bigg[L^{-1}\bigg(\frac{\mathbf x_* y_*}{\sigma^2} + \Sigma'^{-1}\mu'\bigg) \bigg]^T \bigg(\frac{\mathbf x_* y_*}{\sigma^2} + \Sigma'^{-1}\mu'\bigg) \\ &= \bigg(\frac{(L^{-1} \mathbf x_*)^T \mathbf x_*}{\sigma^4} \bigg) y_*^2 + 2\bigg(\frac{(L^{-1}\mathbf x_*)^T \Sigma'^{-1}\mu' }{\sigma^2}\bigg) y_* + C \\ &= \bigg(\frac{\mathbf x_* ^T L^{-1} \mathbf x_*}{\sigma^4}\bigg)y_*^2 + 2 \bigg ( \frac{ \mathbf x_*^T L^{-1} \Sigma'^{-1} \mu '}{\sigma^2} \bigg)y_* + C. \end{aligned}

Combining all the results together we have

p(yw,x)yexp(12[1σ2(1xTL1xσ2)y22(xTL1Σ1μσ2)y])=N(μy,σy2),\begin{aligned} p(y_* | \mathbf w, \mathbf x_*) &\propto_y \exp \bigg( - \frac{1}{2} \bigg[ \frac{1}{\sigma^2} \bigg ( 1 - \frac{\mathbf x_*^T L^{-1} \mathbf x_*}{\sigma^2} \bigg) y_*^2 - 2\bigg ( \frac{\mathbf x_*^T L^{-1}\Sigma'^{-1}\mu'}{\sigma^2} \bigg ) y_*\bigg] \bigg) \\ &= \mathcal{N}(\mu_{y_*}, \sigma_{y_*}^2), \end{aligned}

where we have

σy2=[1σ2(1xTL1xσ2)]1μy=σy2(xTL1Σ1μσ2).\begin{aligned} \sigma^2_{y_*} &= \bigg [ \frac{1}{\sigma^2} \bigg( 1 - \frac{\mathbf x_*^T L^{-1} \mathbf x_*}{\sigma^2} \bigg) \bigg]^{-1} \\ \mu_{y_*} &= \sigma_{y_*}^2 \bigg( \frac{\mathbf x_*^T L^{-1}\Sigma'^{-1}\mu'}{\sigma^2} \bigg ). \end{aligned}

These results can be further simplified by using the Sherman–Morrison formula (1). Denote a=1/σ2a=1/\sigma^2. We have

xTL1x=xT(Σ1+axxT)1x=(1)xT(ΣaΣxxTΣ1+axTΣx)x=xTΣx(1axTΣx1+axTΣx)=xTΣx1+axTΣx.\begin{aligned} \mathbf x_*^T L^{-1} \mathbf x_* &= \mathbf x_*^T \bigg( \Sigma'^{-1} + a \mathbf x_* \mathbf x_*^T \bigg)^{-1} \mathbf x_* \\ &\overset{(1)}{=} \mathbf x_*^T \bigg( \Sigma' - \frac{a \Sigma' \mathbf x_* \mathbf x_*^T \Sigma'}{1+a \mathbf x_*^T\Sigma'\mathbf x_*} \bigg) \mathbf x_* \\ &= \mathbf x_*^T \Sigma' \mathbf x_* \bigg ( 1 - \frac{a\mathbf x_*^T \Sigma' \mathbf x_*}{1+a\mathbf x_*^T \Sigma' \mathbf x_*} \bigg) \\ &= \frac{ \mathbf x_*^T \Sigma' \mathbf x_* }{1+a\mathbf x_*^T \Sigma' \mathbf x_*}. \end{aligned}


σy2=[a(1axTΣx1+axTΣx)]1=[a1+axTΣx]1=1a+xTΣx=σ2+xTΣx.\begin{aligned} \sigma_{y_*}^2 &= \bigg[ a \bigg( 1 - \frac{a\mathbf x_*^T \Sigma' \mathbf x_*}{1+a\mathbf x_*^T\Sigma' \mathbf x_*} \bigg) \bigg]^{-1} \\ &= \bigg [ \frac{a}{1+a\mathbf x_*^T \Sigma' \mathbf x_*} \bigg]^{-1} \\ &= \frac{1}{a} + \mathbf x_*^T \Sigma' \mathbf x_* \\ &= \sigma^2 + \mathbf x_*^T \Sigma' \mathbf x_*. \end{aligned}

Similary, we can follow the same steps for μy\mu_{y_*},

μy=σy2(axTL1Σ1μ)=σy2(axT(ΣaΣxxTΣ1+axTΣx)Σ1μ)=σy2a(xTμ(11+axΣx))=σy2(xTμ(11a+xΣx))=xTμ.\begin{aligned} \mu_{y_*} &= \sigma_{y_*}^2 \bigg( a \mathbf x_*^T L^{-1}\Sigma'^{-1}\mu' \bigg ) \\ &= \sigma_{y_*}^2 \bigg( a \mathbf x_*^T \bigg( \Sigma' - \frac{ a\Sigma'\mathbf x_*\mathbf x_*^T \Sigma' }{1+a\mathbf x_*^T \Sigma' \mathbf x_* } \bigg)\Sigma'^{-1}\mu'\bigg) \\ &= \sigma_{y_*}^2 a\bigg( \mathbf x_*^T \mu'\bigg(\frac{1}{1+a\mathbf x^*\Sigma' \mathbf x} \bigg) \bigg) \\ &= \cancel{\sigma_{y_*}^2} \bigg( \mathbf x_*^T \mu'\bigg(\frac{1}{\cancel{\frac{1}{a}+\mathbf x^*\Sigma' \mathbf x}} \bigg) \bigg) \\ &= \mathbf x_*^T \mu'. \end{aligned}

These are the predictive variance and predictive mean for logistic regression with Gaussian prior or ridge regression.

We can look closer at σy2\sigma_{y_*}^2. Here the first term σ2\sigma^2 is a constrant we assume; more precisely it tells us about aleatoric uncertainty, which is the uncertainy due to noise in measurement. On the other hand, the second term xTΣx\mathbf x_*^T \Sigma' \mathbf x_* is what we are interested in if we make prediction. It captures epistemic uncertainty, which indicates the level of knowledge one does not have in the problem or the model. Therefore, if one is interested in the uncertainty of her/his model ff's prediction, one can determine the uncertainy by

Var(f)=xTΣx.\text{Var}(f) = \mathbf x_*^T \Sigma' \mathbf x_*.


Now, it is time to put things together. We take a dataset and train a linear regression model on four different subsets. We assume that all train samples are in the range [1,1][-1, 1], while test samples are [1.5,1.5][-1.5, 1.5].

Fig. 1: Ridge regression trained with data with different sizes; the more training data the more certain prediction it is, especially in the extrapolation regime.

From Fig. 1, we see the effect of extrapolation in the range that our training data does not cover. Without the posterior distribution, we would not know how much uncertainty we had if we relied the point-estimate of the solution (i.e. solutions of ML or MAP). Nevertheless, as we have more training samples, our model does not only do well in the interpolation regime (where training data is covered) but also the extrapolation regime.

Conclusion and References

Perhaps, in the next step, we shall look at classification tasks and how to estimate uncertainty of the prediction in such situations.

This post is my recap of Yarin Gal's tutorial at SMILES 2019. I also consulted mathematicalmok's Youtube channel.

The figure is generated from this Google Colab.