Kernel Ridge Regression and Gaussian Processes

February 20, 2021

Background: Linear Regression

Consider that we have a dataset D={(xi,yi)}i=1n\mathcal{D} = \{ (\mathbf x_i, y_i)\}_{i=1}^n. We assume that

y=vTx+ϵ,y = \mathbf v^T \mathbf x + \epsilon,

where v\mathbf v is the true function and ϵN(0,σϵ2)\epsilon \sim \mathcal{N}(0, \sigma^2_\epsilon). Then, we can construct the likelihood

l(D;w):=p(Dw)=i=1np(yixi;w).l(\mathcal D; \mathbf w) := p(\mathcal D | \mathbf w) =\prod_{i=1}^n p(y_i | \mathbf x_i; \mathbf w).

Because yN(wTx,σ2)y \sim \mathcal{N}(\mathbf w^T\mathbf x, \sigma^2). The expected (empirical) log likelihood is the squared loss:

L(w)=1ni=1n(yiwTxi)2=1ni=1n(wTxixiTw2wTxiyi+yi2)\begin{aligned} \mathcal{L}(\mathbf w) &= \frac{1}{n} \sum_{i=1}^n (y_i - \mathbf w^T \mathbf x_i)^2 \\ &= \frac{1}{n} \sum_{i=1}^n (\mathbf w^T \mathbf x_i \mathbf x_i^T \mathbf w - 2\mathbf w^T \mathbf x_i y_i + y_i^2) \end{aligned}

Taking derivative w.r.t w\mathbf w yields

wL(w)=1ni=1n2xixiTw2xiyi.\nabla_\mathbf w \mathcal{L}(\mathbf w) = \frac{1}{n} \sum_{i=1}^n 2 \mathbf x_i \mathbf x_i^T \mathbf w - 2 \mathbf x_i y_i.

Let XRn×d=(x1,,xn)T\mathbf X \in \reals^{n\times d} = (\mathbf x_1, \dots, \mathbf x_n)^T and y=(y1,,yn)\mathbf y = (y_1, \dots, y_n) . The derivative is

wL(w)=1n(2XTXw2XTy).\nabla_\mathbf w \mathcal{L}(\mathbf w) = \frac{1}{n} \bigg( 2\mathbf X^T \mathbf X \mathbf w - 2 \mathbf X^T \mathbf y \bigg).

Setting the derivative to zero yields

w=(XTX)1XTy.\mathbf w = \big(\mathbf X^T \mathbf X\big)^{-1} \mathbf X^T \mathbf y.

For ridge regression, which has an additional term (called regularizer), λw22\lambda \| \mathbf w \|_2^2. The solution is

w=(XTX+λId)1XTy.\mathbf w = (\mathbf X^T \mathbf X + \lambda I_d )^{-1} \mathbf X^T \mathbf y.

Nonlinear Regression with Kernel : Kernel Ridge Regression

Instead of using xXRd\mathbf x \in \mathcal{X} \subseteq \reals^d, we can consider transforming x\mathbf x into some feature space first. More precisely, we would like to find ϕ:XH\phi: \mathcal{X} \rightarrow \mathcal{H} . We then assume a model

y=wTϕ(x)+ϵ.y = \mathbf w^T \phi(\mathbf x) + \epsilon.

Hence, the solution can be founded using the regularized least squared equation:

w=(ϕ(X)Tϕ(X)+λIh)1ϕ(X)Ty.\mathbf w = (\phi(\mathbf X)^T \phi(\mathbf X) + \lambda I_h)^{-1} \phi(\mathbf X)^T \mathbf y.

However, finding such ϕ(x)H\phi(\mathbf x) \mapsto \mathcal{H} is not trivial. To overcome this, we instead find ϕ()\phi(\cdot) implicity through a kernel function k:X×XR+k: \mathcal X \times \mathcal X \rightarrow \reals_+. Under technical conditions on kk, it is known that

k(xi,xj)=ϕ(xi),ϕ(xj)H=ϕ(xi)Tϕ(xj).\begin{aligned} k(\mathbf x_i, \mathbf x_j) &= \langle \phi( \mathbf x_i) , \phi( \mathbf x_j)\rangle_\mathcal{H} \\ &= \phi(\mathbf x_i)^T \phi(\mathbf x_j). \end{aligned}

Consider a test sample x\mathbf x^* . The prediction is

f(x)=ϕ(x)T(ϕ(X)ϕ(X)+λIh)1ϕ(X)Ty=ϕ(x)T(ϕ(X)Tϕ(X)+λIh)1ϕ(X)T(ϕ(X)ϕ(X)T+λIh)(ϕ(X)ϕ(X)T+λIh)1y=ϕ(x)T(ϕ(X)Tϕ(X)+λIh)1(ϕ(X)Tϕ(X)ϕ(X)T+λϕ(X)T)(ϕ(X)ϕ(X)T+λIh)1y=ϕ(x)T(ϕ(X)Tϕ(X)+λIh)1(ϕ(X)Tϕ(X)+λIh)ϕ(X)T(ϕ(X)ϕ(X)T+λIh)1y=ϕ(x)Tϕ(X)T(ϕ(X)ϕ(X)T+λIh)1y\begin{aligned} f(\mathbf x^*) &= \phi(\mathbf x^*)^T\bigg( \phi(\mathbf X)\phi(\mathbf X) + \lambda I_h \bigg)^{-1} \phi(\mathbf X)^T \mathbf y \\ &= \phi(\mathbf x^*)^T\bigg( \phi(\mathbf X)^T\phi(\mathbf X)+ \lambda I_h \bigg)^{-1} \phi(\mathbf X)^T \bigg( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_h \bigg) \bigg( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_h \bigg)^{-1} \mathbf y \\ &= \phi(\mathbf x^*)^T\bigg( \phi(\mathbf X)^T\phi(\mathbf X)+ \lambda I_h \bigg)^{-1} \bigg( \phi(\mathbf X)^T\phi(\mathbf X)\phi(\mathbf X)^T + \lambda \phi(\mathbf X)^T \bigg) \bigg( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_h \bigg)^{-1} \mathbf y \\ &= \phi(\mathbf x^*)^T\bigg( \phi(\mathbf X)^T\phi(\mathbf X)+ \lambda I_h \bigg)^{-1} \bigg( \phi(\mathbf X)^T\phi(\mathbf X) + \lambda I_h \bigg) \phi(\mathbf X)^T \bigg( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_h \bigg)^{-1} \mathbf y \\ &= \phi(\mathbf x^*)^T \phi(\mathbf X)^T \bigg( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_h \bigg)^{-1} \mathbf y \end{aligned}

Define kX(x):=ϕ(X)ϕ(x)k_{\mathbf X}(\mathbf x^*):= \phi(\mathbf X) \phi(\mathbf x^*) and K=ϕ(X)ϕ(X)T\mathbf K = \phi(\mathbf{X})\phi(\mathbf{X})^T. From the representer theorem, we also know that the solution w\mathbf w is the span of k(,xi)H,i=[1,n]k(\cdot, \mathbf x_i) \in \mathcal{H}, \forall i = [1, n]. More precisely, it is in the following form:

w=i=1nαik(,xi).\mathbf w = \sum_{i=1}^n \alpha_i k(\cdot, \mathbf x_i).

Therefore, in other words, we have

f(x)=kX(x)T(K+λIh)1y=:α,f(\mathbf x^*) = k_\mathbf X(\mathbf x^*)^T\underbrace{(\mathbf K + \lambda I_h)^{-1} \mathbf y}_{=: \alpha^*},

where α=(α1,,αn)\alpha^* = (\alpha_1, \dots, \alpha_n).

Fig. 1: Prediction from linear regression, kernel ridge regression, and Gaussian processes.

Gaussian Processes: Bayesian Kernel Regression

From Bayes' rule, we know that

p(wD)=p(Dw)p(w)p(D).p(\mathbf w | \mathcal D) = \frac{p(\mathcal D| \mathbf w) p(\mathbf w)}{p(\mathcal D)}.

In general, p(D)=p(Dw)p(w)dwp(\mathcal D) = \int p(\mathcal D | \mathbf w) p(\mathbf w) d\mathbf w is intractable. However, if we choose p(w)p(\mathbf w) appropriately, the term can be computed analytically. In particular, in the case of regression (i.e. Gaussian likelihood), we can choose p(w)=N(0,σw2I)p(\mathbf w) = \mathcal{N}(0, \sigma_w^2 I). Because Gaussian distributions are close under multiplication, the posterior distribution is also a Gaussian distribution. In this case, we have

wN(μ,Σ),\mathbf w \sim \mathcal {N}(\mu', \Sigma'),

where Σ=(σϵ2XTX+σw2Id)1\Sigma' = (\sigma^{-2}_\epsilon\mathbf X^T \mathbf X + \sigma^{-2}_wI_d)^{-1} and μ=σϵ2ΣXy\mu' = \sigma^{-2}_\epsilon \Sigma' \mathbf X \mathbf y.

Let's consider using the feature map ϕ:XH\phi : \mathcal X \rightarrow \mathcal H, we can rewrite μ,Σ\mu' ,\Sigma' as

Σ=(σϵ2ϕ(X)Tϕ(X)+Ih)1μ=σϵ2Σϕ(X)Ty.\begin{aligned} \Sigma' &= \bigg( \sigma^{-2}_\epsilon \phi(\mathbf X)^T \phi(\mathbf X) + I_h \bigg)^{-1} \\ \mu ' &= \sigma^{-2}_\epsilon \Sigma'\phi(\mathbf X)^T \mathbf y. \end{aligned}

For linear regression with Gaussian prior (i.e. ridge regression), the predictive mean and variance are

μy=xTμσy2=σϵ2+xTΣx\begin{aligned} \mu_{y_*} &= \mathbf x_*^T \mu' \\ \sigma^2_{y_*} &= \sigma^2_\epsilon + \mathbf x_*^T \Sigma' \mathbf x_* \end{aligned}

Therefore, we have

μy=ϕ(x)Tμ=σϵ2ϕ(x)TΣϕ(X)y=σϵ2ϕ(x)T(σϵ2ϕ(X)Tϕ(X)+Ih)1ϕ(X)T(σϵ2ϕ(X)ϕ(X)T+Ih)(σϵ2ϕ(X)ϕ(X)T+Ih)1y=σϵ2ϕ(x)Tϕ(X)T(σϵ2ϕ(X)ϕ(X)T+Ih)1y=ϕ(x)Tϕ(X)T(ϕ(X)ϕ(X)T+σϵ2Ih)1y=kX(x)(K+σϵ2Ih)1y\begin{aligned} \mu_{y_*} &= \phi(\mathbf x_*)^T \mu' \\ &= \sigma^{-2}_\epsilon \phi(\mathbf x_*)^T \Sigma'\phi(\mathbf X) \mathbf y \\ &= \sigma^{-2}_\epsilon \phi(\mathbf x_*)^T \bigg( \sigma^{-2}_\epsilon \phi(\mathbf X)^T \phi(\mathbf X) + I_h \bigg)^{-1} \phi(\mathbf X)^T \bigg( \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T + I_h \bigg) \bigg( \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T + I_h\bigg)^{-1} \mathbf y \\ &= \sigma^{-2}_\epsilon \phi(\mathbf x_*)^T \phi(\mathbf X)^T \bigg( \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T + I_h \bigg)^{-1} \mathbf y \\ &= \phi(\mathbf x_*)^T \phi(\mathbf X)^T \bigg( \phi(\mathbf X) \phi(\mathbf X)^T + \sigma^2_\epsilon I_h\bigg)^{-1} \mathbf y \\ &= k_\mathbf X(\mathbf x_*) (\mathbf K + \sigma^2_\epsilon I_h)^{-1} \mathbf y \end{aligned}

Similarly, using the Sherman-Morrison-Woodbury formula, we have

ϕ(x)TΣϕ(x)=ϕ(x)T(σϵ2ϕ(X)Tϕ(X)+Ih)1ϕ(x)=ϕ(x)T(IhIhσϵ2ϕ(X)T(Ih+σϵ2ϕ(X)ϕ(X)T)1ϕ(X))ϕ(x)=ϕ(x)Tϕ(x)σϵ2ϕ(x)Tϕ(X)T(Ih+σϵ2ϕ(X)ϕ(X)T)1ϕ(X)ϕ(x)=k(x,x)σϵ2kX(x)T(Ih+σϵ2K)1kX(x)=k(x,x)kX(x)T(σϵ2Ih+K)1kX(x).\begin{aligned} \phi(\mathbf x_*)^T \Sigma' \phi(\mathbf x_*) &= \phi(\mathbf x_*)^T \bigg( \sigma^{-2}_\epsilon \phi(\mathbf X)^T \phi(\mathbf X) + I_h \bigg)^{-1} \phi(\mathbf x_*)\\ &= \phi(\mathbf x_*)^T \bigg( I_h - I_h \sigma^{-2}_\epsilon \phi(\mathbf X)^T \bigg( I_h + \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T \bigg)^{-1} \phi(\mathbf X) \bigg) \phi(\mathbf x_*) \\ &= \phi(\mathbf x_*)^T\phi(\mathbf x_*) - \sigma^{-2}_\epsilon \phi(\mathbf x_*)^T \phi(\mathbf X)^T \bigg( I_h + \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T \bigg)^{-1} \phi(\mathbf X) \phi(\mathbf x_*) \\ &= k(\mathbf x_*, \mathbf x_*) - \sigma^{-2}_\epsilon k_{\mathbf X}(\mathbf x_*)^T\bigg( I_h+ \sigma^{-2}_\epsilon \mathbf K\bigg)^{-1} k_{\mathbf X}(\mathbf x_*) \\ &= k(\mathbf x_*, \mathbf x_*) - k_{\mathbf X}(\mathbf x_*)^T\bigg( \sigma^2_\epsilon I_h+ \mathbf K\bigg)^{-1} k_{\mathbf X}(\mathbf x_*). \end{aligned}

In summary, we have

p(yx,D)=N(kX(x)(K+σϵ2Ih)1y,k(x,x)kX(x)T(σϵ2Ih+K)1kX(x)).\begin{aligned} p(y_*| \mathbf x_*, \mathcal D) = \mathcal N\bigg( k_\mathbf X(\mathbf x_*) (\mathbf K + \sigma^2_\epsilon I_h)^{-1} \mathbf y , k(\mathbf x_*, \mathbf x_*) - k_{\mathbf X}(\mathbf x_*)^T\bigg( \sigma^2_\epsilon I_h+ \mathbf K\bigg)^{-1} k_{\mathbf X}(\mathbf x_*) \bigg). \end{aligned}
Fig. 2: linear regression and Gaussian processes trained with datasets with different sizes.


These are materials I consulted while writing this article:

Figures are maded using Google Colab.