## Background: Linear Regression

Consider a dataset $\mathcal{D} = \{ (\mathbf x_i, y_i)\}_{i=1}^n$ generated by the linear model
$$y = \mathbf v^T \mathbf x + \epsilon,$$

where $\mathbf v$ is the true (unknown) weight vector and $\epsilon \sim \mathcal{N}(0, \sigma^2_\epsilon)$ is additive Gaussian noise. We estimate a weight vector $\mathbf w$ by maximizing the likelihood
$$l(\mathcal D; \mathbf w) := p(\mathcal D \mid \mathbf w) =\prod_{i=1}^n p(y_i \mid \mathbf x_i; \mathbf w).$$

Because $y \sim \mathcal{N}(\mathbf w^T\mathbf x, \sigma^2_\epsilon)$ under the model, maximizing the (log-)likelihood is equivalent to minimizing the mean squared error
$$\begin{aligned} \mathcal{L}(\mathbf w) &= \frac{1}{n} \sum_{i=1}^n (y_i - \mathbf w^T \mathbf x_i)^2 \\ &= \frac{1}{n} \sum_{i=1}^n (\mathbf w^T \mathbf x_i \mathbf x_i^T \mathbf w - 2\mathbf w^T \mathbf x_i y_i + y_i^2). \end{aligned}$$

Taking the derivative with respect to $\mathbf w$ gives
$$\nabla_\mathbf w \mathcal{L}(\mathbf w) = \frac{1}{n} \sum_{i=1}^n \big( 2 \mathbf x_i \mathbf x_i^T \mathbf w - 2 \mathbf x_i y_i \big).$$

Let $\mathbf X = (\mathbf x_1, \dots, \mathbf x_n)^T \in \reals^{n\times d}$ be the design matrix and $\mathbf y = (y_1, \dots, y_n)^T$ the vector of targets. In matrix form,
$$\nabla_\mathbf w \mathcal{L}(\mathbf w) = \frac{1}{n} \bigg( 2\mathbf X^T \mathbf X \mathbf w - 2 \mathbf X^T \mathbf y \bigg).$$

Setting the derivative to zero yields
$$\mathbf w = \big(\mathbf X^T \mathbf X\big)^{-1} \mathbf X^T \mathbf y.$$

For ridge regression, which adds a regularizer $\lambda \| \mathbf w \|_2^2$ to the loss, the same derivation gives

$$\mathbf w = (\mathbf X^T \mathbf X + \lambda I_d )^{-1} \mathbf X^T \mathbf y.$$
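As a quick sanity check, both closed-form solutions can be computed directly with NumPy. This is a minimal sketch on synthetic data; the variable names (`X`, `y`, `lam`) and the toy problem are illustrative, not from the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = v^T x + noise
n, d = 100, 3
v_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ v_true + 0.1 * rng.normal(size=n)

# Ordinary least squares: w = (X^T X)^{-1} X^T y
# (solve the normal equations instead of forming the inverse explicitly)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: w = (X^T X + lambda * I_d)^{-1} X^T y
lam = 1e-2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(w_ols, w_ridge)
```

In practice one would prefer `np.linalg.lstsq` or a Cholesky solve over forming $\mathbf X^T \mathbf X$ when the problem is ill-conditioned, but the code mirrors the formulas above one-to-one.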
## Nonlinear Regression with Kernels: Kernel Ridge Regression

Instead of working with the raw input $\mathbf x \in \mathcal{X} \subseteq \reals^d$, we can transform $\mathbf x$ with a feature map $\phi: \mathcal{X} \rightarrow \mathcal{H}$ and fit a linear model in feature space:
$$y = \mathbf w^T \phi(\mathbf x) + \epsilon.$$

The solution can then be found using the regularized least-squares formula from above:
$$\mathbf w = (\phi(\mathbf X)^T \phi(\mathbf X) + \lambda I_h)^{-1} \phi(\mathbf X)^T \mathbf y,$$

where $\phi(\mathbf X) = (\phi(\mathbf x_1), \dots, \phi(\mathbf x_n))^T$ and $I_h$ denotes the identity on $\mathcal{H}$. However, constructing a good feature map $\phi(\cdot)$ explicitly is often difficult, and $\mathcal{H}$ may even be infinite-dimensional. Instead of specifying $\phi(\cdot)$ directly, we can choose a kernel function $k: \mathcal X \times \mathcal X \rightarrow \reals$ satisfying certain technical conditions (symmetry and positive semi-definiteness) such that
$$\begin{aligned} k(\mathbf x_i, \mathbf x_j) &= \langle \phi( \mathbf x_i) , \phi( \mathbf x_j)\rangle_\mathcal{H} \\ &= \phi(\mathbf x_i)^T \phi(\mathbf x_j). \end{aligned}$$

Now consider a test sample $\mathbf x^*$. Its prediction under the regularized least-squares solution is
$$\begin{aligned} f(\mathbf x^*) &= \phi(\mathbf x^*)^T\bigg( \phi(\mathbf X)^T\phi(\mathbf X) + \lambda I_h \bigg)^{-1} \phi(\mathbf X)^T \mathbf y \\ &= \phi(\mathbf x^*)^T\bigg( \phi(\mathbf X)^T\phi(\mathbf X)+ \lambda I_h \bigg)^{-1} \phi(\mathbf X)^T \bigg( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_n \bigg) \bigg( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_n \bigg)^{-1} \mathbf y \\ &= \phi(\mathbf x^*)^T\bigg( \phi(\mathbf X)^T\phi(\mathbf X)+ \lambda I_h \bigg)^{-1} \bigg( \phi(\mathbf X)^T\phi(\mathbf X)\phi(\mathbf X)^T + \lambda \phi(\mathbf X)^T \bigg) \bigg( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_n \bigg)^{-1} \mathbf y \\ &= \phi(\mathbf x^*)^T\bigg( \phi(\mathbf X)^T\phi(\mathbf X)+ \lambda I_h \bigg)^{-1} \bigg( \phi(\mathbf X)^T\phi(\mathbf X) + \lambda I_h \bigg) \phi(\mathbf X)^T \bigg( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_n \bigg)^{-1} \mathbf y \\ &= \phi(\mathbf x^*)^T \phi(\mathbf X)^T \bigg( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_n \bigg)^{-1} \mathbf y. \end{aligned}$$

Define $k_{\mathbf X}(\mathbf x^*):= \phi(\mathbf X) \phi(\mathbf x^*) \in \reals^n$ and the Gram matrix $\mathbf K := \phi(\mathbf{X})\phi(\mathbf{X})^T \in \reals^{n \times n}$. From the representer theorem, we also know that the solution $\mathbf w$ lies in the span of $k(\cdot, \mathbf x_i) \in \mathcal{H}$, $i = 1, \dots, n$:
$$\mathbf w = \sum_{i=1}^n \alpha_i k(\cdot, \mathbf x_i).$$

In other words, the prediction can be written entirely in terms of kernel evaluations:
$$f(\mathbf x^*) = k_\mathbf X(\mathbf x^*)^T\underbrace{(\mathbf K + \lambda I_n)^{-1} \mathbf y}_{=: \alpha^*},$$

where $\alpha^* = (\alpha_1, \dots, \alpha_n)^T$ collects the coefficients from the representer theorem.
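The two equations above translate directly into code: build the Gram matrix, solve for $\alpha^*$, and predict with kernel evaluations against the training set. Below is a minimal sketch assuming an RBF kernel; the kernel choice, its length scale, and the toy data are illustrative assumptions, not specifics from the derivation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * lengthscale^2)) for all pairs of rows."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2 * A @ B.T)
    return np.exp(-sq_dists / (2 * lengthscale**2))

rng = np.random.default_rng(0)

# Toy 1-D nonlinear data
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

# Kernel ridge regression: alpha* = (K + lambda * I_n)^{-1} y
lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Prediction at test points: f(x*) = k_X(x*)^T alpha*
X_test = np.linspace(-3, 3, 100)[:, None]
f_test = rbf_kernel(X_test, X) @ alpha

print(f_test[:5])
```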
Fig. 1: Prediction from linear regression, kernel ridge regression, and Gaussian processes.
## Gaussian Processes: Bayesian Kernel Regression

From Bayes' rule, we know that
$$p(\mathbf w \mid \mathcal D) = \frac{p(\mathcal D\mid \mathbf w) p(\mathbf w)}{p(\mathcal D)}.$$

In general, the evidence $p(\mathcal D) = \int p(\mathcal D \mid \mathbf w) p(\mathbf w) \, d\mathbf w$ is intractable, but here the Gaussian likelihood is conjugate to a Gaussian prior $p(\mathbf w) = \mathcal{N}(0, \sigma_w^2 I)$, so the posterior is Gaussian in closed form:
$$\mathbf w \mid \mathcal D \sim \mathcal {N}(\mu', \Sigma'),$$

where $\Sigma' = (\sigma^{-2}_\epsilon\mathbf X^T \mathbf X + \sigma^{-2}_w I_d)^{-1}$ and $\mu' = \sigma^{-2}_\epsilon \Sigma' \mathbf X^T \mathbf y$.
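These two formulas are all that is needed to compute the posterior for Bayesian linear regression. A minimal NumPy sketch, assuming synthetic data and illustrative values for $\sigma_\epsilon$ and $\sigma_w$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data
n, d = 100, 3
v_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
sigma_eps, sigma_w = 0.1, 1.0          # assumed noise / prior scales
y = X @ v_true + sigma_eps * rng.normal(size=n)

# Posterior over w:
#   Sigma' = (sigma_eps^-2 X^T X + sigma_w^-2 I_d)^-1
#   mu'    = sigma_eps^-2 Sigma' X^T y
Sigma_post = np.linalg.inv(X.T @ X / sigma_eps**2 + np.eye(d) / sigma_w**2)
mu_post = Sigma_post @ X.T @ y / sigma_eps**2

print(mu_post)       # close to v_true
print(Sigma_post)    # remaining uncertainty about the weights
```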
Now let's plug in the feature map $\phi : \mathcal X \rightarrow \mathcal H$ again and rewrite $\mu'$ and $\Sigma'$ in terms of $\phi(\mathbf X)$. Taking $\sigma_w = 1$ for simplicity (the prior scale can be absorbed into $\phi$), we get
$$\begin{aligned} \Sigma' &= \bigg( \sigma^{-2}_\epsilon \phi(\mathbf X)^T \phi(\mathbf X) + I_h \bigg)^{-1} \\ \mu ' &= \sigma^{-2}_\epsilon \Sigma'\phi(\mathbf X)^T \mathbf y. \end{aligned}$$

For linear regression with a Gaussian prior (i.e. ridge regression), the predictive mean and variance at a test point $\mathbf x_*$ are
$$\begin{aligned} \mu_{y_*} &= \mathbf x_*^T \mu' \\ \sigma^2_{y_*} &= \sigma^2_\epsilon + \mathbf x_*^T \Sigma' \mathbf x_*. \end{aligned}$$

Substituting $\phi(\mathbf x_*)$ for $\mathbf x_*$, we therefore have
$$\begin{aligned} \mu_{y_*} &= \phi(\mathbf x_*)^T \mu' \\ &= \sigma^{-2}_\epsilon \phi(\mathbf x_*)^T \Sigma'\phi(\mathbf X)^T \mathbf y \\ &= \sigma^{-2}_\epsilon \phi(\mathbf x_*)^T \bigg( \sigma^{-2}_\epsilon \phi(\mathbf X)^T \phi(\mathbf X) + I_h \bigg)^{-1} \phi(\mathbf X)^T \bigg( \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T + I_n \bigg) \bigg( \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T + I_n\bigg)^{-1} \mathbf y \\ &= \sigma^{-2}_\epsilon \phi(\mathbf x_*)^T \phi(\mathbf X)^T \bigg( \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T + I_n \bigg)^{-1} \mathbf y \\ &= \phi(\mathbf x_*)^T \phi(\mathbf X)^T \bigg( \phi(\mathbf X) \phi(\mathbf X)^T + \sigma^2_\epsilon I_n\bigg)^{-1} \mathbf y \\ &= k_\mathbf X(\mathbf x_*)^T (\mathbf K + \sigma^2_\epsilon I_n)^{-1} \mathbf y. \end{aligned}$$

Similarly, using the Sherman-Morrison-Woodbury formula for $\Sigma'$, we have
$$\begin{aligned} \phi(\mathbf x_*)^T \Sigma' \phi(\mathbf x_*) &= \phi(\mathbf x_*)^T \bigg( \sigma^{-2}_\epsilon \phi(\mathbf X)^T \phi(\mathbf X) + I_h \bigg)^{-1} \phi(\mathbf x_*)\\ &= \phi(\mathbf x_*)^T \bigg( I_h - \sigma^{-2}_\epsilon \phi(\mathbf X)^T \bigg( I_n + \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T \bigg)^{-1} \phi(\mathbf X) \bigg) \phi(\mathbf x_*) \\ &= \phi(\mathbf x_*)^T\phi(\mathbf x_*) - \sigma^{-2}_\epsilon \phi(\mathbf x_*)^T \phi(\mathbf X)^T \bigg( I_n + \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T \bigg)^{-1} \phi(\mathbf X) \phi(\mathbf x_*) \\ &= k(\mathbf x_*, \mathbf x_*) - \sigma^{-2}_\epsilon k_{\mathbf X}(\mathbf x_*)^T\bigg( I_n+ \sigma^{-2}_\epsilon \mathbf K\bigg)^{-1} k_{\mathbf X}(\mathbf x_*) \\ &= k(\mathbf x_*, \mathbf x_*) - k_{\mathbf X}(\mathbf x_*)^T\bigg( \sigma^2_\epsilon I_n+ \mathbf K\bigg)^{-1} k_{\mathbf X}(\mathbf x_*). \end{aligned}$$
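As a sanity check on this matrix identity, one can verify numerically that the weight-space variance $\phi(\mathbf x_*)^T \Sigma' \phi(\mathbf x_*)$ and the kernelized expression agree whenever $\phi$ is an explicit finite-dimensional feature map. A small sketch with a hypothetical random feature matrix (all names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 20, 5                      # n samples, h-dimensional feature space
sigma_eps = 0.3

Phi = rng.normal(size=(n, h))     # phi(X), rows are feature vectors
phi_star = rng.normal(size=h)     # phi(x_*)

# Weight-space form: phi(x_*)^T (sigma_eps^-2 Phi^T Phi + I_h)^-1 phi(x_*)
Sigma_post = np.linalg.inv(Phi.T @ Phi / sigma_eps**2 + np.eye(h))
var_weight_space = phi_star @ Sigma_post @ phi_star

# Function-space form: k(x*,x*) - k_X(x*)^T (sigma_eps^2 I_n + K)^-1 k_X(x*)
K = Phi @ Phi.T
k_star = Phi @ phi_star
var_function_space = phi_star @ phi_star - k_star @ np.linalg.solve(
    sigma_eps**2 * np.eye(n) + K, k_star)

print(np.isclose(var_weight_space, var_function_space))  # True
```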
In summary, the predictive distribution is

$$p(y_*\mid \mathbf x_*, \mathcal D) = \mathcal N\bigg( k_\mathbf X(\mathbf x_*)^T (\mathbf K + \sigma^2_\epsilon I_n)^{-1} \mathbf y ,\ \sigma^2_\epsilon + k(\mathbf x_*, \mathbf x_*) - k_{\mathbf X}(\mathbf x_*)^T\big( \sigma^2_\epsilon I_n+ \mathbf K\big)^{-1} k_{\mathbf X}(\mathbf x_*) \bigg).$$

Fig. 2: Linear regression and Gaussian processes trained on datasets of different sizes.
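Putting the mean and variance together gives the standard Gaussian process regression predictor. Below is a minimal self-contained sketch, assuming an RBF kernel and toy 1-D data; the length scale and noise level are illustrative choices:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2 * A @ B.T)
    return np.exp(-sq_dists / (2 * lengthscale**2))

rng = np.random.default_rng(0)

# Toy 1-D data
X = rng.uniform(-3, 3, size=(30, 1))
sigma_eps = 0.1
y = np.sin(X[:, 0]) + sigma_eps * rng.normal(size=30)

# Shared factor (K + sigma_eps^2 I_n) used by both mean and variance
K = rbf_kernel(X, X)
A = K + sigma_eps**2 * np.eye(len(X))
alpha = np.linalg.solve(A, y)

# Predictive mean and variance at test points
X_test = np.linspace(-3, 3, 100)[:, None]
K_star = rbf_kernel(X_test, X)                  # rows are k_X(x_*)^T
mean = K_star @ alpha
var = (sigma_eps**2
       + rbf_kernel(X_test, X_test).diagonal()
       - np.sum(K_star * np.linalg.solve(A, K_star.T).T, axis=1))

print(mean[:3], var[:3])
```

Note that the predictive mean coincides with the kernel ridge regression predictor above with $\lambda = \sigma^2_\epsilon$; the Gaussian process additionally provides the predictive variance.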
Figures are made using Google Colab .