## Background: Linear Regression

Consider a dataset $\mathcal{D} = \{ (\mathbf x_i, y_i)\}_{i=1}^n$. We assume that
$$y = \mathbf v^T \mathbf x + \epsilon,$$

where $\mathbf v$ is the true weight vector and $\epsilon \sim \mathcal{N}(0, \sigma^2_\epsilon)$. Then, we can construct the likelihood
$$l(\mathcal D; \mathbf w) := p(\mathcal D \mid \mathbf w) = \prod_{i=1}^n p(y_i \mid \mathbf x_i; \mathbf w).$$

Because $y \sim \mathcal{N}(\mathbf w^T\mathbf x, \sigma^2_\epsilon)$, maximizing the (empirical) log likelihood is equivalent to minimizing the squared loss:
$$\begin{aligned} \mathcal{L}(\mathbf w) &= \frac{1}{n} \sum_{i=1}^n (y_i - \mathbf w^T \mathbf x_i)^2 \\ &= \frac{1}{n} \sum_{i=1}^n (\mathbf w^T \mathbf x_i \mathbf x_i^T \mathbf w - 2\mathbf w^T \mathbf x_i y_i + y_i^2). \end{aligned}$$

Taking the derivative w.r.t. $\mathbf w$ yields
$$\nabla_\mathbf w \mathcal{L}(\mathbf w) = \frac{1}{n} \sum_{i=1}^n \big( 2 \mathbf x_i \mathbf x_i^T \mathbf w - 2 \mathbf x_i y_i \big).$$

Let $\mathbf X = (\mathbf x_1, \dots, \mathbf x_n)^T \in \mathbb{R}^{n\times d}$ and $\mathbf y = (y_1, \dots, y_n)^T$. In matrix form, the derivative is
$$\nabla_\mathbf w \mathcal{L}(\mathbf w) = \frac{1}{n} \big( 2\mathbf X^T \mathbf X \mathbf w - 2 \mathbf X^T \mathbf y \big).$$

Setting the derivative to zero yields
$$\mathbf w = \big(\mathbf X^T \mathbf X\big)^{-1} \mathbf X^T \mathbf y.$$

Ridge regression adds a regularization term $\lambda \| \mathbf w \|_2^2$ to this objective, and its solution is

$$\mathbf w = (\mathbf X^T \mathbf X + \lambda I_d )^{-1} \mathbf X^T \mathbf y.$$
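To make this concrete, here is a minimal NumPy sketch of both closed-form solutions; the synthetic data and the value of $\lambda$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = v^T x + Gaussian noise (all values here are made up)
n, d = 100, 3
v_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ v_true + 0.1 * rng.normal(size=n)

# Ordinary least squares: w = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: w = (X^T X + lambda * I_d)^{-1} X^T y
lam = 1e-2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(w_ols, w_ridge)
```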
## Nonlinear Regression with Kernels: Kernel Ridge Regression

Instead of working with $\mathbf x \in \mathcal{X} \subseteq \mathbb{R}^d$ directly, we can first transform $\mathbf x$ into some feature space. More precisely, we choose a feature map $\phi: \mathcal{X} \rightarrow \mathcal{H}$ and assume the model
$$y = \mathbf w^T \phi(\mathbf x) + \epsilon.$$

Hence, the solution can be found using the same regularized least-squares formula:
$$\mathbf w = (\phi(\mathbf X)^T \phi(\mathbf X) + \lambda I_h)^{-1} \phi(\mathbf X)^T \mathbf y,$$

where $\phi(\mathbf X) = (\phi(\mathbf x_1), \dots, \phi(\mathbf x_n))^T$ and $I_h$ denotes the identity on $\mathcal H$. However, constructing such a $\phi$ explicitly is not trivial. To overcome this, we instead define $\phi(\cdot)$ implicitly through a kernel function $k: \mathcal X \times \mathcal X \rightarrow \mathbb{R}$. Under technical conditions on $k$ (symmetry and positive definiteness), it is known that
$$k(\mathbf x_i, \mathbf x_j) = \langle \phi( \mathbf x_i) , \phi( \mathbf x_j)\rangle_\mathcal{H} = \phi(\mathbf x_i)^T \phi(\mathbf x_j).$$
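For example, the RBF (Gaussian) kernel $k(\mathbf x, \mathbf x') = \exp\!\big(-\|\mathbf x - \mathbf x'\|^2 / (2\ell^2)\big)$ corresponds to an infinite-dimensional feature map that we never need to write down. A minimal NumPy sketch of evaluating its Gram matrix (the lengthscale $\ell$ is an arbitrary choice here):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gram matrix with entries K[i, j] = exp(-||A_i - B_j||^2 / (2 * lengthscale^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-sq_dists / (2.0 * lengthscale**2))

X = np.random.default_rng(0).normal(size=(5, 2))
K = rbf_kernel(X, X)   # 5 x 5 Gram matrix; K[i, j] = k(x_i, x_j)
print(K.shape)
```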
Consider a test sample $\mathbf x^*$. The prediction is

$$\begin{aligned} f(\mathbf x^*) &= \phi(\mathbf x^*)^T\big( \phi(\mathbf X)^T\phi(\mathbf X) + \lambda I_h \big)^{-1} \phi(\mathbf X)^T \mathbf y \\ &= \phi(\mathbf x^*)^T\big( \phi(\mathbf X)^T\phi(\mathbf X)+ \lambda I_h \big)^{-1} \phi(\mathbf X)^T \big( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_n \big) \big( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_n \big)^{-1} \mathbf y \\ &= \phi(\mathbf x^*)^T\big( \phi(\mathbf X)^T\phi(\mathbf X)+ \lambda I_h \big)^{-1} \big( \phi(\mathbf X)^T\phi(\mathbf X)\phi(\mathbf X)^T + \lambda \phi(\mathbf X)^T \big) \big( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_n \big)^{-1} \mathbf y \\ &= \phi(\mathbf x^*)^T\big( \phi(\mathbf X)^T\phi(\mathbf X)+ \lambda I_h \big)^{-1} \big( \phi(\mathbf X)^T\phi(\mathbf X) + \lambda I_h \big) \phi(\mathbf X)^T \big( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_n \big)^{-1} \mathbf y \\ &= \phi(\mathbf x^*)^T \phi(\mathbf X)^T \big( \phi(\mathbf X)\phi(\mathbf X)^T + \lambda I_n \big)^{-1} \mathbf y, \end{aligned}$$

where $I_n$ is the $n \times n$ identity. Define $k_{\mathbf X}(\mathbf x^*):= \phi(\mathbf X) \phi(\mathbf x^*) \in \mathbb{R}^n$, whose $i$-th entry is $k(\mathbf x_i, \mathbf x^*)$, and $\mathbf K := \phi(\mathbf{X})\phi(\mathbf{X})^T$ with $\mathbf K_{ij} = k(\mathbf x_i, \mathbf x_j)$. From the representer theorem, we also know that the solution $\mathbf w$ lies in the span of $\{k(\cdot, \mathbf x_i)\}_{i=1}^n \subset \mathcal{H}$; more precisely, it has the form
$$\mathbf w = \sum_{i=1}^n \alpha_i k(\cdot, \mathbf x_i).$$

In other words, we have
$$f(\mathbf x^*) = k_\mathbf X(\mathbf x^*)^T\underbrace{(\mathbf K + \lambda I_n)^{-1} \mathbf y}_{=: \alpha^*},$$

where $\alpha^* = (\alpha_1, \dots, \alpha_n)^T$.
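Putting the pieces together, here is a minimal, self-contained kernel ridge regression sketch with an RBF kernel; the data, lengthscale, and $\lambda$ are arbitrary choices for illustration.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * lengthscale**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))             # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)  # noisy training targets

lam = 1e-2
K = rbf_kernel(X, X)                                  # K[i, j] = k(x_i, x_j)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # alpha* = (K + lambda I_n)^{-1} y

X_test = np.linspace(-3, 3, 100)[:, None]
k_star = rbf_kernel(X_test, X)   # row i holds k_X(x_i^*)
f_test = k_star @ alpha          # f(x*) = k_X(x*)^T alpha*
```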
Fig. 1: Prediction from linear regression, kernel ridge regression, and Gaussian processes.
## Gaussian Processes: Bayesian Kernel Regression

From Bayes' rule, we know that
$$p(\mathbf w \mid \mathcal D) = \frac{p(\mathcal D \mid \mathbf w)\, p(\mathbf w)}{p(\mathcal D)}.$$

In general, $p(\mathcal D) = \int p(\mathcal D \mid \mathbf w)\, p(\mathbf w)\, d\mathbf w$ is intractable. However, if we choose $p(\mathbf w)$ appropriately, this term can be computed analytically. In particular, in the case of regression (i.e. a Gaussian likelihood), we can choose the prior $p(\mathbf w) = \mathcal{N}(0, \sigma_w^2 I)$. Because Gaussian distributions are closed under multiplication, the posterior distribution is also Gaussian. In this case, we have
$$\mathbf w \mid \mathcal D \sim \mathcal {N}(\mu', \Sigma'),$$

where $\Sigma' = (\sigma^{-2}_\epsilon\mathbf X^T \mathbf X + \sigma^{-2}_w I_d)^{-1}$ and $\mu' = \sigma^{-2}_\epsilon \Sigma' \mathbf X^T \mathbf y$.
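A small NumPy sketch of computing this posterior; the noise and prior variances are assumed known and set arbitrarily here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

sigma_eps2 = 0.1**2   # noise variance (assumed known)
sigma_w2 = 1.0        # prior variance on w

# Sigma' = (sigma_eps^{-2} X^T X + sigma_w^{-2} I_d)^{-1}
Sigma_post = np.linalg.inv(X.T @ X / sigma_eps2 + np.eye(d) / sigma_w2)
# mu' = sigma_eps^{-2} Sigma' X^T y
mu_post = Sigma_post @ (X.T @ y) / sigma_eps2
```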
Using the feature map $\phi : \mathcal X \rightarrow \mathcal H$ instead of the raw inputs (and taking $\sigma_w = 1$ for simplicity), we can rewrite $\mu'$ and $\Sigma'$ as
$$\begin{aligned} \Sigma' &= \big( \sigma^{-2}_\epsilon \phi(\mathbf X)^T \phi(\mathbf X) + I_h \big)^{-1} \\ \mu ' &= \sigma^{-2}_\epsilon \Sigma'\phi(\mathbf X)^T \mathbf y. \end{aligned}$$

For linear regression with a Gaussian prior (i.e. ridge regression), the predictive mean and variance at a test point $\mathbf x_*$ are
$$\begin{aligned} \mu_{y_*} &= \mathbf x_*^T \mu', \\ \sigma^2_{y_*} &= \sigma^2_\epsilon + \mathbf x_*^T \Sigma' \mathbf x_*. \end{aligned}$$

In the feature space, the predictive mean becomes
$$\begin{aligned} \mu_{y_*} &= \phi(\mathbf x_*)^T \mu' \\ &= \sigma^{-2}_\epsilon \phi(\mathbf x_*)^T \Sigma'\phi(\mathbf X)^T \mathbf y \\ &= \sigma^{-2}_\epsilon \phi(\mathbf x_*)^T \big( \sigma^{-2}_\epsilon \phi(\mathbf X)^T \phi(\mathbf X) + I_h \big)^{-1} \phi(\mathbf X)^T \big( \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T + I_n \big) \big( \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T + I_n\big)^{-1} \mathbf y \\ &= \sigma^{-2}_\epsilon \phi(\mathbf x_*)^T \phi(\mathbf X)^T \big( \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T + I_n \big)^{-1} \mathbf y \\ &= \phi(\mathbf x_*)^T \phi(\mathbf X)^T \big( \phi(\mathbf X) \phi(\mathbf X)^T + \sigma^2_\epsilon I_n\big)^{-1} \mathbf y \\ &= k_\mathbf X(\mathbf x_*)^T (\mathbf K + \sigma^2_\epsilon I_n)^{-1} \mathbf y, \end{aligned}$$

using the same push-through identity as before. Similarly, using the Sherman-Morrison-Woodbury formula, the variance term is
$$\begin{aligned} \phi(\mathbf x_*)^T \Sigma' \phi(\mathbf x_*) &= \phi(\mathbf x_*)^T \big( \sigma^{-2}_\epsilon \phi(\mathbf X)^T \phi(\mathbf X) + I_h \big)^{-1} \phi(\mathbf x_*)\\ &= \phi(\mathbf x_*)^T \big( I_h - \sigma^{-2}_\epsilon \phi(\mathbf X)^T \big( I_n + \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T \big)^{-1} \phi(\mathbf X) \big) \phi(\mathbf x_*) \\ &= \phi(\mathbf x_*)^T\phi(\mathbf x_*) - \sigma^{-2}_\epsilon \phi(\mathbf x_*)^T \phi(\mathbf X)^T \big( I_n + \sigma^{-2}_\epsilon \phi(\mathbf X) \phi(\mathbf X)^T \big)^{-1} \phi(\mathbf X) \phi(\mathbf x_*) \\ &= k(\mathbf x_*, \mathbf x_*) - \sigma^{-2}_\epsilon k_{\mathbf X}(\mathbf x_*)^T\big( I_n + \sigma^{-2}_\epsilon \mathbf K\big)^{-1} k_{\mathbf X}(\mathbf x_*) \\ &= k(\mathbf x_*, \mathbf x_*) - k_{\mathbf X}(\mathbf x_*)^T\big( \sigma^2_\epsilon I_n + \mathbf K\big)^{-1} k_{\mathbf X}(\mathbf x_*). \end{aligned}$$
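This matrix identity is easy to sanity-check numerically; here is a small sketch with a random finite-dimensional feature matrix standing in for $\phi(\mathbf X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 20, 5                    # n samples, h-dimensional feature space
Phi = rng.normal(size=(n, h))   # stand-in for phi(X)
sigma_eps2 = 0.3

# Left-hand side: (sigma_eps^{-2} Phi^T Phi + I_h)^{-1}
lhs = np.linalg.inv(Phi.T @ Phi / sigma_eps2 + np.eye(h))
# Right-hand side via the Woodbury identity
rhs = np.eye(h) - (Phi.T @ np.linalg.inv(np.eye(n) + Phi @ Phi.T / sigma_eps2) @ Phi) / sigma_eps2
print(np.allclose(lhs, rhs))    # True
```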
In summary, we have

$$p(y_*\mid \mathbf x_*, \mathcal D) = \mathcal N\Big( k_\mathbf X(\mathbf x_*)^T (\mathbf K + \sigma^2_\epsilon I_n)^{-1} \mathbf y ,\; \sigma^2_\epsilon + k(\mathbf x_*, \mathbf x_*) - k_{\mathbf X}(\mathbf x_*)^T\big( \sigma^2_\epsilon I_n + \mathbf K\big)^{-1} k_{\mathbf X}(\mathbf x_*) \Big).$$

Fig. 2: Linear regression and Gaussian processes trained with datasets of different sizes.
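Finally, a minimal sketch of GP regression predictions using the summary formulas above (again with an RBF kernel; the lengthscale and noise variance are arbitrary choices for illustration):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * lengthscale**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
X_test = np.linspace(-3, 3, 100)[:, None]

sigma_eps2 = 0.1**2
K = rbf_kernel(X, X)                 # K[i, j] = k(x_i, x_j)
k_star = rbf_kernel(X_test, X)       # row i holds k_X(x_i^*)
A = K + sigma_eps2 * np.eye(len(X))  # K + sigma_eps^2 I_n

# Predictive mean: k_X(x*)^T (K + sigma_eps^2 I_n)^{-1} y
mean = k_star @ np.linalg.solve(A, y)

# Predictive variance: sigma_eps^2 + k(x*, x*) - k_X(x*)^T (K + sigma_eps^2 I_n)^{-1} k_X(x*)
v = np.linalg.solve(A, k_star.T)     # one column per test point
var = sigma_eps2 + rbf_kernel(X_test, X_test).diagonal() - np.sum(k_star * v.T, axis=1)
```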
Figures are made using Google Colab.