Regression Derivation

December 09, 2019
Maximum Likelihood

Consider a dataset $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$. We assume that each $y_i$ is a corrupted measurement of the true value $\hat{y}_i$ at $x_i$, with noise $\epsilon \sim \mathcal{N}(0, \sigma_D^2)$, i.e.

$$y_i = \hat{y}_i + \epsilon.$$

The goal of regression is to find a model that captures the relationship between $\hat{y}_i$ and $x_i$. Let $\hat\Theta = \{\theta_1, \dots, \theta_m\}$ denote suitable parameters of this model. Using Maximum Likelihood Estimation (MLE), this objective can be written as

$$\begin{aligned} \hat\Theta &= \underset{\Theta}{\text{argmax}} \prod_{i=1}^n P(y_i \mid x_i; \Theta) \\ &= \underset{\Theta}{\text{argmax}} \sum_{i=1}^n \ln P(y_i \mid x_i; \Theta), \end{aligned}$$

where $P(y_i \mid x_i; \Theta)$ is the likelihood of observing $y_i$ given $x_i$ under a model with parameters $\Theta$.
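As a concrete sketch of this objective, consider a simple linear model $\hat{y}_i = \theta_0 + \theta_1 x_i$ with known noise level (the model, data, and function names below are illustrative assumptions, not from the text). The log-likelihood $\sum_i \ln P(y_i \mid x_i; \Theta)$ can then be evaluated directly:

```python
import numpy as np

def gaussian_log_likelihood(theta, x, y, sigma_d):
    """Sum of ln P(y_i | x_i; theta) under y_i ~ N(theta[0] + theta[1]*x_i, sigma_d^2)."""
    y_hat = theta[0] + theta[1] * x  # model prediction ŷ_i
    # ln N(y_i; ŷ_i, σ_D²) = -ln(σ_D√(2π)) - ½((y_i - ŷ_i)/σ_D)²
    return np.sum(-np.log(sigma_d * np.sqrt(2 * np.pi))
                  - 0.5 * ((y - y_hat) / sigma_d) ** 2)

# Synthetic data: y is the model output corrupted by Gaussian noise, as in the setup above.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
true_theta = np.array([0.5, 2.0])
sigma_d = 0.3
y = true_theta[0] + true_theta[1] * x + rng.normal(0.0, sigma_d, size=x.shape)

# The log-likelihood is higher near the true parameters than far from them.
ll_true = gaussian_log_likelihood(true_theta, x, y, sigma_d)
ll_off = gaussian_log_likelihood(true_theta + 1.0, x, y, sigma_d)
```

MLE searches over $\Theta$ for the parameters that maximize exactly this quantity.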

Mean-Squared Error

Because the noise is Gaussian, $y_i \sim \mathcal{N}(\hat{y}_i, \sigma_D^2)$, and the objective above becomes

$$\begin{aligned} \hat{\Theta} &= \underset{\Theta}{\text{argmax}} \sum_{i=1}^n \ln \frac{1}{\sigma_D\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{y_i - \hat{y}_i}{\sigma_D}\right)^2} \\ &= \underset{\Theta}{\text{argmax}} \sum_{i=1}^n \ln \frac{1}{\sigma_D\sqrt{2\pi}} + \ln e^{-\frac{1}{2}\left(\frac{y_i - \hat{y}_i}{\sigma_D}\right)^2} \\ &= \underset{\Theta}{\text{argmax}} \sum_{i=1}^n \cancel{\ln \frac{1}{\sigma_D\sqrt{2\pi}}} - \frac{1}{2}\left(\frac{y_i - \hat{y}_i}{\sigma_D}\right)^2 \\ &= \underset{\Theta}{\text{argmax}} \sum_{i=1}^n -\frac{1}{2}\left(\frac{y_i - \hat{y}_i}{\sigma_D}\right)^2 \\ &= \underset{\Theta}{\text{argmin}} \sum_{i=1}^n \cancel{\frac{1}{2\sigma_D^2}} \left(y_i - \hat{y}_i\right)^2 \\ &= \underset{\Theta}{\text{argmin}} \sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2. \end{aligned}$$

We cancel the two terms in the derivation because they do not depend on $\Theta$ and therefore have no influence on the optimization.
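To make the equivalence concrete, here is a minimal numerical sketch (the linear model, synthetic data, and names are assumptions for illustration): the least-squares fit minimizes the sum of squared errors, which by the derivation above is also the maximum-likelihood estimate under Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
y = 1.0 + 3.0 * x + rng.normal(0.0, 0.2, size=x.shape)

# Design matrix for the linear model ŷ = θ0 + θ1·x.
X = np.column_stack([np.ones_like(x), x])

# Least-squares fit: minimizes Σ(y_i − ŷ_i)², i.e. the MLE under Gaussian noise.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def sse(theta):
    """Sum of squared errors Σ(y_i − ŷ_i)² for the given parameters."""
    return np.sum((y - X @ theta) ** 2)

# Perturbing θ̂ in any direction can only increase the squared error.
```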

Maximum a Posteriori

The MLE approach above considers only the likelihood term in Bayes' rule:

$$\underbrace{P(\Theta \mid \mathcal{D})}_{\text{Posterior}} = \frac{\overbrace{P(\mathcal{D} \mid \Theta)}^{\text{Likelihood}} \ \overbrace{P(\Theta)}^{\text{Prior}}}{\underbrace{P(\mathcal{D})}_{\text{Evidence}}}.$$

Because $P(\mathcal{D})$ does not depend on $\Theta$, the posterior is

$$P(\Theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \Theta)\, P(\Theta).$$

Using Maximum a Posteriori (MAP) estimation, one can find suitable parameters $\hat{\Theta}$ by maximizing $P(\Theta \mid \mathcal{D})$:

$$\begin{aligned} \hat\Theta &= \underset{\Theta}{\text{argmax}} \left( \prod_{i=1}^n P(y_i \mid x_i; \Theta) \right) P(\Theta) \\ &= \underset{\Theta}{\text{argmax}} \left( \sum_{i=1}^n \ln P(y_i \mid x_i; \Theta) \right) + \ln P(\Theta). \end{aligned}$$

Consider a zero-mean Gaussian prior on each parameter, $\theta_j \sim \mathcal{N}(0, \sigma_\theta^2)$. The second term is then

$$\begin{aligned} \ln P(\Theta) &= \ln \prod_{j=1}^{m} \frac{1}{\sigma_\theta \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{\theta_j}{\sigma_\theta}\right)^2} \\ &= \sum_{j=1}^m \ln \frac{1}{\sigma_\theta \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{\theta_j}{\sigma_\theta}\right)^2}. \end{aligned}$$
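A small sketch of this prior term (the helper name and parameter values are illustrative): under a zero-mean Gaussian prior, parameters far from zero receive lower log-probability, which is what will penalize large weights below.

```python
import numpy as np

def log_prior(theta, sigma_theta):
    """ln P(Θ) = Σ_j ln N(θ_j; 0, σ_θ²) for an i.i.d. zero-mean Gaussian prior."""
    return np.sum(-np.log(sigma_theta * np.sqrt(2 * np.pi))
                  - 0.5 * (theta / sigma_theta) ** 2)

# Parameters far from zero are less probable under the prior.
lp_small = log_prior(np.array([0.1, -0.2]), sigma_theta=1.0)
lp_large = log_prior(np.array([2.0, -3.0]), sigma_theta=1.0)
```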

$L_2$ Regularizer (Weight Decay)

Borrowing the intermediate step of the likelihood derivation from the previous section and substituting the prior term into the optimization yields

$$\begin{aligned} \hat{\Theta} &= \underset{\Theta}{\text{argmax}} \sum_{i=1}^n -\frac{1}{2}\left(\frac{y_i - \hat{y}_i}{\sigma_D}\right)^2 + \sum_{j=1}^m \ln \frac{1}{\sigma_\theta \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{\theta_j}{\sigma_\theta}\right)^2} \\ &= \underset{\Theta}{\text{argmax}} \sum_{i=1}^n -\frac{1}{2}\left(\frac{y_i - \hat{y}_i}{\sigma_D}\right)^2 - \frac{1}{2} \sum_{j=1}^m \left(\frac{\theta_j}{\sigma_\theta}\right)^2 - \cancel{m \ln \sigma_\theta \sqrt{2\pi}} \\ &= \underset{\Theta}{\text{argmin}} \sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{\sigma_D}\right)^2 + \sum_{j=1}^m \left(\frac{\theta_j}{\sigma_\theta}\right)^2 \\ &= \underset{\Theta}{\text{argmin}} \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \underbrace{\left(\frac{\sigma_D}{\sigma_\theta}\right)^2}_{\lambda} \sum_{j=1}^m \theta_j^2. \end{aligned}$$

For the last step, we multiply the objective by $\sigma_D^2$, which is positive and therefore does not change the minimizer. From the result above, $\lambda$ is a hyperparameter that governs how much we regularize the model, and it can be interpreted as the ratio between the variance of the data noise and the variance of the parameters.
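The final objective is ridge regression, which has the closed-form solution $\theta = (X^\top X + \lambda I)^{-1} X^\top y$. A minimal sketch (assuming a linear model and illustrative values of $\sigma_D$ and $\sigma_\theta$) showing that the Gaussian prior shrinks the parameters relative to plain MLE:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + rng.normal(0.0, 0.5, size=x.shape)
X = np.column_stack([np.ones_like(x), x])

# λ = (σ_D/σ_θ)², the ratio of noise variance to prior variance.
sigma_d, sigma_theta = 0.5, 1.0
lam = (sigma_d / sigma_theta) ** 2

# MAP / ridge solution: argmin Σ(y−ŷ)² + λΣθ²  ⇒  θ = (XᵀX + λI)⁻¹Xᵀy.
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Plain MLE (λ = 0) for comparison.
theta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

# The prior pulls the parameters toward zero: ‖θ_MAP‖ < ‖θ_MLE‖.
```

A larger $\sigma_\theta$ (a weaker prior) makes $\lambda$ smaller and the MAP solution closer to the MLE one.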


The content of this article is largely summarised from