Maximum Likelihood

Consider a dataset $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$. We assume that $y_i$ is a corrupted measurement of the true value $\hat{y}_i$ associated with $x_i$, with additive noise $\epsilon \sim \mathcal{N}(0, \sigma_D^2)$, i.e.
$$y_i = \hat{y}_i + \epsilon.$$

The goal of regression is to find a model that captures the relationship between $\hat{y}_i$ and $x_i$. Let $\hat\Theta = \{\theta_1, \dots, \theta_m\}$ denote suitable parameters of this model. Using Maximum Likelihood Estimation (MLE), this objective can be written as
$$\begin{aligned} \hat \Theta &= \underset{\Theta}{\text{argmax}} \prod_{i=1}^n P(y_i \mid x_i; \Theta) \\ &= \underset{\Theta}{\text{argmax}} \sum_{i=1}^n \ln P(y_i \mid x_i; \Theta), \end{aligned}$$

where $P(y_i \mid x_i; \Theta)$ is the likelihood of observing $y_i$ given $x_i$ under a model with parameters $\Theta$.
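As a quick numerical illustration of why the two forms above share the same maximizer, here is a minimal sketch in Python/NumPy. The per-point likelihood values and the five candidate parameter settings are hypothetical placeholders, not taken from any particular model.

```python
import numpy as np

# Check that the product form and the log-sum form of the MLE objective pick
# the same candidate, because ln is strictly increasing. The likelihood values
# below are illustrative placeholders.
rng = np.random.default_rng(0)
# Hypothetical per-point likelihoods P(y_i | x_i; Theta) for 5 candidate Thetas
# and 50 data points.
likelihoods = rng.uniform(1e-3, 1e-1, size=(5, 50))

product_form = np.prod(likelihoods, axis=1)          # prod_i P(y_i | x_i; Theta)
log_sum_form = np.sum(np.log(likelihoods), axis=1)   # sum_i ln P(y_i | x_i; Theta)

assert np.argmax(product_form) == np.argmax(log_sum_form)
```

Beyond convenience, working with sums of logs also avoids the numerical underflow that multiplying many small probabilities would cause.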
Mean-Squared Error

Because we assume Gaussian noise, $y_i \sim \mathcal{N}(\hat{y}_i, \sigma_D^2)$, and the objective above becomes
$$\begin{aligned}
\hat{\Theta} &= \underset{\Theta}{\text{argmax}} \sum_{i=1}^n \ln \frac{1}{\sigma_D\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{y_i - \hat{y}_i}{\sigma_D}\right)^2} \\
&= \underset{\Theta}{\text{argmax}} \sum_{i=1}^n \ln \frac{1}{\sigma_D\sqrt{2\pi}} + \ln e^{-\frac{1}{2}\left(\frac{y_i - \hat{y}_i}{\sigma_D}\right)^2} \\
&= \underset{\Theta}{\text{argmax}} \sum_{i=1}^n \cancel{\ln \frac{1}{\sigma_D\sqrt{2\pi}}} - \frac{1}{2}\left(\frac{y_i - \hat{y}_i}{\sigma_D}\right)^2 \\
&= \underset{\Theta}{\text{argmax}} \sum_{i=1}^n -\frac{1}{2}\left(\frac{y_i - \hat{y}_i}{\sigma_D}\right)^2 \\
&= \underset{\Theta}{\text{argmin}} \sum_{i=1}^n \cancel{\frac{1}{2\sigma_D^2}} \left(y_i - \hat{y}_i\right)^2 \\
&= \underset{\Theta}{\text{argmin}} \sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2.
\end{aligned}$$

We cancel the two highlighted terms because they do not depend on $\Theta$ and therefore have no influence on the optimization. The resulting objective is exactly the sum of squared errors, so under the Gaussian noise assumption MLE is equivalent to minimizing the mean-squared error.
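The equivalence is easy to check numerically. The sketch below (Python/NumPy) evaluates both objectives on a grid of candidate parameters for an assumed one-dimensional linear model $\hat{y}_i = \theta x_i$ with synthetic data; the model, noise level, and grid search are illustrative assumptions, not part of the derivation.

```python
import numpy as np

# With Gaussian noise, the theta that maximizes the log-likelihood is the theta
# that minimizes the sum of squared errors. Model y_hat = theta * x, the noise
# level, and the grid are illustrative assumptions.
rng = np.random.default_rng(0)
sigma_D = 0.5
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + rng.normal(0.0, sigma_D, size=200)     # noisy observations of 2x

thetas = np.linspace(-5.0, 5.0, 2001)                # candidate parameters
log_lik = np.array([np.sum(-0.5 * ((y - t * x) / sigma_D) ** 2
                           - np.log(sigma_D * np.sqrt(2.0 * np.pi)))
                    for t in thetas])
sse = np.array([np.sum((y - t * x) ** 2) for t in thetas])

assert np.argmax(log_lik) == np.argmin(sse)          # the same theta wins
print(thetas[np.argmin(sse)])                        # close to the true slope 2.0
```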
Maximum a Posteriori

The MLE approach above considers only the likelihood term in Bayes' rule:
$$\underbrace{P(\Theta \mid \mathcal{D})}_{\text{Posterior}} = \frac{\overbrace{P(\mathcal{D} \mid \Theta)}^{\text{Likelihood}} \; \overbrace{P(\Theta)}^{\text{Prior}}}{\underbrace{P(\mathcal{D})}_{\text{Evidence}}}$$

Because $P(\mathcal{D})$ does not depend on $\Theta$, the posterior satisfies
$$P(\Theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \Theta)\, P(\Theta).$$

Using Maximum a Posteriori (MAP) estimation, one can find suitable parameters $\hat{\Theta}$ by maximizing $P(\Theta \mid \mathcal{D})$:
$$\begin{aligned} \hat \Theta &= \underset{\Theta}{\text{argmax}} \bigg( \prod_{i=1}^n P(y_i \mid x_i; \Theta) \bigg) P(\Theta) \\ &= \underset{\Theta}{\text{argmax}} \bigg( \sum_{i=1}^n \ln P(y_i \mid x_i; \Theta) \bigg) + \ln P(\Theta). \end{aligned}$$

Assume each parameter is drawn independently as $\theta_j \sim \mathcal{N}(0, \sigma_{\theta}^2)$. The second term is then
$$\begin{aligned} \ln P(\Theta) &= \ln \prod_{j=1}^{m} \frac{1}{\sigma_\theta \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{\theta_j}{\sigma_\theta}\right)^2} \\ &= \sum_{j=1}^m \ln \frac{1}{\sigma_\theta \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{\theta_j}{\sigma_\theta}\right)^2}. \end{aligned}$$

$L_2$ Regularizer (Weight Decay)

Borrowing the intermediate derivation step of the likelihood term from the previous section and substituting the prior term into the optimization yields
$$\begin{aligned}
\hat{\Theta} &= \underset{\Theta}{\text{argmax}} \sum_{i=1}^n -\frac{1}{2}\left(\frac{y_i - \hat{y}_i}{\sigma_D}\right)^2 + \sum_{j=1}^m \ln \frac{1}{\sigma_\theta \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{\theta_j}{\sigma_\theta}\right)^2} \\
&= \underset{\Theta}{\text{argmax}} \sum_{i=1}^n -\frac{1}{2}\left(\frac{y_i - \hat{y}_i}{\sigma_D}\right)^2 - \frac{1}{2} \sum_{j=1}^m \left(\frac{\theta_j}{\sigma_\theta}\right)^2 - \cancel{m \ln \sigma_\theta \sqrt{2\pi}} \\
&= \underset{\Theta}{\text{argmin}} \sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{\sigma_D}\right)^2 + \sum_{j=1}^m \left(\frac{\theta_j}{\sigma_\theta}\right)^2 \\
&= \underset{\Theta}{\text{argmin}} \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \underbrace{\left(\frac{\sigma_D}{\sigma_\theta}\right)^2}_\lambda \sum_{j=1}^m \theta_j^2.
\end{aligned}$$

For the last step, we multiply the objective by $\sigma_D^2$, which is positive and does not depend on $\Theta$, so the minimizer is unchanged. From the result above, $\lambda$ is a hyperparameter that governs how strongly we regularize the model, and one can interpret it as the ratio between the variance of the data noise and the variance of the parameter prior.
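To see the correspondence in action, here is a minimal sketch (Python/NumPy) for an assumed one-dimensional linear model $\hat{y}_i = \theta x_i$: the MAP estimate obtained by maximizing the unnormalized log-posterior coincides with the $L_2$-regularized least-squares solution when $\lambda = (\sigma_D/\sigma_\theta)^2$. The data and the values of $\sigma_D$ and $\sigma_\theta$ are illustrative assumptions.

```python
import numpy as np

# MAP <-> L2-regularized least squares for a 1-D linear model y_hat = theta * x.
# Data, sigma_D, and sigma_theta are illustrative assumptions.
rng = np.random.default_rng(0)
sigma_D, sigma_theta = 0.5, 0.2
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + rng.normal(0.0, sigma_D, size=50)

# Closed-form ridge solution in one dimension: theta = (x.y) / (x.x + lambda),
# with lambda set to the ratio derived above.
lam = (sigma_D / sigma_theta) ** 2
theta_ridge = (x @ y) / (x @ x + lam)

# MAP estimate from a crude grid search over the unnormalized log-posterior
# ln P(D | theta) + ln P(theta), with Theta-independent constants dropped.
thetas = np.linspace(-5.0, 5.0, 100001)
resid = y[None, :] - thetas[:, None] * x[None, :]
log_post = (-0.5 * np.sum((resid / sigma_D) ** 2, axis=1)
            - 0.5 * (thetas / sigma_theta) ** 2)
theta_map = thetas[np.argmax(log_post)]

theta_mle = (x @ y) / (x @ x)        # unregularized least-squares fit
print(theta_ridge, theta_map)        # agree up to the grid resolution
print(theta_mle)                     # larger: the prior shrinks theta toward 0
```

In practice $\lambda$ is usually tuned by cross-validation rather than set from known noise and prior variances, but the ratio interpretation explains why noisier data or a tighter prior both call for stronger regularization.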
Acknowledgements

The content of this article is largely summarised from