Finding the underlying distribution that matches the data we observe is one of the most important aspects of machine learning. Without any constraints, this task seems impossible to solve; therefore, we also need to choose a set of assumptions, called an inductive bias, that describes which characteristics of the solution (or hypothesis) we prefer. One such bias is Occam's Razor, which states that, given two correct hypotheses, the simpler one is preferred.
It might sound circular, but one might ask, "How can we determine which hypothesis is simpler?" As we are interested in finding a distribution that aligns well with our observational data, we might look at the entropy of the distribution, which measures its uncertainty (or surprise). Given a set of statistics that we can extract from the data, we choose, among all distributions consistent with those statistics, the one with the highest entropy. This is Jaynes's (1957) Principle of Maximum Entropy (MaxEnt). My interpretation of this principle is that more surprise indicates that we make fewer assumptions, or fewer claims, about the data.
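To make "entropy as uncertainty" concrete, here is a minimal sketch that evaluates Shannon entropy (defined formally in the next section) on three coin distributions. The helper name `shannon_entropy` and the example probabilities are illustrative assumptions, not part of the original derivation.

```python
import math

def shannon_entropy(pmf):
    """Shannon entropy in nats of a pmf given as a list of probabilities."""
    # Terms with p == 0 contribute nothing (the limit of p * log p is 0).
    return -sum(p * math.log(p) for p in pmf if p > 0)

# Hypothetical coins: the fewer claims we make about the outcome,
# the higher the entropy.
fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
certain_coin = [1.0, 0.0]

h_fair = shannon_entropy(fair_coin)
h_biased = shannon_entropy(biased_coin)
h_certain = shannon_entropy(certain_coin)

print(h_fair, h_biased, h_certain)
```

The fair coin, which commits to nothing about the outcome, has the largest entropy, while the deterministic coin has none.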
Deriving Distributions with MaxEnt

Before diving into the derivations, let's recall the definition of entropy. Consider a random variable $X$ drawn from $P_X$ whose domain $\mathcal{X}$ can be discrete or continuous. We denote by $p_X$ its probability mass function or probability density function. The definition of Shannon's entropy is:
$$
H(P_X) := - \sum_{x \in \mathcal{X}} p_X(x) \log p_X(x).
$$

Given statistics of the data such as its expectation $\mu$ and variance $\sigma^2$, our goal is to use MaxEnt to find $p_X$ subject to the following constraints:
- $\sum_{x \in \mathcal{X}} p_X(x) = 1$;
- $p_X(x) \ge 0$ for all $x \in \mathcal{X}$ (this condition will turn out to be satisfied implicitly);
- $\mathbb{E}_{p_X}[X] = \mu$;
- $\text{Var}(X) = \mathbb{E}_{p_X}[X^2] - \mathbb{E}_{p_X}[X]^2 = \sigma^2$.

Consider the support of the distribution $\mathcal{X} = \{a, a+1, \dots, b-1, b\}$, where $a, b \in \mathbb{Z}$ and $b > a$. In this case, we have only one constraint, namely $\sum_{x \in \mathcal{X}} p_X(x) = 1$. Denote $p_x := p_X(x)$ for all $x \in \mathcal{X}$. Using Lagrange multipliers, we have
$$
\mathcal{L}(\{p_x\}, \lambda) = \bigg(- \sum_x p_x \log p_x\bigg) - \lambda \bigg(\sum_x p_x - 1\bigg).
$$

Taking the partial derivative with respect to $p_x$, we get
$$
\begin{aligned}
\frac{\partial \mathcal{L}(\{p_x\}, \lambda)}{\partial p_x} &= -1 - \log p_x - \lambda \\
&\overset{!}{=} 0.
\end{aligned}
$$

This implies that $p_x = \exp(-1 - \lambda) > 0$, so the non-negativity condition holds automatically. Furthermore, the second partial derivative with respect to $p_x$ is $-\frac{1}{p_x} < 0$; therefore, this stationary point is a maximum. Similarly, we can do the same for $\lambda$:
$$
\begin{aligned}
\frac{\partial \mathcal{L}(\{p_x\}, \lambda)}{\partial \lambda} &= - \bigg(\sum_x p_x - 1\bigg) \\
&\overset{!}{=} 0.
\end{aligned}
$$

Note that $|\mathcal{X}| = b - a + 1$. Therefore, we get
$$
\begin{aligned}
\sum_x p_x &= 1 \\
\sum_x \exp(-1-\lambda) &= 1 \\
\exp(-1-\lambda) \sum_x 1 &= 1 \\
\exp(-1-\lambda)(b-a+1) &= 1 \\
\exp(-1-\lambda) &= \frac{1}{b-a+1} \\
\lambda &= -\log\bigg(\frac{1}{b-a+1}\bigg) - 1.
\end{aligned}
$$

Hence, for all $x \in \mathcal{X}$, we have
$$
\begin{aligned}
p_x &= \exp(-1-\lambda) \\
&= \exp\bigg(\cancel{-1} + \log\bigg(\frac{1}{b-a+1}\bigg) + \cancel{1}\bigg) \\
&= \frac{1}{b-a+1},
\end{aligned}
$$

which is the probability mass function of the discrete uniform distribution.
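As a quick numerical sanity check of this result, the sketch below plugs the solved multiplier back into $p_x = \exp(-1 - \lambda)$ and compares the resulting entropy against randomly drawn probability mass functions on the same support. The support $\{3, \dots, 8\}$, the seed, and the helper names are arbitrary choices made for illustration.

```python
import math
import random

def shannon_entropy(pmf):
    """Shannon entropy in nats; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in pmf if p > 0)

def random_pmf(k):
    """Draw k non-negative weights and normalize them to sum to one."""
    weights = [random.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]

a, b = 3, 8                       # hypothetical integer support {3, ..., 8}
n = b - a + 1                     # |X| = b - a + 1
lam = -math.log(1.0 / n) - 1.0    # multiplier solved in the derivation
p_uniform = [math.exp(-1.0 - lam)] * n

random.seed(0)
h_uniform = shannon_entropy(p_uniform)
h_best_random = max(shannon_entropy(random_pmf(n)) for _ in range(1000))

print(h_uniform, math.log(n), h_best_random)
```

None of the random candidates exceeds $\log(b - a + 1)$, the entropy attained by the uniform solution. This is evidence rather than a proof; the argument in the text is what establishes the maximum.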
The derivation for the continuous case is quite similar to the one above, except that $\mathcal{X} = [a, b]$. We also need to rely on the continuous version of entropy:
$$
H(P_X) = - \int_a^b p(x) \log p(x) \, \text{d}x,
$$

called differential entropy. Note that, unlike the discrete version, differential entropy can be negative.
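A small sketch of why negativity can occur: for a constant density $1/(b-a)$ on $[a, b]$, the integral above evaluates to $\log(b - a)$, which is negative whenever the interval is shorter than one. The function names and interval widths below are chosen purely for illustration.

```python
import math

def uniform_differential_entropy(a, b):
    """-int_a^b (1/(b-a)) * log(1/(b-a)) dx, which simplifies to log(b - a)."""
    return math.log(b - a)

def discrete_uniform_entropy(n):
    """Shannon entropy of a uniform pmf over n outcomes: log(n) >= 0."""
    return math.log(n)

# Hypothetical interval widths: narrower than 1, exactly 1, wider than 1.
for width in (0.25, 1.0, 4.0):
    print(width, uniform_differential_entropy(0.0, width))

h_narrow = uniform_differential_entropy(0.0, 0.25)
h_unit = uniform_differential_entropy(0.0, 1.0)
h_single = discrete_uniform_entropy(1)
```

By contrast, the discrete entropy $\log n$ bottoms out at zero for a single outcome, so it can never go negative.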
With this, we can set up the Lagrangian as follows:

$$
\mathcal{L}(p(x), \lambda) = - \int_a^b p(x) \log p(x) \, \text{d}x - \lambda \bigg(\int_a^b p(x) \, \text{d}x - 1\bigg).
$$

Taking the functional derivative, we get
$$
\begin{aligned}
\frac{\delta}{\delta p(x)} \mathcal{L}(p(x), \lambda) &= \frac{\delta}{\delta p(x)} \bigg(- \int_a^b p(x) \log p(x) \, \text{d}x - \lambda \int_a^b p(x) \, \text{d}x\bigg) \\
&= - \frac{p(x)}{p(x)} - \log p(x) - \lambda \\
&= -1 - \log p(x) - \lambda \\
&\overset{!}{=} 0.
\end{aligned}
$$

This yields $p(x) = \exp(-1 - \lambda) > 0$. We can also see that $\frac{\delta^2 \mathcal{L}}{\delta p(x)^2} = -\frac{1}{p(x)} < 0$, so this is again a maximum. Plugging this $p(x)$ into the normalization constraint, we have
$$
\begin{aligned}
1 &= \int_a^b p(x) \, \text{d}x \\
&= \int_a^b \exp(-1-\lambda) \, \text{d}x \\
&= \exp(-1-\lambda)(b-a).
\end{aligned}
$$

We get $p(x) = \exp(-1 - \lambda) = \frac{1}{b-a}$, which is the density of the continuous uniform distribution.
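The continuous result can also be checked numerically. The sketch below approximates the differential entropy with a midpoint rule and compares the uniform density with one alternative density on the same interval, a symmetric triangle. The interval $[0, 2]$, the step count, and the choice of competitor are all assumptions made for illustration.

```python
import math

def differential_entropy(pdf, a, b, steps=200_000):
    """Midpoint-rule approximation of -int_a^b pdf(x) * log(pdf(x)) dx."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        p = pdf(a + (i + 0.5) * h)
        if p > 0.0:
            total -= p * math.log(p) * h
    return total

a, b = 0.0, 2.0          # hypothetical support [a, b]
mid = 0.5 * (a + b)

def uniform_pdf(x):
    return 1.0 / (b - a)

def triangular_pdf(x):
    # Symmetric triangle on [a, b] peaking at the midpoint; it integrates to 1.
    peak = 2.0 / (b - a)
    return peak * (1.0 - abs(x - mid) / (mid - a))

h_uniform = differential_entropy(uniform_pdf, a, b)
h_triangular = differential_entropy(triangular_pdf, a, b)

print(h_uniform, h_triangular)
```

The triangular density concentrates mass near the midpoint, which amounts to an extra claim about the data, and its entropy is correspondingly lower than that of the uniform density.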
Univariate Gaussian Distribution

Now, let's consider $\mathcal{X} = (-\infty, \infty)$, and suppose we want to find $p(x)$ such that $\mathbb{E}_p[X] = \mu$ and $\text{Var}(X) = \sigma^2$. These two constraints can be combined into the single condition $\int p(x)(x-\mu)^2 \, \text{d}x = \sigma^2$; the maximizer found below is symmetric about $\mu$, so the mean constraint is then satisfied automatically. Therefore, we have the following Lagrangian:
$$
\begin{aligned}
\mathcal{L}(p(x), \lambda_1, \lambda_2) &= - \int_\mathcal{X} p(x) \log p(x) \, \text{d}x \\
&\quad + \lambda_1 \bigg(\int_\mathcal{X} p(x) \, \text{d}x - 1\bigg) \\
&\quad + \lambda_2 \bigg(\int_\mathcal{X} (x-\mu)^2 p(x) \, \text{d}x - \sigma^2\bigg),
\end{aligned}
$$

where we add the constraint terms (the sign in front of a multiplier is only a convention). Taking the functional derivative, we get

$$
\begin{aligned}
\frac{\delta}{\delta p(x)} \mathcal{L}(p(\cdot), \lambda_1, \lambda_2) &= -1 - \ln p(x) + \lambda_1 + \lambda_2 (x-\mu)^2 \\
&\overset{!}{=} 0.
\end{aligned}
$$

Solving the equation leads to $p(x) = \exp(\lambda_1 + \lambda_2 (x-\mu)^2 - 1) > 0$, and the second derivative, $-\frac{1}{p(x)}$, is again negative. Based on the normalization constraint, we know that
$$
\begin{aligned}
1 &= \int p(x) \, \text{d}x \\
&= \int \exp(\lambda_1 + \lambda_2 (x-\mu)^2 - 1) \, \text{d}x \\
&= \exp(\lambda_1 - 1) \int \exp(\lambda_2 (\underbrace{x-\mu}_{\triangleq z})^2) \, \text{d}x \\
&= \exp(\lambda_1 - 1) \int \exp(\lambda_2 z^2) \, \text{d}z \\
&= \exp(\lambda_1 - 1) \sqrt{\frac{\pi}{-\lambda_2}},
\end{aligned}
$$

where the last step requires $\lambda_2 < 0$ and uses the identity $\int_{-\infty}^{\infty} z^{2n} \exp(-\alpha z^2) \, \text{d}z = \sqrt{\frac{\pi}{\alpha}} \frac{(2n-1)!!}{(2\alpha)^n}$ for $\alpha > 0$, with $n = 0$, $\alpha = -\lambda_2$, and the convention $(-1)!! = 1$ (see Appendix XX for the derivation of the identity). Rearranging the equation, we arrive at
$$
\exp(\lambda_1 - 1) = \sqrt{\frac{-\lambda_2}{\pi}}.
$$

Now, we consider the second constraint:
$$
\begin{aligned}
\sigma^2 &= \int (x-\mu)^2 p(x) \, \text{d}x \\
&= \int (x-\mu)^2 \exp(\lambda_1 + \lambda_2 (x-\mu)^2 - 1) \, \text{d}x \\
&= \int z^2 \exp(\lambda_1 + \lambda_2 z^2 - 1) \, \text{d}z \\
&= \exp(\lambda_1 - 1) \int z^2 \exp(\lambda_2 z^2) \, \text{d}z \\
&= \cancel{\exp(\lambda_1 - 1)} \bigg(\frac{1}{-2\lambda_2} \cancel{\sqrt{\frac{\pi}{-\lambda_2}}}\bigg) \\
&= -\frac{1}{2\lambda_2}.
\end{aligned}
$$

The last integral is the same Gaussian identity with $n = 1$, and the cancellation uses $\exp(\lambda_1 - 1) \sqrt{\frac{\pi}{-\lambda_2}} = 1$ from the normalization constraint. Therefore, we get $\lambda_2 = -\frac{1}{2\sigma^2}$. Plugging this back into the identity for $\exp(\lambda_1 - 1)$ yields
$$
\begin{aligned}
\exp(\lambda_1 - 1) &= \sqrt{\frac{1}{2\pi\sigma^2}} \\
\lambda_1 - 1 &= \ln \sqrt{\frac{1}{2\pi\sigma^2}} \\
\lambda_1 &= 1 + \ln \sqrt{\frac{1}{2\pi\sigma^2}}.
\end{aligned}
$$

We have just solved for the values of $\lambda_1$ and $\lambda_2$. We are ready to substitute everything into the expression for $p(x)$:
$$
\begin{aligned}
p(x) &= \exp(\lambda_1 + \lambda_2 (x-\mu)^2 - 1) \\
&= \exp\bigg(\cancel{1} + \ln \sqrt{\frac{1}{2\pi\sigma^2}} - \frac{(x-\mu)^2}{2\sigma^2} - \cancel{1}\bigg) \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\bigg(-\frac{(x-\mu)^2}{2\sigma^2}\bigg),
\end{aligned}
$$

which is the density of a univariate Gaussian distribution whose parameters are $\mu$ and $\sigma^2$.
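The following sketch checks the solved multipliers numerically and then compares the Gaussian's differential entropy with two other distributions of the same variance, using their standard closed-form entropies. The values of `mu` and `sigma`, the integration window of twelve standard deviations, and the choice of Laplace and uniform as competitors are assumptions made for illustration.

```python
import math

mu, sigma = 0.7, 1.5     # hypothetical mean and standard deviation

# Multipliers solved in the derivation above.
lam2 = -1.0 / (2.0 * sigma**2)
lam1 = 1.0 + math.log(math.sqrt(1.0 / (2.0 * math.pi * sigma**2)))

def p(x):
    """MaxEnt density exp(lam1 + lam2 * (x - mu)^2 - 1)."""
    return math.exp(lam1 + lam2 * (x - mu) ** 2 - 1.0)

def integrate(f, lo, hi, steps=200_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h

# The tails beyond 12 standard deviations are negligible.
lo, hi = mu - 12.0 * sigma, mu + 12.0 * sigma
mass = integrate(p, lo, hi)
variance = integrate(lambda x: (x - mu) ** 2 * p(x), lo, hi)

# Closed-form differential entropies at equal variance sigma^2.
h_gaussian = 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)
laplace_scale = sigma / math.sqrt(2.0)    # Laplace variance = 2 * scale^2
h_laplace = 1.0 + math.log(2.0 * laplace_scale)
uniform_width = sigma * math.sqrt(12.0)   # uniform variance = width^2 / 12
h_uniform = math.log(uniform_width)

print(mass, variance, h_gaussian, h_laplace, h_uniform)
```

The density integrates to one and reproduces the target variance, and at equal variance the Gaussian has the largest entropy of the three, as the derivation predicts.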
Uniqueness of Distribution Given Constraints

Now, consider an arbitrary support $\mathcal{X}$ and constraints
$$
\mathbb{E}_{p_X}[\phi_i(X)] = \alpha_i, \quad \forall i \in \{1, \dots, d\}.
$$

We have the following Lagrangian:
$$
\begin{aligned}
\mathcal{L}(p, \lambda, (\gamma_i)_i) &= - \int_\mathcal{X} p(x) \log p(x) \, \text{d}x - \lambda \bigg(\int_\mathcal{X} p(x) \, \text{d}x - 1\bigg) - \sum_{i=1}^d \gamma_i \bigg(\int_\mathcal{X} p(x) \phi_i(x) \, \text{d}x - \alpha_i\bigg).
\end{aligned}
$$

Computing the functional derivative yields
$$
\begin{aligned}
\frac{\delta}{\delta p(x)} \mathcal{L}(p, \lambda, (\gamma_i)_i) &= -1 - \log p(x) - \lambda - \sum_{i=1}^d \gamma_i \phi_i(x) \\
&\overset{!}{=} 0.
\end{aligned}
$$

Thus, we have $p(x) = \exp\big(-\lambda - \sum_{i=1}^d \gamma_i \phi_i(x) - 1\big)$. Let's define
$$
\mathcal{P} := \{\hat{p}_X : \mathbb{E}_{\hat{p}_X}[\phi_i(X)] = \alpha_i, \forall i \in \{1, \dots, d\}\}
$$

as the set of all distributions on $\mathcal{X}$ that satisfy the constraints, and assume the multipliers are chosen so that $p_X \in \mathcal{P}$. Then, for any $\hat{p}_X \in \mathcal{P}$, we have

$$
\begin{aligned}
H(\hat{p}_X) &= - \int_\mathcal{X} \hat{p}(x) \log \hat{p}(x) \, \text{d}x \\
&= - \int_\mathcal{X} \hat{p}(x) \log \bigg(\hat{p}(x) \frac{p(x)}{p(x)}\bigg) \, \text{d}x \\
&= \underbrace{- \int_\mathcal{X} \hat{p}(x) \log \frac{\hat{p}(x)}{p(x)} \, \text{d}x}_{-D_\text{KL}(\hat{p}_X \| p_X)} - \int_\mathcal{X} \hat{p}(x) \log p(x) \, \text{d}x \\
&= - D_\text{KL}(\hat{p}_X \| p_X) - \int_\mathcal{X} \hat{p}(x) \log p(x) \, \text{d}x \\
&\overset{(\star)}{\le} - \int_\mathcal{X} \hat{p}(x) \log p(x) \, \text{d}x \\
&= - \bigg[- \lambda \int_\mathcal{X} \hat{p}(x) \, \text{d}x - \sum_{i=1}^d \bigg(\int_\mathcal{X} \hat{p}(x) \gamma_i \phi_i(x) \, \text{d}x\bigg) - \int_\mathcal{X} \hat{p}(x) \, \text{d}x\bigg] \\
&= - \bigg[- \lambda \int_\mathcal{X} p(x) \, \text{d}x - \sum_{i=1}^d \bigg(\int_\mathcal{X} p(x) \gamma_i \phi_i(x) \, \text{d}x\bigg) - \int_\mathcal{X} p(x) \, \text{d}x\bigg] \\
&= - \bigg[\int_\mathcal{X} p(x) \underbrace{\bigg(- \lambda - \sum_{i=1}^d \gamma_i \phi_i(x) - 1\bigg)}_{= \log p(x)} \, \text{d}x\bigg] \\
&= H(p_X).
\end{aligned}
$$

Step $(\star)$ uses the non-negativity of the KL divergence, and the step that replaces $\hat{p}$ with $p$ holds because both densities integrate to one and have the same expectations $\alpha_i$ of every $\phi_i$. Since $D_\text{KL}(\hat{p}_X \| p_X) = 0$ only when $\hat{p}_X = p_X$ almost everywhere, the maximum-entropy distribution in $\mathcal{P}$ is unique.

Appendix

While writing this article, I have consulted the resources below:
- John Duchi's Lecture Notes on Statistics 311/Electrical Engineering 377
- Sam Finlayson's blog
- Aarti Singh and Min Xu's Lecture Note: Maximum Entropy Distributions and Exponential Family

On a side note, there is also an interesting connection between MaxEnt and maximum likelihood, which I leave for future exploration.