HomePublicationsProjectsBlogAbout

Principle of Maximum Entropy

December 27, 2020

Finding the underlying distribution that matches the data we observe is one of the most important aspect of machine learning. Without any constraint, this task seems to be impossible to solve; therefore, we need to also choose a set of assumptions, called inductive bias, that describe what characteristics of the solution (or hypothesis) that we prefer. One of such biases is Occam's Razor, which states that given two correct hypotheses the simpler one is preferred.

It might sound circulate, but one might ask "How can we determine which hypothesis is simple?". As we are interested in finding a distribution that aligns well with our observational data, we might look at the entropy of the distribution, which measures the uncertainty (or surprise) of the distribution. Given a set of statistics we can extract from the data, one might choose the distribution that has the highest entropy. This is Jaynes (1957)'s Principle of Maximum Entropy (MaxEnt). My interpretation to this principle is that more surprise indicates that we make fewer assumptions or less claims about the data.

Deriving Distributions with MaxEnt

Before diving into the derivations, let's recall the definition of Entropy. Consider XX a random variable coming from PXP_X whose domain X\mathcal{X} can be discrete or continuous. We denote pXp_X to be the probably mass function or probability density function. The definition of Shannon's entropy is:

H(PX):=xXpX(x)logpX(x).H(P_X) : = - \sum_{x \in \mathcal X} p_X(x) \log p_X(x).

Given the statistics of the data such as its expectation μ\mu and variance σ2\sigma^2, our goal is to use MaxEnt to find pXp_X subject to the following constraints:

  • xXpX(x)=1\sum_{x \in \mathcal{X}} p_X(x) = 1
  • xX\forall x \in \mathcal{X}, pX(x)0p_X(x) \ge 0 (this condition would satisfy implicitly.)
  • EpX[X]=μE_{p_X}[X] = \mu
  • Var(X)=EpX[X2]EpX[X]2=σ2\text{Var}(X) = E_{p_X}[X^2] - E_{p_X}[X]^2 = \sigma^2.

Discrete Uniform Distribution

Consider the support of the distribution X={a,a+1,,b1,b}\mathcal{X} = \{a, a+1, \dots, b-1, b \} where a,bZa, b \in \mathbb{Z} and b>ab > a. In this case, we have only one constraint, namely xXpX(x)=1\sum_{x \in \mathcal X} p_X(x) = 1. Denote px:=pX(x),xXp_{x} := p_X(x), \forall x \in \mathcal X. Using Lagrange multipliers, we have

L({px},λ)=(xpxlogpx)λ(xpx1).\mathcal{L}(\{p_{x}\}, \lambda) = \bigg(- \sum_x p_{x} \log p_{x}\bigg) - \lambda \bigg(\sum_x p_{x} - 1\bigg).

Taking partial derivative with respect to pxp_x, we get

L({px},λ)px=1logpxλ=!0.\begin{aligned} \frac{\partial \mathcal{L}(\{p_{x}\}, \lambda)}{\partial p_x} &= - 1 - \log p_x - \lambda \\ &\overset{!}{=} 0. \end{aligned}

This implies that px=exp(1λ)0.p_x = \exp(-1 - \lambda) \ge 0. Furthermore, the second partial derivative respect to pxp_x is 1px- \frac{1}{p_x}; therefore, this is the maximum. Similarly, we can do the same for λ\lambda:

L({px},λ)λ=(xpx1)=!0.\begin{aligned} \frac{\partial \mathcal{L}(\{p_{x}\}, \lambda)}{\partial \lambda } &= - \bigg(\sum_x p_{x} - 1\bigg)\\ &\overset{!}{=} 0. \end{aligned}

Noting that X=ba+1|\mathcal X| = b - a + 1. Therefore, we get

xpx=1xexp(1λ)=1exp(1λ)x1=1exp(1λ)(ba+1)=1exp(1λ)=1ba+1λ=log(1ba+1)1.\begin{aligned} \sum_x p_x &= 1 \\ \sum_x \exp(-1-\lambda) &= 1 \\ \exp(-1-\lambda) \sum_x 1 &= 1 \\ \exp(-1-\lambda) (b-a + 1) &= 1 \\ \\ \exp(-1-\lambda) &= \frac{1}{b-a + 1} \\ \lambda &= - \log\bigg( \frac{1}{b-a + 1}\bigg) - 1 . \end{aligned}

Hence, xX\forall x \in \mathcal X we have

px=exp(1λ)=exp(1+log(1ba+1)+1)=1ba+1,\begin{aligned} p_x &= \exp(-1-\lambda) \\ &= \exp\bigg(\cancel{-1} +\log\bigg(\frac{1}{b- a + 1} \bigg) + \cancel{1} \bigg) \\ &= \frac{1}{b-a + 1}, \end{aligned}

which is the probability mass function of the discrete uniform distribution.

Continuous Uniform Distribution

The derivation of this part is quite similar to the one above, expect that X=[a,b]\mathcal{X} = [a, b]. We also need to rely on the continuous version of entropy:

H(PX)=abp(x)logp(x)dx,H(P_X) = - \int_a^b p(x) \log p(x) \text{d}{x},

called differential entropy. Noting here unlike the discrete version, differential entropy can be negative.

With this, we can set up Lagrangian as follows:

L(p(x),λ)=abp(x)logp(x)dxλ(abp(x)dx1).\mathcal{L}(p(x), \lambda) = - \int_a^b p(x) \log p(x) \text{d}x -\lambda \bigg( \int_a^b p(x) \text{d}x - 1\bigg).

Taking the functional derivative we get

δδp(x)L(p(x),λ1)=δδp(x)(p(x)logp(x)dxλabp(x)dx)=p(x)p(x)logp(x)λ=1logp(x)λ=!0.\begin{aligned} \frac{\delta }{\delta p(x) } \mathcal{L}(p(x), \lambda_1) &= \frac{\delta }{\delta p(x) } \bigg( - \int p(x) \log p(x) \text{dx} - \lambda \int_a^b p(x) \text{d}x\bigg) \\ &= - \frac{p(x)}{p(x)} - \log p(x) - \lambda \\ &= - 1 - \log p(x) - \lambda \\ &\overset{!}{=} 0. \end{aligned}

This yields p(x)=exp(1λ)0p(x) = \exp(-1 - \lambda) \ge 0. We can also see that δ2Lδp(x)2=1p(x)0\frac{\delta^2 \mathcal L }{\delta p(x)^2} = - \frac{1}{p(x)} \le 0. Therefore, we have

1=p(x)dx=exp(1λ)dx=exp(1λ)(ba).\begin{aligned} 1 & = \int p(x) \text{d}x \\&= \int \exp(-1 - \lambda) \text{d}x \\ &= \exp(-1 - \lambda) (b - a). \end{aligned}

We get p(x)=1bap(x) = \frac{1}{b-a} , which is the density of the continuous uniform distribution.

Univariate Gaussian Distribution

Now, let's consider that X=(,)\mathcal{X} = (-\infty, \infty) and we want to find p(x)p(x) such that Ep[X]=μ\mathbb{E}_p[X] = \mu and Var(X)=σ2\text{Var}(X) = \sigma^2. These two constraints can be rewritten as p(x)(xμ)2dx=σ2\int p(x) (x - \mu)^2 \text{dx} = \sigma^2. Therefore, we have the following Lagrangian:

L(p(x),λ1,λ2)=Xp(x)logp(x)dx     λ1(Xp(x)dx1)     λ2(X(xμ)2p(x)dxσ2).\begin{aligned} \mathcal{L}(p(x), \lambda_1, \lambda_2) &= - \int_\mathcal{X} p(x) \log p(x) \text{d}x \\& \ \ \ \ \ -\lambda_1 \bigg( \int_\mathcal{X} p(x) \text{d}x - 1\bigg) \\ &\ \ \ \ \ - \lambda_2\bigg(\int_\mathcal{X} (x - \mu)^2 p(x) \text{d}x - \sigma^2 \bigg). \end{aligned}

Taking the functional derivative, we get

δδp(x)L(p(),λ1,λ2)=1lnp(x)λ1λ2(xμ)2=!0.\begin{aligned} \frac{\delta}{\delta p(x)} \mathcal{L} (p(\cdot), \lambda_1, \lambda_2) &= - 1 - \ln p(x) - \lambda_1 - \lambda_2 (x-\mu)^2 \\ &\overset{!}{=} 0. \end{aligned}

Solving the equation leads to p(x)=exp(λ1+λ2(xμ)21)0p(x) = \exp(\lambda_1 + \lambda_2(x-\mu)^2 - 1) \ge 0 and the second derivative is also negative. Based on the normalization constraint, we know that

1=p(x)dx=exp(λ1λ2(xμ)21)dx=exp(λ11)exp(λ2(xμz)2)dx=exp(λ11)exp(λ2z2)dz=exp(λ11)πλ2,\begin{aligned} 1 &= \int p(x) \text{dx} \\ &= \int \exp(\lambda_1 - \lambda_2(x-\mu)^2 - 1) \text{dx} \\ &= \exp(\lambda_1 - 1) \int \exp(\lambda_2(\underbrace{x-\mu}_{\triangleq z})^2) \text{dx} \\ &= \exp(\lambda_1 - 1) \int \exp(\lambda_2z^2) \text{dz} \\ &= \exp(\lambda_1 - 1) \sqrt{\frac{\pi}{-\lambda_2}}, \end{aligned}

where the last step is the use of z2nexp(αz2)dz=πα(2n1)!(2α)n\int z^{2n}\text{exp}(-\alpha z^2) \text{dz} = \sqrt{\frac{{\pi}}{\alpha}} \frac{(2n-1)!}{(2\alpha)^n} (See Appendix XX for the derivation of the identity). Rearranging the equation, we arrive at

exp(λ11)=λ2π.\exp(\lambda_1 - 1) = \sqrt\frac{-\lambda_2}{\pi}.

Now, we consider the second constraint:

σ2=(xμ)2p(x)dx=(xu)2exp(λ1+λ2(xu)21)dx=z2exp(λ1+λ2z21)dz=exp(λ11)z2exp(λ2z2)dz=exp(λ11)(12λ2πλ2 ).\begin{aligned} \sigma^2 &= \int (x-\mu)^2 p(x) \text{dx} \\ &= \int (x-u)^2 \exp(\lambda_1 + \lambda_2 (x-u)^2 - 1) \text{dx} \\ &= \int z^2 \exp(\lambda_1 + \lambda_2 z^2 - 1) \text{dz} \\ &= \exp(\lambda_1 - 1) \int z^2 \exp(\lambda_2 z^2) \text{dz} \\ &= \cancel{\exp(\lambda_1 - 1)} \bigg( \frac{1}{-2\lambda_2}\cancel{\sqrt \frac{\pi}{-\lambda_2}} \ \bigg). \end{aligned}

Therefore, we get λ2=12σ2\lambda_2 = - \frac{1}{2\sigma^2}. Plugging this back to the identiy of exp(λ11)\exp(\lambda_1 - 1) yields

exp(λ11)=12πσ2λ11=ln12πσ2λ1=1+ln12πσ2.\begin{aligned} \exp(\lambda_1 - 1) &= \sqrt{\frac{1}{2 \pi \sigma^2 }} \\ \lambda_1 - 1 &= \ln \sqrt{\frac{1}{2 \pi \sigma^2 }} \\ \lambda_1 &= 1 + \ln \sqrt{\frac{1}{2 \pi \sigma^2 }}. \end{aligned}

We have just solved the values of λ1\lambda_1 and λ2\lambda_2. We are ready to bring everything to the equation of p(x)p(x):

p(x)=exp(λ1+λ2(xμ)21)=exp(1+ln12πσ2(xμ)22σ21)=12πσ2exp((xμ)2σ2),\begin{aligned} p(x) &= \exp(\lambda_1 + \lambda_2 (x - \mu)^2 -1) \\ &= \exp \bigg( \cancel{1} + \ln \sqrt\frac{1}{2\pi\sigma^2} - \frac{(x - \mu)^2}{2\sigma^2} -\cancel{1} \bigg) \\ &= \frac{1}{2\pi\sigma^2} \exp\bigg(- \frac{(x-\mu)^2}{\sigma^2} \bigg), \end{aligned}

which is the density of a univariate Gaussian distribution whose parameters are μ\mu and σ2\sigma^2.

Uniqueness of Distribution Given Constrainsts

Now, consider an arbitrary support X\mathcal{X} and constraints EpX[ϕi(X)]=αi,i{1,,d}.\mathbb{E}_{p_X}[\phi_i(X)] = \alpha_i, \forall i \in \{1, \dots, d \}. We have the following Lagrangian

L(p,λ,(γi)i)=Xp(x)logp(x)dxλXp(x)dxi=1dγi(Xp(x)ϕi(x)dxαi)\begin{aligned} \mathcal{L}(p, \lambda, (\gamma_i)_i) &= - \int_\mathcal{X}p(x) \log p(x) \text{d}x - \lambda \int_\mathcal{X} p(x) \text{d}x - \sum_{i=1}^d \gamma_i \bigg( \int_\mathcal{X} p(x)\phi_i (x) \text{d}x - \alpha_i\bigg) \end{aligned}

Computing the functional derivative yields

δδp(x)L(p,λ,(γi)i)=1p(x)λi=1dγiϕi(x)=!0.\begin{aligned} \frac{\delta}{\delta p(x)} \mathcal{L}(p, \lambda, (\gamma_i)_i) &= -1 - p(x) - \lambda - \sum_{i=1}^d \gamma_i \phi_i(x) \\ &\overset{!}{=} 0. \end{aligned}

Thus, we have p(x)=exp(λi=1dγiϕi(x)1).p(x) = \exp(- \lambda - \sum_{i=1}^d \gamma_i \phi_i(x) - 1). Let's define

P:={p^X:Ep^X[ϕi]=αi,i{1,,d}}\mathcal{P} := \{ \hat{p}_X: \mathbb{E}_{\hat p_X}[\phi_i] = \alpha_i, \forall i \in \{ 1, \dots, d \} \} H(p^X)=Xp^(x)logp^(x)dx=Xp^(x)logp^(x)p(x)p(x)dx=Xp^(x)logp^(x)p(x)dxDKL(p^XpX)Xp^(x)logp(x)dx=DKL(p^XpX)Xp^(x)logp(x)dx()Xp^(x)logp(x)dx=[λXp^(x)dxi=1d(Xp^(x)γiϕi(x)dx)Xp^(x)]=[λXp(x)dxi=1d(Xp(x)γiϕi(x)dx)Xp(x)]=[Xp(x)(λi=1dγiϕi(x)1)=log(p(x))dx]=H(pX).\begin{aligned} H(\hat{p}_X) &= - \int_\mathcal{X} \hat{p}(x) \log \hat p(x) \text{d}x \\ &=- \int_\mathcal{X} \hat{p}(x) \log \hat p(x) \frac{p(x)}{p(x)} \text{d}x \\ &=\underbrace{\int_\mathcal{X} \hat{p}(x) \log \frac{\hat{p}(x)}{p(x)} \text{d}x}_{-D_\text{KL}(\hat p_X \| p_X)} - \int_\mathcal{X} \hat{p}(x)\log p(x) \text{d}x \\&= D_\text{KL}(\hat p_X \| p_X) - \int_\mathcal{X} \hat{p}(x)\log p(x) \text{d}x \\&\overset{(\star)}{\le} -\int_\mathcal{X} \hat{p}(x)\log p(x) \text{d}x \\&= - \bigg[ - \lambda \int_\mathcal{X} \hat{p}(x) \text{d}x - \sum_{i=1}^d \bigg( \int_\mathcal{X} \hat{p}(x) \gamma_i \phi_i(x) \text{d}x \bigg) - \int_\mathcal{X} \hat{p}(x) \bigg ] \\&= - \bigg [-\lambda \int_\mathcal{X} p(x) \text{d}x - \sum_{i=1}^d \bigg( \int_\mathcal{X} p (x) \gamma_i \phi_i(x) \text{d}x \bigg) - \int_\mathcal{X} p(x) \bigg] \\&= - \bigg[ \int_\mathcal{X} p(x) \underbrace{\bigg(-\lambda - \sum_{i=1}^d \gamma_i \phi_i (x) - 1 \bigg)}_{=\log(p(x))} \text{d}x \bigg ]\\ &= H(p_X). \end{aligned}

Appendix

While writing this article, I have consulted the resources below:

  1. John Duchi's Lecture Notes on Statistics 311/Electrical Engineering 377
  2. Sam Finlayson's blog
  3. Aarti Singh and Min Xu's Lecture Note: Maximum Entropy Distributions and Exponential Family.

On a side note, there are also an interesting connection between MaxEnt and maximum likelihood, which I leave for exploring in the future.