Finding the underlying distribution that matches the data we observe is one of the most important aspect of machine learning. Without any constraint, this task seems to be impossible to solve; therefore, we need to also choose a set of assumptions, called inductive bias, that describe what characteristics of the solution (or hypothesis) that we prefer. One of such biases is Occam's Razor, which states that given two correct hypotheses the simpler one is preferred.
It might sound circulate, but one might ask "How can we determine which hypothesis is simple?". As we are interested in finding a distribution that aligns well with our observational data, we might look at the entropy of the distribution, which measures the uncertainty (or surprise) of the distribution. Given a set of statistics we can extract from the data, one might choose the distribution that has the highest entropy. This is Jaynes (1957)'s Principle of Maximum Entropy (MaxEnt). My interpretation to this principle is that more surprise indicates that we make fewer assumptions or less claims about the data.
Deriving Distributions with MaxEnt
Before diving into the derivations, let's recall the definition of Entropy. Consider a random variable coming from whose domain can be discrete or continuous. We denote to be the probably mass function or probability density function. The definition of Shannon's entropy is:
Given the statistics of the data such as its expectation and variance , our goal is to use MaxEnt to find subject to the following constraints:
- , (this condition would satisfy implicitly.)
Discrete Uniform Distribution
Consider the support of the distribution where and . In this case, we have only one constraint, namely . Denote . Using Lagrange multipliers, we have
Taking partial derivative with respect to , we get
This implies that Furthermore, the second partial derivative respect to is ; therefore, this is the maximum. Similarly, we can do the same for :
Noting that . Therefore, we get
Hence, we have
which is the probability mass function of the discrete uniform distribution.
Continuous Uniform Distribution
The derivation of this part is quite similar to the one above, expect that . We also need to rely on the continuous version of entropy:
called differential entropy. Noting here unlike the discrete version, differential entropy can be negative.
With this, we can set up Lagrangian as follows:
Taking the functional derivative we get
This yields . We can also see that . Therefore, we have
We get , which is the density of the continuous uniform distribution.
Univariate Gaussian Distribution
Now, let's consider that and we want to find such that and . These two constraints can be rewritten as . Therefore, we have the following Lagrangian:
Taking the functional derivative, we get
Solving the equation leads to and the second derivative is also negative. Based on the normalization constraint, we know that
where the last step is the use of (See Appendix XX for the derivation of the identity). Rearranging the equation, we arrive at
Now, we consider the second constraint:
Therefore, we get . Plugging this back to the identiy of yields
We have just solved the values of and . We are ready to bring everything to the equation of :
which is the density of a univariate Gaussian distribution whose parameters are and .
Uniqueness of Distribution Given Constrainsts
Now, consider an arbitrary support and constraints
We have the following Lagrangian
Computing the functional derivative yields
Thus, we have Let's define
While writing this article, I have consulted the resources below:
- John Duchi's Lecture Notes on Statistics 311/Electrical Engineering 377
- Sam Finlayson's blog
- Aarti Singh and Min Xu's Lecture Note: Maximum Entropy Distributions and Exponential Family.
On a side note, there are also an interesting connection between MaxEnt and maximum likelihood, which I leave for exploring in the future.