Exponential Family of Distributions

May 15, 2020

The exponential family is a large class of probability distributions, both discrete and continuous, that includes the Gaussian and Bernoulli distributions. As the name suggests, distributions in this family share a generic exponential form.

Consider a random variable $X$ from an exponential family distribution. Its probability mass function (if $X$ is discrete) or probability density function (if it is continuous) is written as

$$p(x|\theta) = f(x) \exp( \eta(\theta) \cdot \phi(x) + g(\theta)),$$


where

  • $\phi(x)$ is the sufficient statistic(s) of $X$;
  • $\eta(\theta)$ is the natural parameter(s) of the distribution;
    • $\theta$ is the parameter(s) of the distribution;
  • $g(\theta)$ is the negative log-partition function, which acts as a normalizer;
  • $f(x)$ is a function that depends only on $x$.
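
Because the density integrates to one, dividing it by its own integral changes nothing, and the factor $e^{g(\theta)}$ cancels:

$$\begin{aligned} p(x|\theta) &= \frac{p(x|\theta)}{\int p(x'|\theta)\, dx'} \\ &= \frac{\cancel{e^{g(\theta)}} f(x) \exp(\eta(\theta)\cdot \phi(x))}{\cancel{e^{g(\theta)}} \int f(x') \exp(\eta(\theta)\cdot \phi(x'))\, dx'}. \end{aligned}$$

Hence $g(\theta) = -\log \int f(x) \exp(\eta(\theta)\cdot \phi(x))\, dx$, i.e. $g$ is fully determined by $f$, $\phi$ and $\eta$.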
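
The normalization argument can be checked numerically on a small discrete case, where the integral becomes a sum. The following is a minimal sketch, not a reference implementation: the helper names `neg_log_partition` and `density`, the choice $f(x) = 1$, $\phi(x) = x$, the support $\{0, 1\}$ and the value of `eta` are all assumptions made for illustration.

```python
import math

def neg_log_partition(f, phi, eta, support):
    """Return g = -log sum_x f(x) * exp(eta * phi(x)) over a finite support."""
    total = sum(f(x) * math.exp(eta * phi(x)) for x in support)
    return -math.log(total)

def density(x, f, phi, eta, support):
    """Evaluate f(x) * exp(eta * phi(x) + g), with g computed from the support."""
    g = neg_log_partition(f, phi, eta, support)
    return f(x) * math.exp(eta * phi(x) + g)

def f(x):
    # Assumed base function: constant one.
    return 1.0

def phi(x):
    # Assumed sufficient statistic: the identity.
    return x

support = [0, 1]   # assumed finite sample space
eta = 0.7          # arbitrary natural parameter for the demo
probs = [density(x, f, phi, eta, support) for x in support]
total_mass = sum(probs)  # should be 1 up to floating-point error
```

Whatever value of `eta` is chosen, `total_mass` stays at one, which is exactly the job of $g(\theta)$.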

Some Distributions in Exponential Family

Bernoulli Distribution

The Bernoulli distribution has one parameter, $p \in [0, 1]$. Its sample space is $\Omega = \{0, 1\}$, e.g. a coin toss. Its probability mass function is usually written in the following form:

$$p(x|p) = p^x (1-p)^{1-x}.$$

We can rewrite the above equation using the exponential-logarithm trick:

$$\begin{aligned} p(x|p) &= \exp(\log p^x + \log (1-p)^{(1-x)}) \\ &= \exp(x\log p + (1-x)\log (1-p)) \\ &= \exp(x\log p - x\log (1-p) + \log (1-p)) \\ &= \exp\bigg(x\log \frac{p}{1-p} + \log (1-p)\bigg). \end{aligned}$$

So we can conclude the following:

  • $f(x) = 1$;
  • $\phi(x) = x$;
  • $\eta(p) = \log \bigg( \frac{p}{1-p} \bigg)$;
  • $g(p) = \log (1-p)$.
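
As a quick sanity check of this decomposition, the sketch below compares the usual Bernoulli mass function with its exponential-family form. The function names are hypothetical, and the check assumes $0 < p < 1$ so that the logarithms are finite.

```python
import math

def bernoulli_pmf(x, p):
    """Standard form: p^x * (1 - p)^(1 - x)."""
    return p**x * (1 - p)**(1 - x)

def bernoulli_expfam(x, p):
    """Exponential-family form with f(x) = 1, phi(x) = x."""
    eta = math.log(p / (1 - p))   # natural parameter (log-odds)
    g = math.log(1 - p)           # normalizing term
    return math.exp(eta * x + g)

p = 0.3  # arbitrary value in (0, 1) for the demo
max_gap = max(abs(bernoulli_pmf(x, p) - bernoulli_expfam(x, p)) for x in (0, 1))
```

Here `max_gap` should be zero up to floating-point error.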

Gaussian Distribution

Let's turn to an exponential family distribution for continuous random variables. The most important one is the Gaussian distribution. In the univariate setting, i.e. $x \in \mathbb{R}$, the density is

$$\begin{aligned} p(x| \mu, \sigma^2) &= \frac{1}{\sqrt{2\pi\sigma^2}}\exp \bigg(-\frac{(x-\mu)^2}{2\sigma^2}\bigg) \\ &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp \bigg(-\frac{(x^2 -2x\mu+\mu^2)}{2\sigma^2}\bigg) \\ &= \frac{1}{\sqrt{2\pi}} \exp \bigg( \frac{x\mu}{\sigma^2} -\frac{x^2}{2\sigma^2} - \frac{\mu^2}{2\sigma^2} - \log \sigma \bigg). \end{aligned}$$


Matching terms with the general form, we get

  • $f(x) = \frac{1}{\sqrt{2\pi}}$;
  • $\phi(x) = (x, x^2)^T$;
  • $\eta(\theta) = (\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2})^T$;
  • $g(\theta) = - \frac{\mu^2}{2\sigma^2} - \log \sigma$.
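
The same kind of check works for the Gaussian. The sketch below assumes the natural parameters $\eta = (\mu/\sigma^2, -1/(2\sigma^2))^T$, which is what the coefficients of $x$ and $x^2$ in the expanded density give; the function names and test values are illustrative only.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Standard univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gaussian_expfam(x, mu, sigma2):
    """Exponential-family form f(x) * exp(eta . phi(x) + g)."""
    f = 1 / math.sqrt(2 * math.pi)
    phi = (x, x * x)                                    # sufficient statistics
    eta = (mu / sigma2, -1 / (2 * sigma2))              # natural parameters
    g = -mu ** 2 / (2 * sigma2) - 0.5 * math.log(sigma2)  # log(sigma) = 0.5 * log(sigma2)
    return f * math.exp(eta[0] * phi[0] + eta[1] * phi[1] + g)

mu, sigma2 = 1.5, 0.8  # arbitrary parameters for the demo
max_gap = max(
    abs(gaussian_pdf(x, mu, sigma2) - gaussian_expfam(x, mu, sigma2))
    for x in (-2.0, 0.0, 1.5, 3.0)
)
```

The two forms should agree to floating-point precision at every test point.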

Cumulant: Moment Generating Function

Let $\eta = \eta(\theta)$ and define the cumulant function $A(\eta) \equiv -g(\theta)$. In the following, we show that the moments of the Bernoulli and Gaussian distributions can be obtained by differentiating $A(\eta)$.

Bernoulli Distribution

Recall that $g(\theta) = \log(1-p)$ for the Bernoulli distribution. We have

$$A(\eta) = -\log (1-p).$$

Inverting the natural parameter $\eta = \log \frac{p}{1-p}$ to express $p$ in terms of $\eta$ yields

$$p = \frac{1}{1+e^{-\eta}} \implies A(\eta) = \log(1+e^\eta).$$

Taking the first and second derivatives, we have

$$\begin{aligned} A'(\eta) &= \frac{1}{1+e^{-\eta}} = p \\ A''(\eta) &= \underbrace{\bigg(\frac{1}{1+e^{-\eta}}\bigg)}_{p} \underbrace{\bigg( \frac{e^{-\eta}}{1+e^{-\eta}} \bigg)}_{1-p}. \end{aligned}$$

Therefore, we recover the mean $p$ and the variance $p(1-p)$ of the Bernoulli distribution.
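
These two derivatives can also be approximated numerically. The sketch below uses central finite differences on $A(\eta) = \log(1+e^\eta)$; the step size `h` and the helper names are assumptions, and the estimates are only accurate up to discretization and rounding error.

```python
import math

def A(eta):
    """Bernoulli cumulant function log(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def first_derivative(fn, x, h=1e-4):
    """Central finite-difference estimate of fn'(x)."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

def second_derivative(fn, x, h=1e-4):
    """Central finite-difference estimate of fn''(x)."""
    return (fn(x + h) - 2 * fn(x) + fn(x - h)) / (h * h)

p = 0.3                          # arbitrary value in (0, 1) for the demo
eta = math.log(p / (1 - p))      # matching natural parameter
mean_est = first_derivative(A, eta)    # approximates p
var_est = second_derivative(A, eta)    # approximates p * (1 - p)
```

With these settings, `mean_est` and `var_est` should land close to $0.3$ and $0.21$ respectively.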

You might notice that the function transforming $\eta$ to $p$ looks familiar; indeed, it is the sigmoid function! In generalized linear models, it is the inverse of the canonical (logit) link function.
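
To make the relationship between $\eta$ and $p$ concrete, here is a small sketch showing that the log-odds map and the sigmoid undo each other. The function names `logit` and `sigmoid` are chosen here for illustration, and the inputs are assumed to lie strictly between 0 and 1.

```python
import math

def logit(p):
    """Map the mean parameter p to the natural parameter eta (log-odds)."""
    return math.log(p / (1 - p))

def sigmoid(eta):
    """Map the natural parameter eta back to the mean parameter p."""
    return 1 / (1 + math.exp(-eta))

inputs = (0.1, 0.5, 0.9)
round_trip = [sigmoid(logit(p)) for p in inputs]  # should reproduce the inputs
```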

Gaussian Distribution

Recall $g(\theta)$ of the Gaussian distribution. Let $(\eta_1, \eta_2)^T \equiv \eta(\theta)$ and $A(\eta_1, \eta_2) = -g(\theta)$. Inverting $\eta_1 = \frac{\mu}{\sigma^2}$ and $\eta_2 = -\frac{1}{2\sigma^2}$ gives $\mu = -\frac{\eta_1}{2\eta_2}$ and $\sigma^2 = -\frac{1}{2\eta_2}$. Substituting these into $-g(\theta)$, we have

$$A(\eta_1, \eta_2) = -\frac{\eta^2_1}{4\eta_2} - \frac{1}{2}\log(-2\eta_2).$$

We know that $\eta_1$ corresponds to $\phi(x)_1$, i.e. $x$. If we compute the partial derivatives $\frac{\partial}{\partial \eta_1} A(\eta)$ and $\frac{\partial^2}{\partial \eta_1^2} A(\eta)$, we get

$$\begin{aligned} \frac{\partial}{\partial \eta_1}A(\eta_1, \eta_2) &= -\frac{\eta_1}{2\eta_2} \\ &= \mu \\ \frac{\partial^2}{\partial \eta_1^2}A(\eta_1, \eta_2) &= -\frac{1}{2\eta_2} \\ &= \sigma^2. \end{aligned}$$

That means we recover the mean (first moment) and variance (second central moment) of the Gaussian distribution of $X$ by differentiating its cumulant function $A(\cdot)$.
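
The Gaussian case can be verified numerically in the same spirit. The sketch below assumes $\eta_1 = \mu/\sigma^2$, $\eta_2 = -1/(2\sigma^2)$ and the cumulant function $A(\eta_1, \eta_2) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)$, and differentiates it with respect to $\eta_1$ by central finite differences; the parameter values and step size are arbitrary choices for the demo.

```python
import math

def A(eta1, eta2):
    """Gaussian cumulant function; requires eta2 < 0."""
    return -eta1 ** 2 / (4 * eta2) - 0.5 * math.log(-2 * eta2)

mu, sigma2 = 1.5, 0.8                       # arbitrary parameters for the demo
eta1, eta2 = mu / sigma2, -1 / (2 * sigma2)  # matching natural parameters

h = 1e-4  # assumed finite-difference step
# First partial derivative in eta1: approximates the mean mu.
dA = (A(eta1 + h, eta2) - A(eta1 - h, eta2)) / (2 * h)
# Second partial derivative in eta1: approximates the variance sigma2.
d2A = (A(eta1 + h, eta2) - 2 * A(eta1, eta2) + A(eta1 - h, eta2)) / (h * h)
```

Since $A$ is quadratic in $\eta_1$, the finite differences are exact up to rounding, so `dA` and `d2A` should match `mu` and `sigma2` closely.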


While writing this article, I relied on Prof. M. Opper & Théo's lecture slides for the Probabilistic Bayesian Modelling course (Summer 2020) and Prof. M. Jordan's reading material for his Bayesian Modeling and Inference (2010).

The first figure was made with Google Colab.