Mixture Models and EM

· 4 min read

How do we model data with a probability distribution? For example, if the data looks roughly elliptical and symmetric, we may model it with a Gaussian distribution:

(Figure: example of data that is well modeled by a single Gaussian.)

With the following density function:

$$\text{Normal}(x \mid \mu, \Sigma) = \frac{\exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)}{(2\pi)^{d/2}\sqrt{\det \Sigma}}$$
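For concreteness, here is a minimal NumPy sketch of this density (the function name and the test point are only illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density Normal(x | mu, Sigma)."""
    d = x.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a linear solve
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Example: standard 2-D Gaussian evaluated at the origin -> 1 / (2*pi)
print(gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2)))
```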

Given i.i.d. data points $x_1, x_2, \ldots, x_n \sim \text{Normal}(x \mid \mu^*, \Sigma^*)$, the maximum likelihood estimator is:

$$\begin{aligned}
\text{MLE}(\mu, \Sigma) &= \arg\max_{\mu, \Sigma} \prod_{i=1}^n \text{Normal}(x_i \mid \mu, \Sigma) \\
&= \arg\min_{\mu, \Sigma} \sum_{i=1}^n -\log \text{Normal}(x_i \mid \mu, \Sigma) \\
&= \arg\min_{\mu, \Sigma} \sum_{i=1}^n \left[ \frac{1}{2}(x_i-\mu)^T \Sigma^{-1}(x_i-\mu) + \frac{1}{2}\log\det\Sigma + \frac{d}{2}\log 2\pi \right]
\end{aligned}$$

The solution is:

$$\mu_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}, \qquad \Sigma_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x_i - \mu_{\text{MLE}})(x_i - \mu_{\text{MLE}})^T$$
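In code, these estimators are just the sample mean and the (biased) sample covariance; a minimal NumPy sketch, with illustrative variable names:

```python
import numpy as np

def gaussian_mle(X):
    """MLE of a Gaussian fit to the rows of X (shape: n x d)."""
    n = X.shape[0]
    mu = X.mean(axis=0)                # sample mean, x-bar
    centered = X - mu
    Sigma = centered.T @ centered / n  # 1/n sum (x_i - mu)(x_i - mu)^T
    return mu, Sigma

# Example on random 2-D data
X = np.random.randn(500, 2) @ np.array([[1.0, 0.5], [0.0, 1.0]])
mu_hat, Sigma_hat = gaussian_mle(X)
```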

But what about data that looks like this?

(Figure: example of multi-modal data.)

Here a single Gaussian is not a good model: the Gaussian density is uni-modal, so it cannot fit multi-modal data. One solution is a mixture of Gaussians:

$$p(x) = \pi_1\,\text{Normal}(x \mid \mu_1, \Sigma_1) + \ldots + \pi_k\,\text{Normal}(x \mid \mu_k, \Sigma_k)$$

The following plot is an example of a Gaussian mixture distribution in one dimension showing three Gaussians (each scaled by a coefficient) in blue and their sum in red.

$$p(x) = \pi_1\,\text{Normal}(x \mid \mu_1, \sigma_1^2) + \pi_2\,\text{Normal}(x \mid \mu_2, \sigma_2^2) + \pi_3\,\text{Normal}(x \mid \mu_3, \sigma_3^2), \qquad \pi_1 + \pi_2 + \pi_3 = 1, \quad \pi_1, \pi_2, \pi_3 \geq 0.$$

(Figure: three scaled Gaussians in blue and their sum in red.)
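A minimal sketch of evaluating such a one-dimensional mixture density (the weights, means, and variances below are made up purely for illustration):

```python
import numpy as np

def normal_pdf(x, mu, var):
    """1-D Gaussian density Normal(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, pis, mus, variances):
    """p(x) = sum_k pi_k * Normal(x | mu_k, sigma_k^2)."""
    return sum(pi * normal_pdf(x, mu, var)
               for pi, mu, var in zip(pis, mus, variances))

xs = np.linspace(-5, 10, 400)
ys = mixture_pdf(xs, pis=[0.3, 0.5, 0.2], mus=[-1.0, 3.0, 7.0],
                 variances=[0.5, 1.0, 2.0])  # illustrative parameters
```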

Mixture of Gaussians with K components

Let’s formalize the mixture of Gaussians model:

$$p(x) = \sum_{k=1}^K \pi_k\,\text{Normal}(x \mid \mu_k, \Sigma_k)$$

Define $z \in \{0,1\}^K$ to encode which component $x$ comes from: $z = (z_1, z_2, \ldots, z_K)$ with $z_k \in \{0,1\}$ and $z_1 + \ldots + z_K = 1$. In other words, $z$ is a one-hot vector:

$$z \in \{e_1, \ldots, e_K\}, \qquad \text{e.g. for } K = 3: \quad e_1 = \begin{pmatrix}1\\0\\0\end{pmatrix}, \; e_2 = \begin{pmatrix}0\\1\\0\end{pmatrix}, \; e_3 = \begin{pmatrix}0\\0\\1\end{pmatrix}$$

Graphical model:

$$p(x, z) = p(z)\,p(x \mid z)$$

1. Define the marginal on $z$:

$$p(z_k = 1) = \pi_k \quad \text{for } 1 \le k \le K$$

where $\pi = (\pi_1, \ldots, \pi_K)$ satisfies $0 \le \pi_k \le 1$ and $\sum_{k=1}^K \pi_k = 1$. Because $z = (z_1, \ldots, z_K) \in \{0,1\}^K$ is one-hot, we can write the following equation:

$$p(z) = \prod_{k=1}^K \pi_k^{z_k}$$

since exactly one $z_k = 1$ and $z_j = 0$ for all $j \neq k$.

2. Define the conditional:

$$p(x \mid z_k = 1) = \text{Normal}(x \mid \mu_k, \Sigma_k), \qquad p(x \mid z) = \prod_{k=1}^K \text{Normal}(x \mid \mu_k, \Sigma_k)^{z_k}$$

3. Define the joint probability:

$$p(x, z) = p(z)\,p(x \mid z) = \prod_{k=1}^K \left(\pi_k\,\text{Normal}(x \mid \mu_k, \Sigma_k)\right)^{z_k}$$

or equivalently $p(x, z_k = 1) = \pi_k\,\text{Normal}(x \mid \mu_k, \Sigma_k)$.

4. The marginal over $x$ is a mixture of Gaussians:

$$p(x) = \sum_z p(x, z) = \sum_{k=1}^K p(x, z_k = 1) = \sum_{k=1}^K \pi_k\,\text{Normal}(x \mid \mu_k, \Sigma_k)$$

Therefore, to sample from a mixture of Gaussians

$$p(x) = \sum_k \pi_k\,\text{Normal}(x \mid \mu_k, \Sigma_k),$$

we can use ancestral sampling: first sample $z \sim p(z)$, i.e. pick a component $k$ with probability $\pi_k$, then sample $x \sim \text{Normal}(\mu_k, \Sigma_k)$.
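A minimal sketch of this two-step (ancestral) sampling procedure, with made-up parameters for a three-component mixture in two dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture parameters (K = 3 components in d = 2 dimensions)
pis = np.array([0.3, 0.5, 0.2])
mus = np.array([[0.0, 0.0], [4.0, 4.0], [-3.0, 5.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)])

def sample_gmm(n):
    """Draw n samples: first z ~ Categorical(pi), then x ~ Normal(mu_z, Sigma_z)."""
    zs = rng.choice(len(pis), size=n, p=pis)  # sample the component index
    xs = np.array([rng.multivariate_normal(mus[z], Sigmas[z]) for z in zs])
    return xs, zs

X, Z = sample_gmm(1000)
```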

If we think of $p(z_k = 1) = \pi_k$ as a prior, then given $x$ we can compute the posterior distribution over components (the responsibility):

$$\gamma(z_k) = p(z_k = 1 \mid x) = \frac{p(z_k = 1)\,p(x \mid z_k = 1)}{\sum_{j=1}^K p(z_j = 1)\,p(x \mid z_j = 1)} = \frac{\pi_k\,\text{Normal}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\,\text{Normal}(x \mid \mu_j, \Sigma_j)}$$
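A minimal sketch of computing these responsibilities for a batch of points, using SciPy's multivariate normal density (all names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """Return gamma with gamma[n, k] = p(z_k = 1 | x_n)."""
    # Unnormalized posterior: pi_k * Normal(x_n | mu_k, Sigma_k)
    weighted = np.stack([pi * multivariate_normal.pdf(X, mean=mu, cov=Sigma)
                         for pi, mu, Sigma in zip(pis, mus, Sigmas)], axis=1)
    return weighted / weighted.sum(axis=1, keepdims=True)
```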

Maximum Likelihood Estimation for Gaussian Mixture

Given observations $x_1, x_2, \ldots, x_N \in \mathbb{R}^d$, we want to model them with a mixture of $K$ Gaussians (for a fixed $K$):

$$p(x) = \sum_k \pi_k\,\text{Normal}(x \mid \mu_k, \Sigma_k)$$

For each $x_n \in \mathbb{R}^d$, define a latent variable $z_n \in \{0,1\}^K$:

(Figure: graphical model for the mixture with latent variables $z_n$ and observations $x_n$.)

The joint probability of $x = (x_1, \ldots, x_N)$ is defined as:

$$p(x) = \prod_{n=1}^N p(x_n) = \prod_{n=1}^N \left( \sum_k \pi_k\,\text{Normal}(x_n \mid \mu_k, \Sigma_k) \right)$$

The MLE chooses $\pi = (\pi_1, \ldots, \pi_K)$, $\mu = (\mu_1, \ldots, \mu_K)$, $\Sigma = (\Sigma_1, \ldots, \Sigma_K)$ to maximize the log-likelihood:

$$\log p(x) = \sum_{n=1}^N \log\left( \sum_k \pi_k\,\text{Normal}(x_n \mid \mu_k, \Sigma_k) \right) = \sum_{n=1}^N \log\left( \sum_k \pi_k\,\frac{\exp\left(-\frac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1}(x_n - \mu_k)\right)}{(2\pi)^{d/2}\sqrt{\det \Sigma_k}} \right)$$
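A minimal sketch of evaluating this log-likelihood, using a log-sum-exp for numerical stability (names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """log p(x) = sum_n log( sum_k pi_k * Normal(x_n | mu_k, Sigma_k) )."""
    # log_probs[n, k] = log pi_k + log Normal(x_n | mu_k, Sigma_k)
    log_probs = np.stack(
        [np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
         for pi, mu, Sigma in zip(pis, mus, Sigmas)], axis=1)
    return logsumexp(log_probs, axis=1).sum()
```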

1. Suppose we fix $\pi = (\pi_1, \ldots, \pi_K)$ and $\Sigma = (\Sigma_1, \ldots, \Sigma_K)$; then we can optimize over $\mu = (\mu_1, \ldots, \mu_K)$ (the full alternating update loop is sketched after this list):

$$\nabla_{\mu_k} \log p(x) = \sum_{n=1}^N \frac{\pi_k\,\text{Normal}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\,\text{Normal}(x_n \mid \mu_j, \Sigma_j)}\,\Sigma_k^{-1}(x_n - \mu_k)$$

where $\frac{\pi_k\,\text{Normal}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\,\text{Normal}(x_n \mid \mu_j, \Sigma_j)} = \gamma(z_{nk}) = p(z_{nk} = 1 \mid x_n)$, so
$$\nabla_{\mu_k} \log p(x) = \Sigma_k^{-1} \sum_{n=1}^N \gamma(z_{nk})(x_n - \mu_k)$$

To find the maximizer, set the gradient to 0: $\mu_k = \frac{1}{N_k}\sum_{n=1}^N \gamma(z_{nk})\,x_n$, which is a weighted average of the data points, where $N_k = \sum_{n=1}^N \gamma(z_{nk})$ is the effective number of data points assigned to cluster $k$.

2. Similarly, if we fix $\mu, \pi$, we can optimize over $\Sigma$:

$$\Sigma_k = \frac{1}{N_k}\sum_{n=1}^N \gamma(z_{nk})(x_n - \mu_k)(x_n - \mu_k)^T$$

3. If we fix $\mu, \Sigma$, we can then optimize over $\pi$:

$$\pi_k = \frac{N_k}{N}$$
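Alternating between recomputing the responsibilities $\gamma(z_{nk})$ (the E-step) and applying the three closed-form updates above (the M-step) gives the EM algorithm for Gaussian mixtures. A minimal sketch, assuming the data is an $N \times d$ array `X` and using an illustrative initialization:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture: alternate responsibilities (E) and updates (M)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Illustrative initialization: uniform weights, random data points as means
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])

    for _ in range(n_iters):
        # E-step: gamma[n, k] = pi_k Normal(x_n | mu_k, Sigma_k) / sum_j (...)
        gamma = np.stack([pi * multivariate_normal.pdf(X, mean=mu, cov=Sigma)
                          for pi, mu, Sigma in zip(pis, mus, Sigmas)], axis=1)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: closed-form updates derived above
        Nk = gamma.sum(axis=0)             # effective number of points per cluster
        mus = (gamma.T @ X) / Nk[:, None]  # weighted means
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = ((gamma[:, k, None] * diff).T @ diff / Nk[k]
                         + 1e-6 * np.eye(d))  # small jitter for numerical stability
        pis = Nk / N

    return pis, mus, Sigmas
```

Since EM only converges to a local maximum of the log-likelihood, in practice one would also monitor the log-likelihood for convergence and try several initializations.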

The following plot illustrates the EM algorithm (Bishop, Fig. 9.8).

We covered this material in Introduction to Machine Learning (CPSC 481/581, Yale University, taught by Andre Wibisono), where I (jointly with Siddharth Mitra) was a TF.