Gaussian Processes


Bayesian Inference

In Bayesian inference, the parameter $\theta$ is viewed as a random variable. Usually, the following steps are taken:

  1. Choose a generative model $p(x \mid \theta)$ for the data
  2. Choose a prior distribution $\pi(\theta)$
  3. After observing data points $\{x_1, x_2, \ldots, x_n\}$, compute the posterior distribution $p(\theta \mid x_1, \ldots, x_n)$

Bayes’ Theorem

A simple consequence of conditional probability:

$$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Using this theorem, we can write down the posterior distribution:

$$P(\theta \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid \theta)\,\pi(\theta)}{P(x_1, \ldots, x_n)} = \frac{L_n(\theta)\,\pi(\theta)}{c_n} \propto L_n(\theta)\,\pi(\theta)$$

where $L_n(\theta)$ is the likelihood function and

$$c_n = \int L_n(\theta)\,\pi(\theta)\,d\theta$$

is the normalizing constant, or evidence.
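To make this concrete, here is a minimal sketch (not from the lecture) that computes the evidence $c_n$ by numerical integration for a toy one-dimensional model; the normal likelihood, normal prior, and data values are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import integrate, stats

# Toy model (hypothetical): x_i ~ Normal(theta, 1) with prior theta ~ Normal(0, 1)
data = np.array([0.8, 1.3, 0.4])            # made-up observations

def likelihood(theta):
    # L_n(theta) = prod_i p(x_i | theta)
    return np.prod(stats.norm.pdf(data, loc=theta, scale=1.0))

def prior(theta):
    return stats.norm.pdf(theta, loc=0.0, scale=1.0)

# Evidence c_n = integral of L_n(theta) * pi(theta) d(theta)
c_n, _ = integrate.quad(lambda t: likelihood(t) * prior(t), -np.inf, np.inf)

# The posterior density at any theta is L_n(theta) * pi(theta) / c_n
posterior = lambda t: likelihood(t) * prior(t) / c_n
print(c_n, posterior(1.0))
```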

Example 1:

Take the model $X \sim \mathrm{Bernoulli}(\theta)$. Flipping a coin (i.e., $X \in \{0, 1\}$) is an example of this distribution. A natural prior over the parameter $\theta$ is the $\mathrm{Beta}(\alpha, \beta)$ distribution:

$$\pi_{\alpha,\beta}(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$$

$\alpha$ and $\beta$ determine the shape of the prior distribution. A convenient property of this model is that the posterior distribution is also a Beta distribution.

Observing data:

Let $s = \sum_{i=1}^{n} x_i$ be the number of heads over $n$ trials. The posterior distribution $\theta \mid x_1, \ldots, x_n$ is another Beta distribution:

$$\tilde{\alpha} = \alpha + \text{number of heads} = \alpha + s, \qquad \tilde{\beta} = \beta + \text{number of tails} = \beta + n - s$$

The Beta prior is conjugate to the Bernoulli likelihood. We will see in this post that the Gaussian process is similar in that the posterior distribution has the same form as the prior distribution.
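A minimal sketch of the conjugate update above, on simulated coin flips; the prior hyperparameters $\alpha = \beta = 2$ and the true bias $0.7$ are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta = 2.0, 2.0                      # arbitrary Beta prior hyperparameters
x = rng.binomial(1, 0.7, size=50)           # simulated coin flips with true theta = 0.7

s = x.sum()                                 # number of heads
n = x.size

# Conjugate update: posterior is Beta(alpha + s, beta + n - s)
alpha_post = alpha + s
beta_post = beta + n - s

posterior = stats.beta(alpha_post, beta_post)
print("posterior mean:", posterior.mean())                           # close to the true theta
print("MAP estimate:", (alpha_post - 1) / (alpha_post + beta_post - 2))
```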

$\mathrm{Dirichlet}_{\alpha}(\theta)$

A generalization of the coin problem is the die-rolling trial:

$$\mathrm{Dirichlet}_{\alpha}(\theta) \propto \theta_1^{\alpha_1 - 1}\,\theta_2^{\alpha_2 - 1}\cdots\theta_K^{\alpha_K - 1}$$

where $\alpha = (\alpha_1, \ldots, \alpha_K) \in \mathbb{R}_+^K$ is a non-negative vector. For example, for a regular die we have six parameters; the probability of rolling a 1 is $\theta_1$, which enters the density through the factor $\theta_1^{\alpha_1 - 1}$, and so on.

Example

Indeed, as we see more and more data, the Dirichlet posterior concentrates around the true parameters, and its variance shrinks toward 0. We will see that the peak of the posterior, the maximum a posteriori (MAP) estimate, approaches the empirical frequencies of the data.
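A rough sketch of this concentration effect, assuming a uniform Dirichlet prior and a made-up die: as $n$ grows, the posterior variance shrinks and the MAP estimate approaches the empirical frequencies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_p = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.2])   # assumed die probabilities
alpha = np.ones(6)                                   # uniform Dirichlet prior

for n in (10, 100, 10_000):
    counts = rng.multinomial(n, true_p)
    alpha_post = alpha + counts                      # conjugate Dirichlet update
    # MAP estimate (alpha_k - 1) / sum(alpha - 1): approaches the empirical frequencies
    map_est = (alpha_post - 1) / (alpha_post - 1).sum()
    # Posterior variance of each theta_k shrinks as n grows
    print(n, np.round(map_est, 3), np.round(stats.dirichlet.var(alpha_post), 5))
```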

Nonparametric Bayesian Inference

Here we want to perform Bayesian inference over functions. Typically neither the prior nor the posterior has a density, but the posterior is still well-defined. We can't compute the posterior distribution explicitly, but we can sample from it.

Example

Stochastic Processes

In a stochastic process, we have a collection of random variables $\{X_t\}_{t \in T}$. Another way of representing a stochastic process is as the random function

$$t \mapsto X_t(\omega)$$

This is a random function indexed by time (a time series, say). If I draw another random $\omega$, I get another realization of the function. Recall that $\omega$ comes from the underlying probability space; we draw samples from it and compute $X_t(\omega)$.
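As a toy illustration (a Gaussian random walk standing in for a generic process), each seed plays the role of a draw of $\omega$ and yields a different function $t \mapsto X_t(\omega)$.

```python
import numpy as np

def sample_path(seed, n_steps=100):
    """One realization t -> X_t(omega): a Gaussian random walk."""
    rng = np.random.default_rng(seed)
    increments = rng.normal(0.0, 1.0, size=n_steps)
    return np.cumsum(increments)

# Different omegas (seeds) give different random functions of t
path_a = sample_path(seed=0)
path_b = sample_path(seed=1)
print(path_a[:5])
print(path_b[:5])
```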

Gaussian Processes

Suppose we have a Gaussian distribution in 2 dimensions:

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim \mathrm{Normal}\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix}\right)$$

What can we say about the matrix $K$?

  1. It's symmetric: $K_{12} = K_{21}$
  2. It's positive semidefinite: $K \succeq 0$

These are exactly the properties of a Mercer kernel. The conditionals are also Gaussian:

$$X_1 \mid X_2 \sim \mathrm{Normal}\!\left(\frac{K_{12}}{K_{22}}\,X_2,\; K_{11} - \frac{K_{12}^2}{K_{22}}\right), \qquad X_2 \mid X_1 \sim \mathrm{Normal}\!\left(\frac{K_{12}}{K_{11}}\,X_1,\; K_{22} - \frac{K_{12}^2}{K_{11}}\right)$$

If these random variables are uncorrelated, then $K_{12} = K_{21} = 0$, and the conditional reduces to the marginal: $\mu = 0$ and $\Sigma = K_{22}$.
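A small sketch checking the conditional formulas above against a Monte Carlo estimate; the covariance matrix $K$ and the conditioning value $x_2$ are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
K = np.array([[2.0, 0.8],
              [0.8, 1.0]])                        # assumed symmetric, positive definite K

# Analytic conditional X1 | X2 = x2
x2 = 1.5
cond_mean = K[0, 1] / K[1, 1] * x2
cond_var = K[0, 0] - K[0, 1] ** 2 / K[1, 1]

# Monte Carlo check: keep samples whose X2 component is close to x2
samples = rng.multivariate_normal([0.0, 0.0], K, size=500_000)
near = samples[np.abs(samples[:, 1] - x2) < 0.02, 0]
print(cond_mean, near.mean())                     # analytic vs. empirical conditional mean
print(cond_var, near.var())                       # analytic vs. empirical conditional variance
```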

How about a higher-dimensional Gaussian?

$$X = \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \sim \mathrm{Normal}\!\left(\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K_{11} & K_{12} & K_{13} \\ K_{21} & K_{22} & K_{23} \\ K_{31} & K_{32} & K_{33} \end{pmatrix}\right)$$

Again:

  1. It's symmetric: $K_{ij} = K_{ji}$
  2. It's positive semidefinite: $K \succeq 0$

A stochastic process $m$ is a Gaussian process if, for every finite set $X_1, X_2, \ldots, X_N$, the vector $\big(m(X_1), m(X_2), \ldots, m(X_N)\big)$ is normally distributed.

Example

$$\begin{pmatrix} m(X_1) \\ \vdots \\ m(X_N) \end{pmatrix} \sim \mathrm{Normal}\big(\mu(X),\, K(X)\big)$$

where

$$K(X) = \big(K(X_i, X_j)\big)_{i,j=1}^{N}$$

and $K$ is a Mercer kernel.

Let's fix some values $X_1, X_2, \ldots, X_N$; $K$ will then denote our covariance matrix. What is the prior distribution over $m$?

$$\pi(m) = (2\pi)^{-n/2}\,|K|^{-1/2}\exp\!\left(-\tfrac{1}{2}\, m^{\top} K^{-1} m\right)$$

In other words, our prior over the values of $m$ at these $n$ points is $\pi(m)$.

Similar to Mercer kernel regression, we can now make the change of variables $m = K\alpha$, where $\alpha \sim \mathrm{Normal}(0, K^{-1})$. So specifying a prior on $m$ is equivalent to specifying a prior on $\alpha$:

$$\pi(\alpha) = (2\pi)^{-n/2}\,|K|^{1/2}\exp\!\left(-\tfrac{1}{2}\,\alpha^{\top} K \alpha\right)$$

What functions have high probability under the Gaussian process prior? The prior favors functions for which $m^{\top} K^{-1} m$ is small.
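A sketch of drawing functions from such a prior, assuming a squared-exponential (RBF) Mercer kernel with an arbitrary length scale; the draws tend to be smooth precisely because rough functions make $m^{\top} K^{-1} m$ large.

```python
import numpy as np

def rbf_kernel(x, y, length_scale=0.5):
    """Squared-exponential Mercer kernel."""
    return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / length_scale ** 2)

x = np.linspace(0.0, 5.0, 100)
K = rbf_kernel(x, x)

# Sample m ~ Normal(0, K); the jitter keeps K numerically positive definite
jitter = 1e-8 * np.eye(len(x))
rng = np.random.default_rng(3)
prior_draws = rng.multivariate_normal(np.zeros(len(x)), K + jitter, size=3)
print(prior_draws.shape)        # (3, 100): three smooth random functions on the grid
```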

Recall

Let $v$ be a unit-norm eigenvector of $K$ with eigenvalue $\lambda$; then:

$$\frac{1}{\lambda} = v^{\top} K^{-1} v$$
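A quick numerical check of this identity, with an assumed $2 \times 2$ positive definite $K$:

```python
import numpy as np

K = np.array([[2.0, 0.6],
              [0.6, 1.0]])                    # assumed symmetric positive definite K
eigvals, eigvecs = np.linalg.eigh(K)
v, lam = eigvecs[:, 0], eigvals[0]            # unit-norm eigenvector and its eigenvalue
print(v @ np.linalg.inv(K) @ v, 1.0 / lam)    # both print the same value, 1 / lambda
```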

Using the likelihood

We observe $Y_i = m(X_i) + \epsilon_i$, where $\epsilon_i \sim \mathrm{Normal}(0, \sigma^2)$.

$$\log p(Y_i \mid X_i) = -\frac{1}{2\sigma^2}\,\big(Y_i - m(X_i)\big)^2 + \text{const}$$

And then:

$$\log \pi(m) = \log \exp\!\left(-\tfrac{1}{2}\, m^{\top} K^{-1} m\right) + \text{const} = -\tfrac{1}{2}\, m^{\top} K^{-1} m + \text{const}$$

If we combine them (up to additive constants):

$$\log p(Y \mid X, m) + \log \pi(m) = -\frac{1}{2\sigma^2}\sum_i \big(Y_i - m(X_i)\big)^2 - \tfrac{1}{2}\, m^{\top} K^{-1} m = -\frac{1}{2\sigma^2}\,\|Y - K\alpha\|^{2} - \tfrac{1}{2}\,\alpha^{\top} K \alpha$$

MAP estimation

Maximizing this objective is MAP estimation, which is equivalent to the minimization

$$\hat{\alpha} = \arg\min_{\alpha}\;\|Y - K\alpha\|^{2} + \sigma^{2}\,\alpha^{\top} K \alpha$$

And we have a closed-form solution for this, just as in Mercer kernel regression:

$$\hat{\alpha} = (K + \sigma^{2} I)^{-1} Y$$

and our estimate $\hat{m}$ is:

$$\hat{m} = K\hat{\alpha} = K\,(K + \sigma^{2} I)^{-1} Y$$
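Putting the pieces together, here is a minimal sketch of $\hat{m} = K(K + \sigma^{2} I)^{-1} Y$ on simulated data; the RBF kernel, the noise level $\sigma = 0.2$, and the target function $\sin(x)$ are assumptions made for illustration only.

```python
import numpy as np

def rbf_kernel(x, y, length_scale=0.5):
    return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / length_scale ** 2)

rng = np.random.default_rng(4)
sigma = 0.2
X = np.sort(rng.uniform(0.0, 5.0, size=40))
Y = np.sin(X) + rng.normal(0.0, sigma, size=X.size)   # Y_i = m(X_i) + eps_i

K = rbf_kernel(X, X)

# alpha_hat = (K + sigma^2 I)^{-1} Y, solved without forming an explicit inverse
alpha_hat = np.linalg.solve(K + sigma ** 2 * np.eye(X.size), Y)

# m_hat = K alpha_hat, the MAP fit at the training points
m_hat = K @ alpha_hat
print("RMSE vs. true m:", np.sqrt(np.mean((m_hat - np.sin(X)) ** 2)))
```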

Note: here is the general conditioning formula for jointly Gaussian vectors.

Suppose $(X_1, X_2)$ are jointly Gaussian with distribution:

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim \mathrm{Normal}\!\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} A & C \\ C^{\top} & B \end{pmatrix}\right)$$

Then the conditional distributions are:

$$X_1 \mid x_2 \sim \mathrm{Normal}\!\big(\mu_1 + C B^{-1}(x_2 - \mu_2),\; A - C B^{-1} C^{\top}\big), \qquad X_2 \mid x_1 \sim \mathrm{Normal}\!\big(\mu_2 + C^{\top} A^{-1}(x_1 - \mu_1),\; B - C^{\top} A^{-1} C\big)$$

In Gaussian process regression, the covariance matrix takes this same block form, as in the sketch below.
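A sketch of the conditioning formula as a small helper, with made-up block matrices $A$, $B$, $C$; it returns the mean and covariance of $X_1 \mid X_2 = x_2$.

```python
import numpy as np

def condition_gaussian(mu1, mu2, A, B, C, x2):
    """Conditional of X1 given X2 = x2 for jointly Gaussian (X1, X2)
    with mean (mu1, mu2) and covariance [[A, C], [C.T, B]]."""
    B_inv = np.linalg.inv(B)
    cond_mean = mu1 + C @ B_inv @ (x2 - mu2)
    cond_cov = A - C @ B_inv @ C.T
    return cond_mean, cond_cov

# Made-up 2 + 2 block example (the full covariance is positive definite)
mu1, mu2 = np.zeros(2), np.zeros(2)
A = np.array([[1.0, 0.3], [0.3, 1.0]])
B = np.array([[1.5, 0.2], [0.2, 1.0]])
C = np.array([[0.4, 0.1], [0.0, 0.3]])
mean, cov = condition_gaussian(mu1, mu2, A, B, C, x2=np.array([0.5, -1.0]))
print(mean, cov, sep="\n")
```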

This post is based on material from the intermediate machine learning course SDS 365/565 at Yale University, taught by John Lafferty, where I was a TF.