Gaussian Processes
Bayesian Inference
The parameter $\theta$ is treated as a random variable.
- Choose a generative model $p(x \mid \theta)$ for the data.
- Choose a prior distribution $\pi(\theta)$.
- After observing data points $X_1, \dots, X_n$, calculate the posterior distribution $p(\theta \mid X_1, \dots, X_n)$.
Bayes’ Theorem
A simple consequence of conditional probability:
$$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}.$$
Using this theorem, we can write down the posterior distribution:
$$p(\theta \mid X_{1:n}) = \frac{p(X_{1:n} \mid \theta)\, \pi(\theta)}{m(X_{1:n})},$$
where
$$m(X_{1:n}) = \int p(X_{1:n} \mid \theta)\, \pi(\theta)\, d\theta$$
is the normalizing constant, or evidence.
Example 1:
Take the model $X_i \mid \theta \sim \mathrm{Bernoulli}(\theta)$, i.e. coin flips with unknown bias $\theta \in [0, 1]$.
Observing data $X_1, \dots, X_n \in \{0, 1\}$ with $k = \sum_{i=1}^n X_i$ heads, the likelihood is $p(X_{1:n} \mid \theta) = \theta^{k}(1 - \theta)^{n - k}$.
Let the prior be $\theta \sim \mathrm{Beta}(\alpha, \beta)$, with density $\pi(\theta) \propto \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$. Then the posterior is
$$\theta \mid X_{1:n} \sim \mathrm{Beta}(\alpha + k,\; \beta + n - k).$$
The Beta prior is conjugate to the Bernoulli likelihood: the posterior has the same form as the prior. We will see in this post that the Gaussian process is similar, in that the posterior distribution has the same form as the prior distribution.
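As a quick check of the conjugate update, here is a minimal sketch in Python (NumPy/SciPy). The true bias of 0.7, the sample size, and the uniform Beta(1, 1) prior are illustrative choices, not part of the example above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical true coin bias, unknown to the inference procedure.
theta_true = 0.7
n = 50
x = rng.binomial(1, theta_true, size=n)  # observed coin flips X_1, ..., X_n
k = x.sum()                              # number of heads

# Beta(alpha, beta) prior; alpha = beta = 1 is the uniform prior.
alpha, beta = 1.0, 1.0

# Conjugacy: the posterior is Beta(alpha + k, beta + n - k).
posterior = stats.beta(alpha + k, beta + n - k)
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```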
Dirichlet
A generalization of the coin problem is the dice trial: $X_i \mid \theta \sim \mathrm{Categorical}(\theta)$ with $\theta = (\theta_1, \dots, \theta_K)$ and $\sum_j \theta_j = 1$. The conjugate prior is the Dirichlet distribution,
$$\pi(\theta) \propto \prod_{j=1}^{K} \theta_j^{\alpha_j - 1},$$
where $\alpha_1, \dots, \alpha_K > 0$ are the concentration parameters. Observing counts $n_j = \#\{i : X_i = j\}$, the posterior is again Dirichlet: $\theta \mid X_{1:n} \sim \mathrm{Dirichlet}(\alpha_1 + n_1, \dots, \alpha_K + n_K)$.
Indeed, as we see more and more data, the posterior Dirichlet distribution concentrates around the true parameters: its variance shrinks at rate $O(1/n)$.
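This concentration is easy to see numerically. Below is a minimal sketch, assuming an illustrative loaded die and a symmetric Dirichlet(1, ..., 1) prior; as $n$ grows, the posterior mean approaches the true probabilities and the posterior standard deviations shrink.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical loaded six-sided die; these probabilities are illustrative.
theta_true = np.array([0.1, 0.1, 0.1, 0.2, 0.2, 0.3])
alpha_prior = np.ones(6)  # symmetric Dirichlet(1, ..., 1) prior

for n in (10, 100, 10_000):
    rolls = rng.choice(6, size=n, p=theta_true)
    counts = np.bincount(rolls, minlength=6)
    alpha_post = alpha_prior + counts                 # conjugate update
    post_mean = alpha_post / alpha_post.sum()
    # Dirichlet marginal variance: mean_j (1 - mean_j) / (alpha_0 + 1).
    post_sd = np.sqrt(post_mean * (1 - post_mean) / (alpha_post.sum() + 1))
    print(n, np.round(post_mean, 3), "max sd:", round(post_sd.max(), 4))
```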
Nonparametric Bayesian Inference
Here we want to do Bayesian inference over functions. Typically neither the prior nor the posterior has a density, but the posterior is still well-defined. We can't compute the posterior distribution explicitly, but we can sample from it.
Stochastic Processes
A stochastic process is a set of random variables $\{X_t\}_{t \in T}$ indexed by $t \in T$.
For a fixed random outcome $\omega$, the map $t \mapsto X_t(\omega)$ is a random function indexed by time (let's say a time series).
I will get another random function if I draw another random outcome.
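To make this concrete, here is a minimal sketch in Python where the process is a Gaussian random walk (an illustrative choice); each call to `sample_path` uses fresh randomness and therefore produces a different realization of the random function.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_path(n_steps: int = 200) -> np.ndarray:
    """Draw one realization t -> X_t(omega) of a Gaussian random walk."""
    increments = rng.normal(0.0, 1.0, size=n_steps)
    return np.cumsum(increments)

# Each call uses a fresh random outcome, so each path is a different function.
paths = [sample_path() for _ in range(3)]
for i, path in enumerate(paths):
    print(f"path {i}: first values {np.round(path[:3], 2)}")
```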
Gaussian Processes
Suppose I have a Gaussian distribution in $\mathbb{R}^2$, say $(X_1, X_2) \sim N(\mu, K)$. What can we tell about the matrix $K$?
- It's symmetric: $K_{12} = K_{21}$.
- It's positive definite: $a^\top K a > 0$ for all $a \neq 0$.
These are the Mercer kernel properties. The conditionals are also Gaussian:
$$X_1 \mid X_2 = x_2 \;\sim\; N\!\Big(\mu_1 + \tfrac{K_{12}}{K_{22}}(x_2 - \mu_2),\; K_{11} - \tfrac{K_{12}^2}{K_{22}}\Big).$$
If these random variables are not correlated, then $K_{12} = 0$: the two components are independent and conditioning has no effect.
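Here is a minimal numeric sketch of the bivariate conditioning formula; the covariance values are illustrative.

```python
import numpy as np

# Illustrative joint Gaussian over (X_1, X_2).
mu = np.array([0.0, 0.0])
K = np.array([[1.0, 0.8],
              [0.8, 1.0]])  # symmetric and positive definite

# Condition X_1 on X_2 = 1.5 using the bivariate formula above.
x2 = 1.5
cond_mean = mu[0] + K[0, 1] / K[1, 1] * (x2 - mu[1])
cond_var = K[0, 0] - K[0, 1] ** 2 / K[1, 1]
print("E[X_1 | X_2 = 1.5]   =", cond_mean)  # 1.2
print("Var[X_1 | X_2 = 1.5] =", cond_var)   # 0.36
```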
How about a multidimensional Gaussian, $X = (X_1, \dots, X_n) \sim N(\mu, K)$ with $K \in \mathbb{R}^{n \times n}$?
Again:
- It's symmetric: $K_{ij} = K_{ji}$.
- It's positive definite: $a^\top K a > 0$ for all $a \neq 0$.
A stochastic process $\{f(x)\}_{x \in \mathcal{X}}$ is a Gaussian process if for every finite set $x_1, \dots, x_n \in \mathcal{X}$, the vector $(f(x_1), \dots, f(x_n))$ is normally distributed:
$$(f(x_1), \dots, f(x_n)) \sim N(m, K),$$
where $m_i = m(x_i)$ for a mean function $m(\cdot)$, and where $K_{ij} = K(x_i, x_j)$ for a covariance (kernel) function $K(\cdot, \cdot)$. We write $f \sim \mathrm{GP}(m, K)$.
Let's fix some values $x_1, \dots, x_n$ and write $f = (f(x_1), \dots, f(x_n))$.
In other words, our prior over these function values is (taking the mean function to be zero)
$$f \sim N(0, K), \qquad K_{ij} = K(x_i, x_j),$$
with density proportional to $\exp\!\big(-\tfrac{1}{2} f^\top K^{-1} f\big)$.
Similar to Mercer kernel regression, now we can do a change of variables $f = K\alpha$, so the exponent becomes $-\tfrac{1}{2}\alpha^\top K \alpha$, the familiar kernel penalty.
What functions have high probability according to the Gaussian process prior?
The prior favors functions with small $f^\top K^{-1} f$, i.e. functions that are smooth with respect to the kernel (small RKHS norm).
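To see what favored functions look like, here is a minimal sketch of drawing samples from a zero-mean GP prior; the squared-exponential kernel, its lengthscale, and the input grid are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(x1, x2, lengthscale=0.3):
    """Squared-exponential kernel K(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

# Fix a grid x_1, ..., x_n and draw f ~ N(0, K) from the GP prior.
x = np.linspace(0.0, 1.0, 100)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))  # jitter for numerical stability
L = np.linalg.cholesky(K)
prior_samples = L @ rng.normal(size=(len(x), 3))  # three smooth random functions
print(prior_samples.shape)  # (100, 3); each column is one draw from the prior
```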
Recall the Bayesian recipe: we need a prior and a likelihood. The Gaussian process provides the prior over the regression function, $f \sim \mathrm{GP}(0, K)$.
Let $f = (f(x_1), \dots, f(x_n))$ denote its values at the inputs.
Using the likelihood, we model the outputs as noisy evaluations of $f$.
We observe
$$Y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2).$$
And then:
$$Y \mid f \sim N(f, \sigma^2 I).$$
If we combine them:
$$p(f \mid Y) \;\propto\; \exp\!\Big(-\tfrac{1}{2\sigma^2}\|Y - f\|^2 \;-\; \tfrac{1}{2} f^\top K^{-1} f\Big).$$
MAP estimation
This is MAP estimation:
$$\hat f \;=\; \arg\min_{f}\; \frac{1}{2\sigma^2}\|Y - f\|^2 + \frac{1}{2} f^\top K^{-1} f.$$
And we have a solution for this based on Mercer's kernel:
$$\hat f \;=\; K (K + \sigma^2 I)^{-1} Y,$$
and our estimate for the function at a new point $x$ is
$$\hat f(x) \;=\; \sum_{i=1}^{n} \hat\alpha_i K(x, x_i), \qquad \hat\alpha = (K + \sigma^2 I)^{-1} Y.$$
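Here is a minimal sketch of this MAP / kernel solution, assuming an illustrative sine target, noise level $\sigma = 0.1$, and squared-exponential kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(x1, x2, lengthscale=0.3):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

# Noisy observations Y_i = f(x_i) + eps_i of an illustrative target function.
x_train = np.sort(rng.uniform(0.0, 1.0, size=30))
sigma = 0.1
y_train = np.sin(2 * np.pi * x_train) + sigma * rng.normal(size=x_train.size)

# MAP estimate at the training inputs: f_hat = K (K + sigma^2 I)^{-1} Y.
K = rbf_kernel(x_train, x_train)
alpha_hat = np.linalg.solve(K + sigma**2 * np.eye(len(x_train)), y_train)
f_hat_train = K @ alpha_hat

# Prediction at new points: f_hat(x) = sum_i alpha_hat_i K(x, x_i).
x_test = np.linspace(0.0, 1.0, 200)
f_hat_test = rbf_kernel(x_test, x_train) @ alpha_hat
print("training RMSE:", np.sqrt(np.mean((f_hat_train - y_train) ** 2)))
```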
Note: a general formula for $n$-dimensional data points.
Suppose
$$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},\; \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right).$$
Then the conditional distributions are Gaussian, with mean
$$\mathbb{E}[X_1 \mid X_2 = x_2] = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2).$$
The covariance matrix will be in a similar form as follows:
$$\mathrm{Cov}(X_1 \mid X_2 = x_2) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$
Applying this to the joint Gaussian over the training outputs $Y$ and the function values at new test points gives the Gaussian process posterior predictive distribution.
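Here is a minimal sketch applying this conditioning formula to Gaussian process prediction; the data, kernel, and noise level mirror the illustrative setup above.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(x1, x2, lengthscale=0.3):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

# Same illustrative training data as in the MAP sketch above.
x_train = np.sort(rng.uniform(0.0, 1.0, size=30))
sigma = 0.1
y_train = np.sin(2 * np.pi * x_train) + sigma * rng.normal(size=x_train.size)
x_test = np.linspace(0.0, 1.0, 200)

K = rbf_kernel(x_train, x_train) + sigma**2 * np.eye(len(x_train))  # Cov(Y)
K_star = rbf_kernel(x_train, x_test)       # train-test kernel values
K_star_star = rbf_kernel(x_test, x_test)   # test-test kernel values

# Posterior predictive mean and covariance via the conditioning formula.
post_mean = K_star.T @ np.linalg.solve(K, y_train)
post_cov = K_star_star - K_star.T @ np.linalg.solve(K, K_star)
post_sd = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))
print("max posterior sd:", post_sd.max())
```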
This post is based on material from Intermediate Machine Learning (SDS 365/565) at Yale University, taught by John Lafferty, for which I was a TF.