Approximate Inference
In the previous post, we talked about Gibbs sampling and posterior inference. However, computing Bayesian posteriors exactly is often impractical. There are two main approaches to approximation:
- Simulation
- Variational Inference
Simulation is an unbiased and “right” way to do it.
The template for the generative model is:
- Choose $Z$
- Given $z$, generate (sample) $X$
We often want to invert this by posterior inference:
- Given $X$
- What is $Z$ that generated it?
If we have a random vector $Z\sim p(Z|x)$, we might want to compute the following:
- marginal probabilities $P(Z_i=z|x)$
- marginal means $\mathbb{E}(Z_i|x)$
- most probable assignments $z^*= \arg\max_z P(\{Z_i=z_i\}|x)$. This is called the maximum a posteriori (MAP) problem.
- maximum marginals $z_i^* = \arg\max_{z_i} P(Z_i=z_i|x)$
- joint probability $P(Z|x)$
- joint mean $\mathbb{E}(Z|x)$
We can rarely solve these problems exactly in a reasonable amount of time, so we have to approximate in some way.
Wrapping up Gibbs sampling
Before continuing with simulation and variational methods, let’s finish derivations for Gibbs sampling.
$$ \begin{aligned} F \sim \text{DP}(\alpha,F_0) \\ \theta_1,\theta_2,..,\theta_n | F \sim F \\ X_i | \theta_i \sim f(x|\theta_i), i=1,2,..,n \end{aligned} $$
Last time, we talked about one version of the CRP-based sampler in which we keep the cluster parameters $\theta_i^*$.
Today, we consider the case where we do not keep the $\theta_i^*$. Instead of $\theta_1,\theta_2,..,\theta_n$, we work with indicators $Z_1,Z_2,..,Z_n$, where $Z_i$ is the table at which customer $i$ is sitting (i.e., the clustering of the data). In the previous example, $Z_1=1$ and $Z_2=2$. We do not need the $\theta$'s at all.
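Before deriving the sampler, here is a minimal sketch of the generative direction: simulate table assignments from the CRP and then data from a Normal likelihood. The function name, the Normal choices, and all parameter values are illustrative, not from the lecture.

```python
import numpy as np

def sample_crp_mixture(n, alpha, mu0=0.0, tau0=1.0, sigma=0.5, seed=None):
    """Simulate table assignments Z_1..Z_n from the CRP, then x_i ~ Normal(theta*_{Z_i}, sigma^2)."""
    rng = np.random.default_rng(seed)
    z = np.zeros(n, dtype=int)      # table assignments (the clustering)
    counts = []                     # number of customers at each table
    thetas = []                     # one theta* per table, drawn from F0 = Normal(mu0, tau0^2)
    for i in range(n):
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)   # CRP seating probabilities
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):    # customer starts a new table
            counts.append(0)
            thetas.append(rng.normal(mu0, tau0))
        counts[table] += 1
        z[i] = table
    x = rng.normal(np.array(thetas)[z], sigma)   # x_i | theta_i ~ f(x | theta_i)
    return z, x

z, x = sample_crp_mixture(n=5, alpha=1.0, seed=0)
print(z, x)
```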
In the Gibbs sampler, we iterate over $i$ from $1$ to $n$. We reassign $i$ to some table.
What is that, probabilistically?
- we compute the conditional $P(z_i|z_{-i},x_{1:n})$
- we sample $z_i$ from this distribution.
Now, let’s compute: $$ P(Z_1=j|Z_2,Z_3,Z_4,Z_5,x_{1:5}) $$
There are $3$ different possibilities: $$ \begin{align} j=1,w_1: & \frac{1}{4+\alpha} p(x_1|x_5) \end{align} $$
Example:
Assume $p(x|\theta)$ is a Normal distribution $\text{Normal}(\theta,\sigma^2)$. $$ F_0 = \text{Normal}(\mu_0,\tau_0^2) $$
The posterior of $\theta$ given $n$ observations is: $$ \begin{align} \theta \mid x_{1:n} &\sim \text{Normal}(\bar{\theta}_n,\bar{\tau}_n^2) \\ \bar{\theta}_n &= w_n \bar{X}_n+(1-w_n)\mu_0 \end{align} $$
Where:
$$ \begin{align} w_n = \frac{1}{1+\frac{\sigma^2/n}{\tau_0^2}}, \qquad \frac{1}{\bar{\tau}_n^2} = \frac{n}{\sigma^2} + \frac{1}{\tau_0^2} \end{align} $$
Therefore:
$$ p(x_1|x_5) = \text{Normal}\big(w_1x_5+(1-w_1)\mu_0,\ \bar{\tau}_1^2+\sigma^2\big) $$
where $w_1$ and $\bar{\tau}_1^2$ are obtained from the formulas above with $n=1$.
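As a quick sanity check on these formulas, here is a minimal sketch of the posterior-predictive computation. The function name and the numeric values are mine; the Normal likelihood and base measure are the ones assumed above.

```python
def normal_posterior_predictive(xs, mu0, tau0_sq, sigma_sq):
    """Return the mean and variance of p(x_new | xs) under a Normal(theta, sigma^2)
    likelihood and a Normal(mu0, tau0^2) prior on theta, using the formulas above."""
    n = len(xs)
    if n == 0:                                          # empty table: prior predictive p(x | F0)
        return mu0, tau0_sq + sigma_sq
    xbar = sum(xs) / n
    w_n = 1.0 / (1.0 + (sigma_sq / n) / tau0_sq)        # shrinkage weight w_n
    tau_n_sq = 1.0 / (n / sigma_sq + 1.0 / tau0_sq)     # posterior variance of theta
    return w_n * xbar + (1.0 - w_n) * mu0, tau_n_sq + sigma_sq

# e.g. the predictive p(x_1 | x_5) for a table holding the single (made-up) point x_5 = 1.3
print(normal_posterior_predictive([1.3], mu0=0.0, tau0_sq=1.0, sigma_sq=0.25))
```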
Similarly:
$$
\begin{align}
j=2,w_2: & \frac{3}{4+\alpha} p(x_1|x_2,x_3,x_4) \\
& \text{with predictive mean } w_3\frac{x_2+x_3+x_4}{3} + (1-w_3)\mu_0
\end{align}
$$
**Why $w_3$?** Because there are three data points at that table.
The variance is computed similarly, and the same calculation applies to any other occupied table.
How about the new table?
$$ \begin{align} j=\text{new table}, w_0: & \frac{\alpha}{4+\alpha} p(x_1|F_0) \end{align} $$
where $ p(x_1|F_0) = \text{Normal}(\mu_0,\tau_0^2+\sigma^2)$
Now we have $3$ numbers, which do not necessarily sum to one. We can normalize them. That gives me the weights of a $3$-sided die.
If the die comes up $w_2$, I move the customer to table $2$; if it comes up the new table, I move her to a new table; and if it comes up table $1$, I leave her at table $1$. All of this is one step of the Gibbs sampler. Then I take the next data point, $x_2$, and repeat.
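To make this concrete, here is a tiny sketch of one such reassignment. The data value for $x_1$ and the predictive means and variances are made up; only the $\frac{1}{4+\alpha}$, $\frac{3}{4+\alpha}$, $\frac{\alpha}{4+\alpha}$ prior weights come from the example above.

```python
import numpy as np
from scipy.stats import norm

# Reassigning x_1 = 0.2 (a made-up value). The predictive means/variances below are also
# made up; in practice they come from the conjugate formulas, e.g. p(x_1 | x_5) for table 1.
alpha, x1 = 1.0, 0.2
predictive = {"table 1": (-0.7, 0.45), "table 2": (1.0, 0.33), "new table": (0.0, 1.25)}
prior = {"table 1": 1 / (4 + alpha), "table 2": 3 / (4 + alpha), "new table": alpha / (4 + alpha)}

w = {j: prior[j] * norm.pdf(x1, loc=m, scale=np.sqrt(v)) for j, (m, v) in predictive.items()}
probs = np.array(list(w.values())) / sum(w.values())        # normalize the three numbers
table = np.random.default_rng(0).choice(list(w), p=probs)   # roll the 3-sided die
print(dict(zip(w, probs.round(3))), "->", table)
```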
The predictive density for a new point $x$ given this clustering is:
$$ \begin{aligned} \text{table 1: } &\frac{1}{5+\alpha}p(x|x_5) + \text{table 2: }\frac{3}{5+\alpha}p(x|x_2,x_3,x_4)\\ +\text{table 3: }&\frac{1}{5+\alpha}p(x|x_1)+\text{new table: }\frac{\alpha}{5+\alpha}p(x|F_0) \end{aligned} $$
Gibbs sampling for the DPM
For each point $x_i$:
- For every non-empty cluster $j$, compute: $$ w_j = \frac{n_{j,-i}}{n-1+\alpha} p(x_i|x_{i'} \text{ in cluster } j \text{ for } i'\neq i) $$
- For the empty cluster, compute: $$ w_0 = \frac{\alpha}{n-1+\alpha}p(x_i|F_0) $$
- Normalize $w_j\leftarrow w_j/\sum_kw_k$
- Reassign $x_i$ to cluster $j$ with probability $w_j$, possibly starting a new cluster (a sketch follows below)
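Here is a minimal sketch of one full sweep of this collapsed Gibbs sampler, assuming the conjugate Normal setup from the example; the function names, data values, and hyperparameters are mine.

```python
import numpy as np
from scipy.stats import norm

def predictive_params(xs, mu0, tau0_sq, sigma_sq):
    """Posterior-predictive Normal parameters given the points xs already in a cluster."""
    n = len(xs)
    if n == 0:
        return mu0, tau0_sq + sigma_sq                         # p(x | F0)
    w = 1.0 / (1.0 + (sigma_sq / n) / tau0_sq)
    tau_n_sq = 1.0 / (n / sigma_sq + 1.0 / tau0_sq)
    return w * np.mean(xs) + (1.0 - w) * mu0, tau_n_sq + sigma_sq

def gibbs_sweep(x, z, alpha, mu0=0.0, tau0_sq=1.0, sigma_sq=0.25, seed=None):
    """One pass over all points: compute the weights w_j and w_0 for x_i and resample z_i."""
    rng = np.random.default_rng(seed)
    n = len(x)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        labels = sorted({int(z[j]) for j in others})
        weights, candidates = [], []
        for c in labels:                                       # non-empty clusters
            members = [x[j] for j in others if z[j] == c]
            m, v = predictive_params(members, mu0, tau0_sq, sigma_sq)
            weights.append(len(members) / (n - 1 + alpha) * norm.pdf(x[i], m, np.sqrt(v)))
            candidates.append(c)
        m, v = predictive_params([], mu0, tau0_sq, sigma_sq)   # the empty cluster
        weights.append(alpha / (n - 1 + alpha) * norm.pdf(x[i], m, np.sqrt(v)))
        candidates.append(max(labels, default=-1) + 1)
        probs = np.array(weights) / np.sum(weights)            # normalize
        z[i] = rng.choice(candidates, p=probs)                 # reassign x_i, possibly to a new cluster
    return z

x = np.array([0.2, 1.1, 0.9, 1.3, -0.8])                       # made-up data
z = np.zeros(len(x), dtype=int)                                # start with everyone in one cluster
for _ in range(100):                                           # run 100 sweeps
    z = gibbs_sweep(x, z, alpha=1.0)
print(z)
```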
For each sampled clustering $c^{(b)}$, we get a predictive density:
$$ p(x|c_{1:n}^{(b)},x_{1:n}) = \sum_j \frac{n_j^{(b)}}{n+\alpha} p\big(x \mid \{x_i : c_i^{(b)} = j\}\big) + \frac{\alpha}{n+\alpha} p(x|F_0) $$
The posterior predictive density is then approximated by averaging over the $B$ samples:
$$ p(x_{n+1}|x_{1:n}) = \frac{1}{B}\sum_{b=1}^B p(x_{n+1}|c_{1:n}^{(b)},x_{1:n}) $$
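A sketch of this Monte Carlo average, given a few stored clusterings: in practice these would be the assignment vectors produced by the sweeps above; here the clusterings, data, and hyperparameters are made up.

```python
import numpy as np
from scipy.stats import norm

def predictive_params(xs, mu0, tau0_sq, sigma_sq):
    """Posterior-predictive Normal parameters for one cluster (empty -> prior predictive)."""
    n = len(xs)
    if n == 0:
        return mu0, tau0_sq + sigma_sq
    w = 1.0 / (1.0 + (sigma_sq / n) / tau0_sq)
    tau_n_sq = 1.0 / (n / sigma_sq + 1.0 / tau0_sq)
    return w * np.mean(xs) + (1.0 - w) * mu0, tau_n_sq + sigma_sq

def predictive_density(x_new, x, z, alpha, mu0=0.0, tau0_sq=1.0, sigma_sq=0.25):
    """p(x_new | clustering z, data x): CRP-weighted mixture of per-cluster predictives plus the F0 term."""
    n = len(x)
    total = 0.0
    for c in set(z):
        members = x[np.asarray(z) == c]
        m, v = predictive_params(members, mu0, tau0_sq, sigma_sq)
        total += len(members) / (n + alpha) * norm.pdf(x_new, m, np.sqrt(v))
    m, v = predictive_params([], mu0, tau0_sq, sigma_sq)
    total += alpha / (n + alpha) * norm.pdf(x_new, m, np.sqrt(v))
    return total

# Average over B sampled clusterings (here two made-up ones, for illustration only).
x = np.array([0.2, 1.1, 0.9, 1.3, -0.8])
samples = [np.array([0, 1, 1, 1, 0]), np.array([2, 1, 1, 1, 0])]
print(np.mean([predictive_density(0.5, x, z, alpha=1.0) for z in samples]))
```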
Variational methods
Variational methods are fundamentally different: Gibbs sampling is a stochastic approximation, while variational methods give a deterministic approximation. Variational methods converge numerically, whereas Gibbs sampling converges in a distributional sense.
Ising model
We have a graph with vertices $V$ and edges $E$. Each node $s$ has a random variable $Z_s$ that can be up ($z_s=1$) or down ($z_s=0$):
$$ P_{\beta}(z_1,..,z_n) \propto \exp\Big( \sum_{s\in V}\beta_s z_s+ \sum_{(s,t)\in E}\beta_{st}z_sz_t \Big) $$
You can think of the nodes (atoms) as politicians and $Z_s$ as their votes. For example, politicians are heavily connected to members of their own party and weakly connected to those outside it.
For a large graph, I can’t compute the posterior mean, the marginal probabilities, or anything else exactly. I need either a stochastic approximation or a variational approximation.
Gibbs sampler (stochastic approximation)
- Choose vertex $s\in V$ at random
- Sample $u\sim \text{Uniform}(0,1)$ and update:
$$ Z_s = \begin{cases} 1 & u \leq (1+\exp(-\beta_s-\sum_{t\in N(s)}\beta_{st}z_t))^{-1}\\ 0 & \text{ otherwise } \end{cases} $$
where $\beta_{st}$ denotes the strength of the interaction between nodes $s$ and $t$. This conditional has exactly the form of a logistic regression.
- Iterate
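Here is a minimal sketch of this random-scan Gibbs sampler on a small graph; the graph, the coupling values, and the number of iterations are all made up.

```python
import numpy as np

def ising_gibbs(beta_node, beta_edge, neighbors, n_steps, seed=None):
    """Random-scan Gibbs sampler for the 0/1 Ising model defined above."""
    rng = np.random.default_rng(seed)
    n = len(beta_node)
    z = rng.integers(0, 2, size=n)                      # random initial spins
    for _ in range(n_steps):
        s = int(rng.integers(n))                        # choose a vertex at random
        field = beta_node[s] + sum(beta_edge[s, t] * z[t] for t in neighbors[s])
        p = 1.0 / (1.0 + np.exp(-field))                # conditional P(Z_s = 1 | rest)
        z[s] = int(rng.uniform() <= p)                  # compare with u ~ Uniform(0,1)
    return z

# Tiny example: a 4-cycle with positive couplings (all numbers are illustrative).
n = 4
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
beta_node = np.zeros(n)
beta_edge = np.zeros((n, n))
for s, ts in neighbors.items():
    for t in ts:
        beta_edge[s, t] = 0.8
print(ising_gibbs(beta_node, beta_edge, neighbors, n_steps=5000, seed=1))
```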
Mean field variational algorithm (deterministic approximation)
- Choose vertex $s\in V$ at random
- Update:
$$ \mu_s = \Big(1+\exp\big(-\beta_s-\sum_{t\in N(s)}\beta_{st}\mu_t\big)\Big)^{-1} $$
- Iterate
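And a sketch of the corresponding mean-field iteration on the same kind of toy graph; again the graph, couplings, and sweep count are illustrative. Note that the update is a deterministic function of the neighbor means, with no sampling.

```python
import numpy as np

def ising_mean_field(beta_node, beta_edge, neighbors, n_sweeps=200):
    """Coordinate updates: each mu_s is set to a deterministic function of the
    current neighbor means mu_t (no randomness; convergence is numerical)."""
    n = len(beta_node)
    mu = np.full(n, 0.5)                               # initialize the means
    for _ in range(n_sweeps):
        for s in range(n):
            field = beta_node[s] + sum(beta_edge[s, t] * mu[t] for t in neighbors[s])
            mu[s] = 1.0 / (1.0 + np.exp(-field))       # update mu_s
    return mu                                          # approximate P(Z_s = 1) for each node

n = 4
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
beta_node = np.zeros(n)
beta_edge = np.zeros((n, n))
for s, ts in neighbors.items():
    for t in ts:
        beta_edge[s, t] = 0.8
print(ising_mean_field(beta_node, beta_edge, neighbors))
```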
Deterministic vs. stochastic approximation
- The $z_i$ variables are random
- The $\mu_i$ variables are deterministic
- The Gibbs sampler converges in distribution
- The mean field iteration converges numerically
- The Gibbs sampler approximates the full distribution
- The mean field algorithm approximates the mean of each node
Example 2: A finite mixture model
Fix two distributions $F_0$ and $F_1$: $$ \begin{align} \theta \sim \text{Beta}(\alpha,\beta) \\ X| \theta \sim \theta F_1 +(1-\theta)F_0 \end{align} $$
The likelihood for the data $x_1,..,x_n$ is: $$ p(x_{1:n}) =\int_0^1 \text{Beta}(\theta|\alpha,\beta) \prod_{i=1}^n(\theta f_1(x_i)+(1-\theta)f_0(x_i))d\theta $$
Goal: to approximate the posterior $p(\theta|x_{1:n})$
Gibbs sampler (stochastic approximation)
- sample $Z_i| \theta,x_{1:n}$
- sample $\theta| z_{1:n},x_{1:n}$
First step: independently sample $u_i\sim \text{Uniform}(0,1)$ and set:
$$ Z_i = \begin{cases} 1 & u_i \leq \frac{\theta f_1(x_i)}{\theta f_1(x_i)+(1-\theta)f_0(x_i)}\\ 0 & \text{ otherwise } \end{cases} $$
Second step:
$$ \theta \sim \text{Beta}\Big( \sum_{i=1}^n z_i +\alpha,n- \sum_{i=1}^n z_i +\beta \Big) $$
The posterior $p(\theta|x_{1:n})$ is approximated as a mixture of Beta distributions, with at most $n + 1$ distinct components (one for each possible value of $\sum_i z_i$). We could also integrate $\theta$ out, as in a collapsed Gibbs sampler.
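A minimal sketch of this two-step sampler, taking $f_0$ and $f_1$ to be two fixed Normal densities for concreteness; those densities, the synthetic data, and the hyperparameters are my choices, not the lecture's.

```python
import numpy as np
from scipy.stats import norm

def mixture_gibbs(x, alpha, beta, n_iter=2000, seed=None):
    """Alternate: sample z_{1:n} | theta, x and theta | z_{1:n}, x."""
    rng = np.random.default_rng(seed)
    f0 = norm(loc=-1.0, scale=1.0).pdf          # illustrative choice of f0
    f1 = norm(loc=2.0, scale=1.0).pdf           # illustrative choice of f1
    n = len(x)
    theta = 0.5
    thetas = []
    for _ in range(n_iter):
        p1 = theta * f1(x) / (theta * f1(x) + (1 - theta) * f0(x))   # P(Z_i = 1 | theta, x_i)
        z = (rng.uniform(size=n) <= p1).astype(int)                  # step 1: sample z_i
        theta = rng.beta(z.sum() + alpha, n - z.sum() + beta)        # step 2: sample theta | z
        thetas.append(theta)
    return np.array(thetas)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(2.0, 1.0, 30), rng.normal(-1.0, 1.0, 70)])   # synthetic data
samples = mixture_gibbs(x, alpha=1.0, beta=1.0, seed=1)
print(samples[500:].mean())     # Monte Carlo estimate of the posterior mean of theta
```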
Variational Inference (deterministic approximation)
Iterate the following steps for the variational parameters $q_{1:n}$ and $(\gamma_1,\gamma_2)$ (a sketch follows below):
- Holding $q_{1:n}$ fixed, set $\gamma=(\gamma_1,\gamma_2)$ to: $$ \gamma_1 = \alpha + \sum_{i=1}^n q_i , \qquad \gamma_2 = \beta+n-\sum_{i=1}^n q_i $$
- Holding $\gamma_1$ and $\gamma_2$ fixed, set $q_i$ to: $$ q_i = \frac{f_1(x_i)\exp \psi(\gamma_1)}{f_1(x_i)\exp \psi(\gamma_1)+f_0(x_i)\exp \psi(\gamma_2)} $$

Here $\psi$ is the digamma function.
After convergence, the approximate posterior distribution over $\theta$ is:
$$ \hat{p}(\theta| x_{1:n}) = \text{Beta}(\theta|\gamma_1,\gamma_2) $$
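And a sketch of the coordinate-ascent updates above, using the same illustrative $f_0$, $f_1$, and synthetic data as in the Gibbs sketch; `scipy.special.digamma` plays the role of $\psi$.

```python
import numpy as np
from scipy.special import digamma
from scipy.stats import norm

def mixture_cavi(x, alpha, beta, n_iter=100):
    """Iterate the two closed-form updates for (gamma1, gamma2) and q_{1:n}."""
    f0 = norm(loc=-1.0, scale=1.0).pdf(x)       # illustrative choice of f0, evaluated at the data
    f1 = norm(loc=2.0, scale=1.0).pdf(x)        # illustrative choice of f1
    n = len(x)
    q = np.full(n, 0.5)
    for _ in range(n_iter):
        gamma1 = alpha + q.sum()                # update gamma holding q fixed
        gamma2 = beta + n - q.sum()
        a = f1 * np.exp(digamma(gamma1))        # update q holding gamma fixed
        b = f0 * np.exp(digamma(gamma2))
        q = a / (a + b)
    return gamma1, gamma2

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(2.0, 1.0, 30), rng.normal(-1.0, 1.0, 70)])
g1, g2 = mixture_cavi(x, alpha=1.0, beta=1.0)
print(g1 / (g1 + g2))                           # mean of the Beta(gamma1, gamma2) approximation
```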
This post is based on the intermediate machine learning course SDS 365/565 at Yale University, taught by John Lafferty, for which I was a TF.