Approximate Inference
In the previous post, we talked about Gibbs sampling and posterior inference. However, computing Bayesian posteriors exactly can be impractical. There are two approaches:
 Simulation
 Variational Inference
Simulation is unbiased and is, in that sense, the “right” way to do it.
The template for the generative model is:
 Choose $Z$
 Given $z$, generate (sample) $X$
We often want to invert this by posterior inference:
 Given $X$
 What is $Z$ that generated it?
If we have a random vector $Z\sim p(Z \mid x)$, we might want to compute the following:
 marginal probabilities $P(Z_i=z \mid x)$
 marginal means $\mathbb{E}(Z_i \mid x)$
 most probable assignments $z^*= \arg\max_z P(\{Z_i=z_i\} \mid x)$. This is called the maximum a posteriori (MAP) problem.
 maximum marginals $z_i^* = \arg\max_{z_i} P(Z_i=z_i \mid x)$
 joint probability $P(Z \mid x)$
 joint mean $\mathbb{E}(Z \mid x)$
We can rarely solve these problems exactly in a reasonable amount of time. We have to approximate in some way.
Wrap up Gibbs sampling
Before continuing with simulation and variational methods, let’s finish derivations for Gibbs sampling.
$$ \begin{aligned} F &\sim \text{DP}(\alpha,F_0) \\ \theta_1,\theta_2,\dots,\theta_n \mid F &\sim F \\ X_i \mid \theta_i &\sim f(x \mid \theta_i), \quad i=1,2,\dots,n \end{aligned} $$
Last time we described the Chinese restaurant process (CRP) representation in which we keep the table parameters $\theta_i^*$.
Today, we consider the case where we drop the $\theta_i^*$ and work with table assignments $Z_1,Z_2,\dots,Z_n$ instead of $\theta_1,\theta_2,\dots,\theta_n$. Here $Z_i$ is the table at which customer $i$ sits (i.e., the clustering of the data). In the previous example, $Z_1=1$ and $Z_2=2$. We do not need the $\theta$'s.
In the Gibbs sampler, we iterate over $i$ from $1$ to $n$. We reassign $i$ to some table.
What is that, probabilistically?
 we compute the full conditional $P(z_i \mid Z_{-i})$
 we sample $z_i$ from this distribution.
Now, let’s compute: $$ P(Z_1=j \mid Z_2,Z_3,Z_4,Z_5) $$
There is one possibility per occupied table, plus one for a new table. For table 1 (occupied only by $x_5$): $$ \begin{align} j=1,\; w_1: \quad & \frac{1}{4+\alpha}\, p(x_1 \mid x_5) \end{align} $$
Example:
Assume $p(x\theta)$ is a Normal distribution $\text{Normal}(\theta,\sigma^2)$. $$ F_0 = \text{Normal}(\mu_0,\tau_0^2) $$
The posterior: $$ \begin{align} \text{Normal}(\bar{\theta}_n,\bar{\tau}_n^2), \quad \bar{\theta}_n = w_n \bar{X}_n+(1-w_n)\mu_0 \end{align} $$
Where:
$$ \begin{align} w_n = \frac{1}{1+\frac{\sigma^2/n}{\tau_0^2}}, \qquad \frac{1}{\bar{\tau}_n^2} = \frac{n}{\sigma^2} + \frac{1}{\tau_0^2} \end{align} $$
Therefore:
$$ p(x_1 \mid x_5) = \text{Normal}(w_1x_5+(1-w_1)\mu_0,\; \bar{\tau}_1^2+\sigma^2) $$
where $w_1$ and $\bar{\tau}_1^2$ are $w_n$ and $\bar{\tau}_n^2$ evaluated at $n=1$.
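The predictive computation above can be sketched in a few lines of Python. This is a minimal version under the Normal$(\theta,\sigma^2)$ likelihood and Normal$(\mu_0,\tau_0^2)$ base measure from the example; the function name is my own.

```python
def predictive_params(xs, mu0, tau0_sq, sigma_sq):
    """Posterior-predictive mean and variance for a Normal(theta, sigma^2)
    likelihood with a Normal(mu0, tau0^2) prior on theta, given the
    cluster members xs (a sketch of the formulas in the text)."""
    n = len(xs)
    xbar = sum(xs) / n
    w_n = 1.0 / (1.0 + (sigma_sq / n) / tau0_sq)      # shrinkage weight
    tau_n_sq = 1.0 / (n / sigma_sq + 1.0 / tau0_sq)   # posterior variance of theta
    mean = w_n * xbar + (1.0 - w_n) * mu0             # posterior mean of theta
    var = tau_n_sq + sigma_sq                         # predictive variance
    return mean, var
```

For a single observation $x_5 = 2$ with $\mu_0 = 0$, $\tau_0^2 = \sigma^2 = 1$, this gives predictive mean $1.0$ and variance $1.5$, matching $w_1 = 1/2$ and $\bar{\tau}_1^2 = 1/2$.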
Similarly:
$$
\begin{align}
j=2,\; w_2: \quad & \frac{3}{4+\alpha}\, p(x_1 \mid x_2,x_3,x_4), \\
\text{with mean } & w_3\,\frac{x_2+x_3+x_4}{3} + (1-w_3)\,\mu_0
\end{align}
$$
**Why $w_3$?** Because table 2 holds three data points.
And similarly for the variance, and for table 3.
How about the new table?
$$ \begin{align} j=\text{new table},\; w_0: \quad & \frac{\alpha}{4+\alpha}\, p(x_1 \mid F_0) \end{align} $$
where $p(x_1 \mid F_0) = \text{Normal}(\mu_0,\tau_0^2+\sigma^2)$
Now we have one weight per occupied table plus one for a new table; these numbers do not necessarily sum to one, so we normalize them. That gives the weights of a many-sided die.
If the die comes up $w_2$, I move the customer to table $2$; if it comes up “new table”, I move her to a new table; and if it comes up $w_1$, I leave her at table $1$. All of this is one step of the Gibbs sampler. Then I take another data point, $x_2$, and repeat.
The predictive density for a new point $x$ given this clustering is:
$$ \begin{aligned} \text{table 1: } &\frac{1}{5+\alpha}p(x \mid x_5) + \text{table 3: }\frac{1}{5+\alpha}p(x \mid x_1)\\ +\text{table 2: }&\frac{3}{5+\alpha}p(x \mid x_2,x_4,x_5)+\text{new table: }\frac{\alpha}{5+\alpha}p(x \mid F_0) \end{aligned} $$
Gibbs sampling for the DPM
For each point $x_i$:

For every nonempty cluster $j$, compute: $$ w_j = \frac{n_{j,-i}}{n-1+\alpha}\, p(x_i \mid x_{i'} \text{ in cluster } j \text{ for } i'\neq i) $$

For the empty cluster, compute:
$$ w_0 = \frac{\alpha}{n-1+\alpha}\,p(x_i \mid F_0) $$
 Normalize $w_j\leftarrow w_j/\sum_kw_k$
 Reassign $x_i$ to cluster $j$ with probability $w_j$, possibly starting a new cluster
For each clustering $c^{(b)}$, get a predictive density:
$$ p(x \mid c_{1:n}^{(b)},x_{1:n}) = \sum_j \frac{n_j^{(b)}}{n+\alpha}\, p(x \mid x_i \text{ in cluster }c_j^{(b)}) + \frac{\alpha}{n+\alpha}\, p(x \mid F_0) $$
The posterior predictive is then approximated by averaging over the $B$ sampled clusterings:
$$ p(x_{n+1} \mid x_{1:n}) = \frac{1}{B}\sum_{b=1}^B p(x_{n+1} \mid c_{1:n}^{(b)},x_{1:n}) $$
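One sweep of this sampler can be sketched as below. This is a minimal, illustrative version assuming the conjugate Normal-Normal setup from the example; the function and variable names are my own.

```python
import math
import random

def log_norm_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predictive(x, members, mu0, tau0_sq, sigma_sq):
    """p(x | cluster members) under the Normal-Normal model;
    an empty members list gives p(x | F0)."""
    n = len(members)
    if n == 0:
        return math.exp(log_norm_pdf(x, mu0, tau0_sq + sigma_sq))
    xbar = sum(members) / n
    w = 1.0 / (1.0 + (sigma_sq / n) / tau0_sq)
    tau_n_sq = 1.0 / (n / sigma_sq + 1.0 / tau0_sq)
    return math.exp(log_norm_pdf(x, w * xbar + (1 - w) * mu0, tau_n_sq + sigma_sq))

def gibbs_sweep(x, z, alpha, mu0, tau0_sq, sigma_sq, rng):
    """One Gibbs sweep: reassign each point given the others' clusters."""
    n = len(x)
    for i in range(n):
        # remove x_i, then collect the remaining cluster members
        clusters = {}
        for j, zj in enumerate(z):
            if j != i:
                clusters.setdefault(zj, []).append(x[j])
        labels = list(clusters)
        # one weight per occupied cluster, plus one for a new cluster
        w = [len(clusters[c]) / (n - 1 + alpha)
             * predictive(x[i], clusters[c], mu0, tau0_sq, sigma_sq)
             for c in labels]
        w.append(alpha / (n - 1 + alpha)
                 * predictive(x[i], [], mu0, tau0_sq, sigma_sq))
        # normalize and sample the new assignment
        u, acc = rng.random() * sum(w), 0.0
        for k, wk in enumerate(w):
            acc += wk
            if u <= acc:
                break
        z[i] = labels[k] if k < len(labels) else max(z) + 1
    return z
```

Running `gibbs_sweep` for $B$ sweeps and averaging `predictive` over the sampled clusterings yields the posterior-predictive approximation above.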
Variational methods
Variational methods are fundamentally different: Gibbs sampling gives a stochastic approximation, while variational methods give a deterministic approximation. Variational methods converge numerically, whereas in Gibbs sampling the notion of convergence is distributional.
Ising model
We have a graph with edges $E$ and vertices $V$. Each node $i$ has a random variable $Z_i$ that can be up ($Z_i=1$) or down ($Z_i=0$):
$$ P_{\beta}(Z_1,\dots,Z_n) \propto \exp\Big( \sum_{s\in V}\beta_s Z_s+ \sum_{(s,t)\in E}\beta_{st}Z_sZ_t \Big) $$
You can think of these atoms as politicians and $Z_i$ as votes. For example, politicians are heavily connected to members of their party and weakly connected with those outside their party.
I can’t compute mean posterior, marginal probability, or anything else. I need either stochastic approximation or variational approximation.
Gibbs sampler (stochastic approximation)
 Choose vertex $s\in V$ at random
 Sample $u\sim \text{Uniform}(0,1)$ and update:
$$ Z_s = \begin{cases} 1 & u \leq \Big(1+\exp\Big(-\beta_s-\sum_{t\in N(s)}\beta_{st}z_t\Big)\Big)^{-1}\\ 0 & \text{otherwise} \end{cases} $$
where $\beta_{st}$ denotes the strength of influence between nodes $s$ and $t$, and $N(s)$ is the set of neighbors of $s$. This is precisely the logistic regression form.
 Iterate
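The steps above can be sketched as follows. This is a minimal version under my own data-structure choices: node potentials in a list, symmetric edge potentials in a dict keyed by ordered pairs, and an adjacency list.

```python
import math
import random

def gibbs_sweep_ising(z, beta_node, beta_edge, neighbors, rng):
    """One sweep of the Gibbs sampler for the Ising model.
    beta_node[s] is the node potential, beta_edge[(s, t)] the edge
    potential (stored symmetrically), neighbors[s] the adjacency list."""
    for s in range(len(z)):
        # local field at node s given the current neighbor values
        field = beta_node[s] + sum(beta_edge[(s, t)] * z[t] for t in neighbors[s])
        p_up = 1.0 / (1.0 + math.exp(-field))  # P(Z_s = 1 | rest), logistic form
        z[s] = 1 if rng.random() <= p_up else 0
    return z
```

Iterating this sweep produces samples whose empirical distribution converges (in distribution) to $P_\beta$.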
Mean field variational algorithm (deterministic approximation)
 Choose vertex $s\in V$ at random
 Update:
$$ \mu_s = \Big(1+\exp\Big(-\beta_s-\sum_{t\in N(s)}\beta_{st}\mu_t\Big)\Big)^{-1} $$
 Iterate
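The mean-field iteration can be sketched in the same style as the Gibbs sweep (data structures and names are my own). The key difference is that each neighbor contributes its current mean $\mu_t$ rather than a sampled value $z_t$:

```python
import math

def mean_field_ising(beta_node, beta_edge, neighbors, iters=100):
    """Coordinate-wise mean-field updates for the Ising model.
    Returns mu, where mu[s] approximates E[Z_s]; the update is
    deterministic, so repeated runs give identical results."""
    mu = [0.5] * len(beta_node)  # initialize all means at 1/2
    for _ in range(iters):
        for s in range(len(mu)):
            # local field uses neighbor MEANS mu_t, not samples z_t
            field = beta_node[s] + sum(beta_edge[(s, t)] * mu[t] for t in neighbors[s])
            mu[s] = 1.0 / (1.0 + math.exp(-field))
    return mu
```

With no edges and zero node potentials, every $\mu_s$ stays fixed at $1/2$, as the update formula predicts.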
Deterministic vs. stochastic approximation
 The $z_i$ variables are random
 The $\mu_i$ variables are deterministic
 The Gibbs sampler convergence is in the distribution
 The mean field convergence is numerical
 The Gibbs sampler approximates the full distribution
 The mean field algorithm approximates the mean of each node
Example 2: A finite mixture model
Fix two distributions $F_0$ and $F_1$: $$ \begin{align} \theta &\sim \text{Beta}(\alpha,\beta) \\ X \mid \theta &\sim \theta F_1 +(1-\theta)F_0 \end{align} $$
The likelihood for the data $x_1,\dots,x_n$ is: $$ p(x_{1:n}) =\int_0^1 \text{Beta}(\theta \mid \alpha,\beta) \prod_{i=1}^n(\theta f_1(x_i)+(1-\theta)f_0(x_i))\,d\theta $$
Goal: to approximate the posterior $p(\theta \mid x_{1:n})$
Gibbs sampler (Stochastic approximation)
 sample $Z_i \mid \theta,x_{1:n}$
 sample $\theta \mid z_{1:n},x_{1:n}$
First step: independently sample $u_i\sim \text{Uniform}(0,1)$ and set:
$$ Z_i = \begin{cases} 1 & u_i \leq \frac{\theta f_1(x_i)}{\theta f_1(x_i)+(1-\theta)f_0(x_i)}\\ 0 & \text{otherwise} \end{cases} $$
Second step:
$$ \theta \sim \text{Beta}\Big( \sum_{i=1}^n z_i +\alpha,\; n-\sum_{i=1}^n z_i +\beta \Big) $$
The posterior $p(\theta \mid x_{1:n})$ is approximated as a mixture of Beta distributions with $n+1$ components. We could also integrate out $\theta$, as in a collapsed Gibbs sampler.
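The two steps can be sketched as a short loop. This is a minimal version with names of my own; `f0` and `f1` are the component density functions, and `rng.betavariate` draws $\theta$ from its Beta full conditional.

```python
import math
import random

def gibbs_mixture(x, f0, f1, alpha, beta, iters, rng):
    """Gibbs sampler for the two-component mixture: alternately
    sample z_i | theta, x and theta | z. Returns the theta samples."""
    n = len(x)
    theta = 0.5
    samples = []
    for _ in range(iters):
        # first step: sample each z_i from its responsibility
        z = []
        for xi in x:
            p1 = theta * f1(xi)
            p1 = p1 / (p1 + (1 - theta) * f0(xi))
            z.append(1 if rng.random() <= p1 else 0)
        # second step: sample theta from the Beta full conditional
        s = sum(z)
        theta = rng.betavariate(s + alpha, n - s + beta)
        samples.append(theta)
    return samples
```

A histogram of the returned samples approximates the $(n+1)$-component Beta mixture described above.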
Variational Inference (deterministic approximation)
Iterate the following steps for the variational parameters $q_{1:n}$ and $(\gamma_1,\gamma_2)$:

Holding $q_i$ fixed, set $\gamma=(\gamma_1,\gamma_2)$ to $$ \gamma_1 = \alpha + \sum_{i=1}^n q_i , \qquad \gamma_2 = \beta+n-\sum_{i=1}^n q_i $$

Holding $\gamma_1$ and $\gamma_2$ fixed, set $q_i$ to: $$ q_i = \frac{f_1(x_i)\exp \psi(\gamma_1)}{f_1(x_i)\exp \psi(\gamma_1)+f_0(x_i)\exp \psi(\gamma_2)} $$
After convergence, the approximate posterior distribution over $\theta$ is:
$$ \hat{p}(\theta \mid x_{1:n}) = \text{Beta}(\theta \mid \gamma_1,\gamma_2) $$
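The coordinate-ascent loop above can be sketched as follows. This is a minimal version with my own naming; to keep it dependency-free, the digamma function $\psi$ is approximated by a central difference of `math.lgamma` rather than a library call.

```python
import math

def digamma(x, h=1e-6):
    # numerical digamma: central difference of log-gamma
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def variational_mixture(x, f0, f1, alpha, beta, iters=100):
    """Coordinate-ascent variational inference for the two-component
    mixture: alternate the gamma update (q fixed) and the q update
    (gamma fixed). Returns (gamma1, gamma2, q)."""
    n = len(x)
    q = [0.5] * n
    for _ in range(iters):
        # step 1: update gamma holding q fixed
        gamma1 = alpha + sum(q)
        gamma2 = beta + n - sum(q)
        # step 2: update each q_i holding gamma fixed
        e1, e2 = math.exp(digamma(gamma1)), math.exp(digamma(gamma2))
        q = [f1(xi) * e1 / (f1(xi) * e1 + f0(xi) * e2) for xi in x]
    return gamma1, gamma2, q
```

Note that $\gamma_1+\gamma_2 = \alpha+\beta+n$ at every iteration, and after convergence $\text{Beta}(\gamma_1,\gamma_2)$ is the deterministic approximation to $p(\theta \mid x_{1:n})$.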
We covered this material in the intermediate machine learning course SDS 365/565 at Yale University, taught by John Lafferty, where I was a TF.