Empirical Risk Minimization

Empirical Risk Minimization: Given $(x_1,y_1), .., (x_n,y_n) \in X \times y$ and we want to predict $f(x)=y$. Choose hypothesis class $\mathcal{F}={f_{\theta}: \theta \in \mathcal{\theta}}$. Choose: $$ \theta^* = \arg \min_{\theta \in \mathcal{\theta}} \sum_{i=1}^n \text{Loss}(f_{\theta}(x_i,y_i)) $$

Let’s review maximum likelihood estimation in this context. Maximum likelihood estimation is a probabilistic model defined by $y|x,\theta \sim P_{\theta}(y|x) = p(y| x,\theta)$. Our observations are $(x_1,y_1),..,(x_n,y_n) \in X \times y$.

$$ \begin{align} \begin{aligned} \theta_{\text{MLE}} = & \arg \max_{\theta} P_{\theta} (y_1|x_1)..P_{\theta} (y_n|x_n)\\\ =&\arg \max_{\theta} \sum_{i=1}^{n}\log P_{\theta} (y_i|x_i) \\\ =&\arg \min_{\theta} \sum_{i=1}^{n}-\log P_{\theta} (y_i|x_i) \end{aligned} \end{align} $$

and we know that $\text{Loss}f_{\theta}(x_i,y_i)=-\log P_{\theta} (y_i|x_i) $. In other words $\text{MLE} \equiv \text{ERM}$.

Regularized ERM as MAP

Regularized ERM is defined as:

$$ \theta^* = \arg \min_{\theta\in \mathcal{\theta}} \sum_{i=1}^n \text{Loss}f_{\theta}(x_i,y_i) + \lambda \text{Reg}(\theta) $$

MAP: Prior $\theta \sim \pi(\theta)$ and $y|\theta,x \sim p(y|x,\theta)$. Then we can define MAP with the following:

$$ \begin{align} \begin{aligned} \theta_{\text{MAP}} = & \arg \max_{\theta} P_{\theta} (y_1|x_1,\theta).. P_{\theta} (y_n|x_n) \pi(\theta)\\\ =&\arg \max_{\theta} \sum_{i=1}^{n}\log P_{\theta} (y_i|x_i,\theta) + \log \pi(\theta) \\\ =&\arg \min_{\theta} \sum_{i=1}^{n}-\log P_{\theta} (y_i|x_i) - \log \pi (\theta) \end{aligned} \end{align} $$

where $\text{Loss}f_{\theta}(y_i|x_i,\theta)=-\log P_{\theta} (y_i|x_i,\theta) $. In other words, $\text{MAP} \equiv \text{ Regularized } \text{ERM}$ with $\log$ loss and regularizer is given by log prior.

Example:

Regression: We know that $y = \theta.x$

Gaussian Model: observe $y = \theta.x + \epsilon$ where $\epsilon \sim \text{Normal}(0,\sigma^2)$. Recall that one dimensional Gaussian $\text{Normal}(\mu,\sigma^2)$ has density:

$$ \begin{align} \begin{aligned} \text{Normal}(y|\mu,\sigma^2) = \frac{1}{\sqrt(2\pi \sigma^2)} e ^{-\frac{(y-\mu)^2}{2\sigma^2}} \\\ \mu = \text{mean} = E[Y] \in \mathbb{R} \\\ \sigma^2 = \text{variance} = E[(y-\mu)^2] > 0 \end{aligned} \end{align} $$

In this case, probabilistic model is $p(y| x,\theta) = \text{Normal}(y| \theta.x,\sigma^2)$ where $\mu = \theta.x$. For ERM, the loss is: \begin{align} \begin{aligned} \text{Loss}(\theta,x,y) = -\log p(y|x,\theta) = \frac{(y-\theta.x)^2}{2\sigma^2} + \frac{1}{2} \log (2\pi \sigma^2) \end{aligned} \end{align}

Quadratic regularizer:

Assume our regularizer is $C(\theta) = \frac{1}{2}||\theta||^2$. Assume a Gaussian prior on $\theta \sim \text{Normal}(\mu,\Sigma)$ where $\Sigma = \Sigma^T \in R^{d\times d}$ with the following density:

$$ \text{Normal}(\theta| \mu,\Sigma) = \frac{1}{\text{det}(2\pi \Sigma)} e^{-\frac{(\theta-\mu)^T\Sigma^{-1}(\theta-\mu)}{2}} $$

where $\text{det}(2\pi \Sigma)=(2\pi)^d \text{det}(\Sigma)$. Then we can re-write the regularized function as: \begin{align} \begin{aligned} C(\theta) =& -\log \text{Normal}(\theta|\mu,\Sigma) \\\ =& \frac{(\theta-\mu)^T\Sigma^{-1}(\theta-\mu)}{2} + \frac{1}{2}\log \text{det}(2\pi \Sigma) \\\ =& \frac{(\theta-\mu)^T\Sigma^{-1}(\theta-\mu)}{2}+ \frac{1}{2}\log \text{det}(\Sigma) + \frac{d}{2}\log \text{det}(2\pi)
\end{aligned} \end{align}

where $\frac{d}{2}\log \text{det}(2\pi) $ is a constant. If we assume $\mu=0$ and $\Sigma=I$: $$ C(\theta) = \frac{1}{2} ||\theta||^2_2 + \text{c} $$

Ridge regression with Quadratic Regularizer:

This comes from the Gaussian-Gaussian model:

\begin{align} \begin{aligned} \theta \sim \text{Normal}(\mu,\Sigma ) \\\ y| x,\theta \sim \text{Normal}(x.\theta,\Sigma^2) \end{aligned} \end{align}

where $y=x.\theta+\epsilon$ and $\epsilon \sim \text{Normal}(0,\sigma^2)$.

L1 regularizer: $C(\theta) = ||\theta||1=\sum{i=1}^d |\theta_i|$

Laplace distribution: In one dimension: $p(\theta|\mu,b) = \frac{1}{2b}e^{-\frac{|\theta-\mu|}{b}} $ Laplace has a heavier tail than Gaussian.

In d-dimension: $p(\theta|\mu,b) = \frac{1}{(2b)^d}e^{-\frac{||\theta-\mu||1}{b}} = \prod{i=1}^d p(\theta_i| \mu_i,b) $ which corresponds to L1 regularizer: \begin{align} \begin{aligned} C(\theta) =& -\log p(\theta | \mu,b) \\\ =& \frac{1}{b}||\theta-\mu||_1 + d \log (2b) \end{aligned} \end{align}

Lasso regression: Quadratic loss and L1 regularization which comes from: \begin{align} \begin{aligned} \theta \sim \text{Lap}(\mu,b ) \\\ y| x,\theta \sim \text{Normal}(x.\theta,\sigma^2) (\text{ MAP estimation}) \end{aligned} \end{align}

We covered this post in the introduction to machine learning CPCS 481/581, Yale University, Andre Wibisono where I (joint with Siddharth Mitra) was TF.