Mercer's Theorem

Suppose we are given data points $(X_1,Y_1), (X_2,Y_2), \ldots, (X_n,Y_n)$, and we want to predict $Y$ from the design matrix $X$. Let's say $Y_i=m(X_i)+\epsilon_i$, where $m(x)$ is a smooth function of $x$.

In statistics, the most popular methods to estimate $m(x)$ are kernel methods:

  1. Smoothing kernels
  2. Mercer kernels

The primary distinction between smoothing and Mercer kernels is that the former employs local averaging, whereas the latter employs regularization.

Smoothing kernel estimator

$$ \hat{m_h}(x) = \frac{\sum_{i=1}^n Y_i K_h(X_i,x)}{\sum_{i=1}^n K_h(X_i,x)} $$

where $K_h(x,z)$ is a kernel. For example, let's say our kernel is a Gaussian:

$$ K_h(x,z) = \exp \Big( -\frac{||x-z||^2}{2h^2} \Big) $$

$\hat{m_h}(x)$ is just a local average of the $Y_i$'s near $x$. The bandwidth $h$ controls the bias-variance tradeoff: smaller $h$ gives larger variance, while larger $h$ gives larger bias.
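
Here is a minimal sketch of this estimator in Python; the function names (`gaussian_kernel`, `nadaraya_watson`) and the toy data are my own, not from the course notes.

```python
import numpy as np

def gaussian_kernel(x, z, h):
    """K_h(x, z) = exp(-||x - z||^2 / (2 h^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * h ** 2))

def nadaraya_watson(x, X, Y, h):
    """Smoothing kernel estimate m_hat(x): a weighted local average of the Y_i."""
    weights = np.array([gaussian_kernel(Xi, x, h) for Xi in X])
    return np.sum(weights * Y) / np.sum(weights)

# Toy data: noisy sine curve, one-dimensional X.
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
Y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(200)

print(nadaraya_watson(np.array([np.pi / 2]), X, Y, h=0.3))  # compare with sin(pi/2) = 1
```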

Risk of smoothing kernels

$$ \text{Risk} = \mathbb{E}[Y-\hat{m_h}(X)]^2 = \text{bias}^2+\text{variance}+\sigma^2 $$

Let's make an assumption on the distribution of the data and on the smoothness of $m$, under which:

$$ \begin{aligned} \text{bias}^2 \approx h^4 \\ \text{variance} \approx \frac{1}{nh^p} \end{aligned} $$

where $p=$ dimension of $X$ and $\sigma^2=\mathbb{E}[Y-m(X)]^2$ is the unavoidable prediction error. We can think of this another way: when we set $h$ small, the bias is low and the variance is high, which means undersmoothing. On the other hand, when $h$ is large, we are averaging over many surrounding points, which causes high bias and low variance; this is called oversmoothing.

Thus, we need to find the optimum value for $h$.
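
Balancing the two terms above (ignoring constants) gives the classical bandwidth choice and rate; a quick sketch of the calculation:

$$ \begin{aligned} \text{Risk}(h) - \sigma^2 &\approx h^4 + \frac{1}{nh^p} \\ \frac{d}{dh}\Big( h^4 + \frac{1}{nh^p} \Big) = 4h^3 - \frac{p}{nh^{p+1}} = 0 \;&\Longrightarrow\; h_{*} \asymp n^{-\frac{1}{4+p}} \\ \text{Risk}(h_{*}) - \sigma^2 &\approx n^{-\frac{4}{4+p}} + n^{-\frac{4}{4+p}} \asymp n^{-\frac{4}{4+p}} \end{aligned} $$

This $n^{-4/(4+p)}$ rate is the one that reappears in the curse-of-dimensionality discussion below.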

How can we estimate the risk?

Let $\hat{Y}_{-i}=\hat{m}_{h,-i}(X_i)$ denote the estimate of $m(X_i)$ based on all of the data except $(X_i,Y_i)$. We can estimate the risk using leave-one-out cross-validation:

$$ \hat{R}(h) = \frac{1}{n} \sum_{i=1}^n(Y_i-\hat{Y}_{-i})^2 $$

We can show this in a very compact form as follows:

$$ \hat{R}(h) = \frac{1}{n} \sum_{i=1}^n\Big(\frac{Y_i-\hat{Y_i}}{1-L_{ii}}\Big)^2 $$

where $L_{ii} = \frac{K_h(X_i,X_i)}{\sum_{i’=1}^n K_h(X_i,X_{i’})}$. This leads to a big computational saving, since it shows that, to compute the leave-one-out cross-validation error, we never actually have to compute $\hat{Y}_{-i}$. To prove this, the following hints are helpful:

$$ \begin{aligned} \sum_{i’}{K_h(X_i,X_{i’})} = K_h(X_i,X_i) + \sum_{i’\neq i}{K_h(X_i,X_{i’})} \\ Y_i-\hat{Y_i} = Y_i - \frac{\sum_{i’}Y_{i’}K_h(X_{i’},X_i)}{\sum_{i’}K_h(X_{i’},X_i)} \end{aligned} $$
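
A small numerical check of this shortcut, comparing the brute-force leave-one-out computation with the $1-L_{ii}$ identity (a sketch; the function names and toy data are mine):

```python
import numpy as np

def gram(X, h):
    """Gram matrix K[i, j] = K_h(X_i, X_j) for the Gaussian kernel."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * h ** 2))

def loocv_direct(X, Y, h):
    """Hold out each point and recompute the local average without it."""
    K, n = gram(X, h), len(Y)
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        w = K[i, keep]
        errs.append((Y[i] - np.sum(w * Y[keep]) / np.sum(w)) ** 2)
    return np.mean(errs)

def loocv_shortcut(X, Y, h):
    """Same quantity via the (Y_i - Yhat_i) / (1 - L_ii) identity."""
    K = gram(X, h)
    L = K / K.sum(axis=1, keepdims=True)      # smoother matrix, rows sum to 1
    Yhat = L @ Y
    return np.mean(((Y - Yhat) / (1 - np.diag(L))) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
Y = np.sin(4 * X[:, 0]) + 0.2 * rng.standard_normal(50)
print(loocv_direct(X, Y, h=0.1), loocv_shortcut(X, Y, h=0.1))   # the two numbers agree
```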

The curse of dimensionality

We can apply the method in any dimension $p$. Yet, it doesn't work as well in high dimensions.

  1. The squared bias scales as $h^4$ and the variance scales as $\frac{1}{nh^p}$
  2. Then, the risk goes down no faster than $n^{-4/(4+p)}$

$$ \liminf_{n\to\infty}\; n^{\frac{4}{4+p}}\, R(\hat{m}_n,m) > 0 $$

How many data points do we need to make the risk as small as $\epsilon$?

$$ n\geq \Big(\frac{1}{\epsilon}\Big)^{1+p/4} $$
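
To see where this comes from, set the rate from the previous section equal to the target error and solve for $n$ (constants dropped); for instance, $\epsilon=0.1$ and $p=20$ already require $n \geq 10^6$ observations:

$$ n^{-\frac{4}{4+p}} \leq \epsilon \;\Longleftrightarrow\; n \geq \epsilon^{-\frac{4+p}{4}} = \Big(\frac{1}{\epsilon}\Big)^{1+p/4} $$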

But why? What does our assumption on the distribution of the data imply? Let's consider a very simple example with uniformly distributed data and a box kernel. Suppose the support of the distribution of inputs is an $l_{\infty}$-ball of radius $\frac{1}{2}$ in dimension $p=2$. Then with $h_n=1/8$, in order to have a consistent estimate at each of the $(2h_n)^{-p}=16$ candidate test points $\hat{x}_1,\ldots,\hat{x}_{16}$, the number of points in each of the $m=(2h_n)^{-p}$ balls has to grow unbounded, and thus $\frac{n}{m}=n(2h_n)^p$ has to grow unbounded.

In other words:

$$ \begin{aligned} h_n &\rightarrow 0\\ nh_n^p &\rightarrow + \infty \end{aligned} $$

Therefore, in order to obtain a consistent estimator, the bandwidth of translation-invariant kernels has to go to zero slowly enough, slower than $n^{−1/p}$, and the final estimation error converges to zero, but slower than $n^{−1/p}$. This is the usual curse of dimensionality. In order to obtain a certain error $\epsilon$, the number of observations has to grow at least as $\epsilon^{-p}$, and thus exponentially in the dimension. When the underlying function has weak smoothness properties (here just Lipschitz-continuous), such a dependence on $p$ is provably unavoidable in general (see here for more information).

Mercer kernels

Instead of using local smoothing, we can optimize the fit to the data subject to regularization (penalization). For example, let's choose $\hat{m}$ to minimize:

$$ \sum_i (Y_i -\hat{m}(X_i))^2 +\lambda \text{penalty}(\hat{m}) $$

Sometimes people call $\text{penalty}(\hat{m})$ a roughness penalty; $\lambda$ is the parameter that controls the amount of smoothing.

How do we construct a penalty that measures roughness?

There are two approaches:

  1. Mercer’s kernel
  2. Reproducing Kernel Hilbert Spaces (RKHS)

A Mercer kernel $K(x,x’)$ is symmetric and positive definite:

$$ \int \int f(x) f(x’) K(x,x’) dx dx’ \geq 0 \quad \forall f $$

For example $K(x,x’)=e^{-||x-x’||^2/2}$ is a Mercer kernel.

Since $K(x,x’)$ is a similarity function between $x$ and $x’$, we can create a set of basis functions based on $K$: for a fixed $z$, $K_z(x)=K(z,x)$ is a function of $x$.

Recall

Recall that a kernel function computes the inner product of the feature maps of two data points under an embedding $\phi$:

$$ K(x,z) = <\phi(x) ,\phi(z)> $$

Verifying a kernel function (method 1):

Forming a matrix of the pairwise evaluations of a kernel function on a set of inputs always gives a positive semi-definite matrix. One way of verifying that a function is a kernel is to construct a feature space for which the function corresponds to first performing the feature mapping and then computing the inner product between the two images. For example, the polynomial function is a kernel, and the exponential of the cardinality of a set intersection is also a kernel.
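
As a quick numerical illustration of the first sentence, we can form the Gram matrix of the Gaussian kernel on random inputs and check that its eigenvalues are non-negative (a sketch, not a proof; the inputs are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))            # 100 random points in R^3

# Gram matrix of the Gaussian kernel K(x, z) = exp(-||x - z||^2 / 2).
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-d2 / 2)

eigvals = np.linalg.eigvalsh(K)              # K is symmetric, so eigvalsh applies
print(eigvals.min())                         # >= 0 up to floating-point error
```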

Verifying a kernel function (method 2):

We assume that $K$ satisfies the finitely positive semi-definite property and proceed to construct a feature mapping $\phi$ into a Hilbert space for which $K$ is the kernel. There is one slightly unusual aspect of the construction in that the elements of the feature space will in fact be functions. They are, however, points in a vector space and will fulfil all the required properties.

Let's define:

$$ \mathcal{F} = \Big\{ f(.) = \sum_{j=1}^k \beta_j K(z_j,.) \Big\} $$

We define a norm via $||f||^2_K = \sum_j \sum_k \beta_j \beta_k K(z_j,z_k)$. If $||f||_K$ is small, then we can say that $f$ is smooth.

If $f=\sum_r \alpha_r K(z_r,.)$ and $g=\sum_s \beta_s K(w_s,.)$, the inner product is:

$$ <f,g>_K = \sum_r \sum_s \alpha_r \beta_s K(z_r,w_s) $$

$\mathcal{F}$ is a reproducing kernel Hilbert space (RKHS) since:

$$ <f,K(x,.)> = f(x) $$

Hint:

$$ \begin{aligned} f= & \sum_r \alpha_r K(z_r,.) \\ g= & \sum_s \beta_s K(w_s,.) \\ <f,g> = & \sum_r \sum_s \alpha_r \beta_s K(z_r,w_s) \\ = & \sum_r \alpha_r g(z_r) \\ = & \sum_s \beta_s f(w_s) \end{aligned} $$

where the second and third equalities follow from the definitions of $f$ and $g$. It is clear from these equalities that $⟨f, g⟩$ is real-valued, symmetric and bilinear and hence satisfies the properties of an inner product.

We also need to check that the inner product is positive:

$$ <f,f> \geq 0 \quad \forall \quad f\in \mathcal{F} $$

but this follows from the assumption that all kernel matrices are positive semi-definite, since:

$$ \begin{aligned} <f,f> = & \sum_r \sum_s \alpha_r \alpha_s K(z_r,z_s) \\ =& \alpha^T \mathbb{K} \alpha \geq 0 \end{aligned} $$

Then the last property:

$$ \begin{aligned} <f,K(x,.)> = & \sum_{r} \alpha_r K(z_r,x) \\ = & f(x) \end{aligned} $$

follows directly from the first few equalities after the Hint.
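
A small numerical check of the equalities above, using the Gaussian kernel with arbitrary centers and coefficients (all names here are my own):

```python
import numpy as np

def K(x, z):
    """Gaussian Mercer kernel K(x, z) = exp(-||x - z||^2 / 2)."""
    return np.exp(-np.sum((x - z) ** 2) / 2)

rng = np.random.default_rng(0)
Z, alpha = rng.standard_normal((4, 2)), rng.standard_normal(4)   # f = sum_r alpha_r K(z_r, .)
W, beta = rng.standard_normal((3, 2)), rng.standard_normal(3)    # g = sum_s beta_s K(w_s, .)

f = lambda x: sum(a * K(z, x) for a, z in zip(alpha, Z))
g = lambda x: sum(b * K(w, x) for b, w in zip(beta, W))

# <f, g>_K computed three equivalent ways (the equalities after the Hint).
ip1 = sum(a * b * K(z, w) for a, z in zip(alpha, Z) for b, w in zip(beta, W))
ip2 = sum(a * g(z) for a, z in zip(alpha, Z))
ip3 = sum(b * f(w) for b, w in zip(beta, W))
print(ip1, ip2, ip3)                     # all equal

# Reproducing property: <f, K(x, .)>_K = f(x).
x = rng.standard_normal(2)
print(sum(a * K(z, x) for a, z in zip(alpha, Z)), f(x))
```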

Nonparametric Regression: Mercer Kernels

Let’s get back to our original problem which was estimating $\hat{m}(x)$ by minimizing:

$$ J(m)= \sum_{i=1}^n (Y_i - m(X_i))^2 + \lambda ||m||^2_K $$

Then, by the representer theorem, the minimizer has the form $$ \hat{m}(x) = \sum_{i=1}^n \alpha_i K(X_i,x) $$

for some coefficients $\alpha_1,..,\alpha_n$.

How to find $\alpha$?

Let's plug $\hat{m}(x) = \sum_{i=1}^n \alpha_i K(X_i,x)$ into the objective $J$:

$$ \begin{aligned} J(\alpha) = & || Y- \mathbb{K} \alpha ||^2 +\lambda \alpha^T \mathbb{K} \alpha \\ \hat{\alpha} = & (\mathbb{K}+\lambda I )^{-1}Y \\ \hat{m}(x) = & \sum_i \hat{\alpha_i} K(X_i,x) \end{aligned} $$

where $\mathbb{K}$ is the Gram matrix with $\mathbb{K}_{jk} = K(X_j,X_k)$. As with smoothing kernel regression, we choose $\lambda$ to balance the bias-variance tradeoff, again by cross-validation.
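
A minimal sketch of this estimator in Python (it is often called kernel ridge regression); the Gaussian kernel, bandwidth, and toy data are my own choices, while the $\hat{\alpha}=(\mathbb{K}+\lambda I)^{-1}Y$ step is the formula above:

```python
import numpy as np

def gram(X, Z, h):
    """Gaussian kernel Gram matrix, K[i, j] = exp(-||X_i - Z_j||^2 / (2 h^2))."""
    d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * h ** 2))

def fit(X, Y, h, lam):
    """alpha_hat = (K + lambda I)^{-1} Y."""
    K = gram(X, X, h)
    return np.linalg.solve(K + lam * np.eye(len(Y)), Y)

def predict(x_new, X, alpha, h):
    """m_hat(x) = sum_i alpha_i K(X_i, x)."""
    return gram(np.atleast_2d(x_new), X, h) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
Y = np.sin(3 * X[:, 0]) + 0.2 * rng.standard_normal(100)

alpha = fit(X, Y, h=0.5, lam=0.1)
print(predict([[0.5]], X, alpha, h=0.5))   # compare with sin(1.5)
```

Note that `np.linalg.solve` is used rather than forming the inverse explicitly, which is the usual numerically safer choice.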

In smoothing kernels, the bandwidth $h$ controls the amount of smoothing, while in Mercer kernels the norm $||f||_K$ (through the penalty weight $\lambda$) controls the amount of smoothing. In practice these two methods give answers that are very similar.

For more information about Mercer’s theorem see here.

Additive models

Additive models place more structure on $m$ than kernel smoothing or Mercer kernel regression, which helps them behave better in high dimensions.

$$ m(x)= \beta_0 + \sum_{j=1}^p m_j(x_j) $$

We can drop $\beta_0$ from now on (for example, by centering the $Y_i$). Basically, we want to minimize: $$ \sum_{i=1}^n \Big( Y_i - \big(m_1(X_{i1})+ \ldots + m_p(X_{ip})\big) \Big)^2 $$

subject to $m_j$ smooth.

Backfitting algorithm

In this algorithm, we do the following (a rough code sketch follows the list):

  1. take data points $(X_i,Y_i)$, $i=1,\ldots,n$
  2. until convergence
  3. for each j=1,..,p
  4. compute residual: $R_j=Y-\sum_{k\neq j} \hat{m_k}(X_k)$
  5. smooth $\hat{m_j}=S_jR_j$
  6. return $\hat{m}(X_i) = \sum_j \hat{m_j}(X_{ij})$
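
A rough sketch of this loop in Python, using a one-dimensional Nadaraya-Watson smoother in the role of each $S_j$; the smoother, bandwidth, fixed iteration count, and toy data are my own assumptions:

```python
import numpy as np

def smooth_1d(x, r, h=0.3):
    """A Nadaraya-Watson smoother playing the role of S_j: smooth residuals r against x."""
    w = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * h ** 2))
    return (w @ r) / w.sum(axis=1)

def backfit(X, Y, n_iter=20):
    """Fit the additive model m(x) = sum_j m_j(x_j) by backfitting."""
    n, p = X.shape
    fitted = np.zeros((n, p))                            # fitted[:, j] holds m_j(X_ij)
    for _ in range(n_iter):                              # "until convergence" (fixed number of sweeps here)
        for j in range(p):
            R_j = Y - fitted.sum(axis=1) + fitted[:, j]  # residual leaving out component j
            fitted[:, j] = smooth_1d(X[:, j], R_j)       # m_j_hat = S_j R_j
    return fitted

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 3))
Y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(300)  # m_3 = 0

fitted = backfit(X, Y)
print(np.mean((Y - fitted.sum(axis=1)) ** 2))            # residual MSE, near the noise level 0.1^2
```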

Sparse additive models

In the additive model we have $Y_i = \sum_{j=1}^p m_j(X_{ij})+\epsilon_i$ for $i=1,\ldots,n$. In high dimensions, where $n \ll p$, most $m_j=0$, and we want to minimize:

$$ \mathbb{E}\Big( Y - \sum_{j=1}^p m_j(X_{j}) \Big)^2 $$

subject to: $$ \begin{aligned} \sum_{j=1}^p \sqrt{\mathbb{E}(m_j^2)} \leq L_n, \\ \mathbb{E}(m_j) = 0 \end{aligned} $$

This method generalizes the lasso.

Sparse backfitting algorithm

  1. take data points $(X_i,Y_i)$, $i=1,\ldots,n$, and a regularization parameter $\lambda$
  2. until convergence
  3. for each j=1,..,p
  4. compute residual: $R_j=Y-\sum_{k\neq j} \hat{m_k}(X_k)$
  5. smooth $\hat{m_j}=S_jR_j$
  6. estimate the norm: $\hat{s}_j= \sqrt{\mathbb{E}(\hat{m}_j^2)}$
  7. soft threshold: $\hat{m}_j \leftarrow \Big[ 1- \frac{\lambda}{\hat{s}_j}\Big]_+ \hat{m}_j$ (a code sketch of this update follows the list)
  8. return $\hat{m}(X_i) = \sum_j \hat{m}_j(X_{ij})$
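
The only new ingredients relative to plain backfitting are steps 6 and 7; here is a sketch of that update for a single fitted component, using the empirical version of $\mathbb{E}(\hat{m}_j^2)$ (names and data are mine):

```python
import numpy as np

def sparse_update(m_j_fitted, lam):
    """Steps 6-7: estimate the norm of component j, then soft-threshold it."""
    s_j = np.sqrt(np.mean(m_j_fitted ** 2))        # s_j = sqrt(E[m_j^2]), empirical version
    shrink = max(0.0, 1.0 - lam / s_j)             # [1 - lambda / s_j]_+
    return shrink * m_j_fitted                     # small components are set exactly to zero

# Example: a weak component gets zeroed out, a strong one is only shrunk.
weak = 0.05 * np.sin(np.linspace(0, 1, 100))
strong = np.sin(np.linspace(0, 1, 100))
print(np.abs(sparse_update(weak, lam=0.1)).max())    # 0.0
print(np.abs(sparse_update(strong, lam=0.1)).max())  # shrunk but nonzero
```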

The sparse backfitting algorithm generalizes the coordinate descent algorithm from last time.

We covered this material in the intermediate machine learning course (SDS 365/565, Yale University, taught by John Lafferty), where I was a TF.