Representer Theorem

In the previous post, we talked about Mercer’s theorem and defined a Mercer kernel as one satisfying $\int f(x) f(y) K(x,y) dx dy \geq 0$ for any function $f$. Analogously, we call a matrix $M$ positive semi-definite if $f^TMf\geq 0$ for any $f\in \mathbb{R}^n$. The eigenvalues of such a matrix must then be non-negative:

$$ \lambda_1 \geq \lambda_2 \geq .. \geq \lambda_n \geq 0 $$
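As a quick sanity check, here is a minimal sketch (assuming a Gaussian kernel and random sample points, both hypothetical choices) that builds a kernel matrix and verifies its eigenvalues are non-negative:

```python
import numpy as np

# Hypothetical example: a Gaussian kernel matrix on a few random points.
rng = np.random.default_rng(0)
x = rng.normal(size=10)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # K[i, j] = K(x_i, x_j)

# Eigenvalues of a symmetric PSD matrix are non-negative (up to round-off).
eigvals = np.sort(np.linalg.eigvalsh(K))[::-1]
print(eigvals)   # lambda_1 >= lambda_2 >= ... >= lambda_n >= 0
```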

Reminder: Eigenvectors and eigenfunctions

An eigenvector $v_j$ of a matrix $M$ satisfies: $$ Mv_j = \lambda_j v_j $$

Similarly, we can define eigenfunctions when the operator $K$ acts on a function space:

$$ Kf_j = \lambda_j f_j $$

where by definition:

$$ (Kf_j)(x)= \int K(x,y) f_j(y)dy $$
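To make the eigenfunction equation concrete, here is a rough sketch that discretizes the integral operator on a grid, so the operator becomes a matrix and its eigenvectors approximate the eigenfunctions (the grid, bandwidth, and Gaussian kernel are illustrative choices):

```python
import numpy as np

# Discretize [0, 1]; the operator (Kf)(x) = int K(x, y) f(y) dy becomes
# the matrix K(x_i, y_j) * dy acting on grid values of f.
grid = np.linspace(0.0, 1.0, 200)
dy = grid[1] - grid[0]
K = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / 0.2) ** 2)

# Eigenvectors of K * dy approximate eigenfunctions sampled on the grid.
lam, funcs = np.linalg.eigh(K * dy)
lam, funcs = lam[::-1], funcs[:, ::-1]   # sort eigenvalues in decreasing order
print(lam[:5])                           # leading eigenvalues, all >= 0
```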

Let’s suppose that $K$ has a special property: it depends only on the difference of its arguments, so the operator becomes

$$ (Kf)(x) = \int K(x-y) f(y) dy $$

There is a name for this integral: convolution. We are taking a kernel and convolving it with the function $f$.
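A small sketch of this, with a Gaussian difference kernel and a step function chosen purely for illustration:

```python
import numpy as np

grid = np.linspace(-3.0, 3.0, 601)
dx = grid[1] - grid[0]

kernel = np.exp(-0.5 * (grid / 0.3) ** 2)   # K(x - y) as a function of the difference
f = (grid > 0).astype(float)                # a step function to smooth

# Discrete version of int K(x - y) f(y) dy.
smoothed = np.convolve(f, kernel, mode="same") * dx
```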

Hilbert Space

We define a Hilbert space of functions and then equip it with an inner product and a norm. The space consists of functions of the form $$ \Big\{ f(.) = \sum_j \alpha_j K(x_j,.) \Big\} $$

Reminder

In kernel density estimation, we place kernels centered at data points $x_i$: $\frac{1}{h}K(\frac{x_i-.}{h})$. In this Hilbert space, however, there are no data points yet; the $x_j$ are just parameters. Furthermore, the weights $\alpha_j$ are arbitrary and can be either positive or negative, whereas in kernel density estimation the weights are all equal to $1/n$.
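A minimal sketch of a function in this span, with centers, weights, and kernel chosen arbitrarily for illustration; note the weights are not constrained to be $1/n$ or even positive:

```python
import numpy as np

def rbf(x, y, h=0.5):
    # Gaussian kernel, an illustrative choice of Mercer kernel.
    return np.exp(-0.5 * ((x - y) / h) ** 2)

centers = np.array([-1.0, 0.0, 2.0])   # the parameters x_j (not data points)
alpha = np.array([1.5, -0.7, 0.3])     # arbitrary signs and magnitudes

def f(t):
    # f(t) = sum_j alpha_j K(x_j, t)
    return np.sum(alpha * rbf(centers, t))

print(f(0.5))
```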

Now let’s define the inner product and the norm in this space:

$$ ||f||_K^2 = <f,f>_K $$

Let’s define two functions:

$$ \begin{aligned} f(.) = \sum_j \alpha_j K(x_j,.) \\ g(.) = \sum_i \beta_i K(y_i,.) \end{aligned} $$

Then the inner product is defined as: $$ <f,g>_K = \sum_{i,j} \alpha_j \beta_i K(x_j,y_i) $$

Then we can define the norm:

$$ ||f||_K^2 = \sum_{i,j} \alpha_i \alpha_j K(x_i,x_j) $$

(note: the dimension of this space is infinite).
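Here is a minimal sketch that computes $<f,g>_K$ and $||f||_K^2$ directly from the expansion coefficients, using an illustrative Gaussian kernel and made-up coefficients:

```python
import numpy as np

def rbf(x, y, h=0.5):
    return np.exp(-0.5 * ((x - y) / h) ** 2)

# f(.) = sum_j alpha_j K(x_j, .),   g(.) = sum_i beta_i K(y_i, .)
x_centers, alpha = np.array([-1.0, 0.0, 2.0]), np.array([1.5, -0.7, 0.3])
y_centers, beta = np.array([0.5, 1.0]), np.array([0.4, -1.2])

# <f, g>_K = sum_{i,j} alpha_j beta_i K(x_j, y_i)
inner_fg = alpha @ rbf(x_centers[:, None], y_centers[None, :]) @ beta

# ||f||_K^2 = sum_{i,j} alpha_i alpha_j K(x_i, x_j)
norm_f_sq = alpha @ rbf(x_centers[:, None], x_centers[None, :]) @ alpha
print(inner_fg, norm_f_sq)
```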

Is $<f,g>$ well defined?

If we could write the same function in two different ways and get different values, we would have a problem. That is, if we take another representation for $f$ and another for $g$, does the inner product change? If it does, the definition is not well posed.

$$ \sum_{i,j} \alpha_j \beta_i K(x_j,y_i) $$

Remember that a reproducing kernel Hilbert space has the reproducing property:

$$ <K(x_i,.),f>_K = f(x_i) $$

Then:

$$ \begin{aligned} <f,g>_K &= \sum_{i,j} \alpha_j \beta_i <K(x_j,.),K(y_i,.)>_K \\ &= \sum_j \alpha_j <K(x_j,.),\sum_i \beta_i K(y_i,.)>_K \\ &= \sum_j \alpha_j <K(x_j,.),g(.)>_K \\ &= \sum_j \alpha_j g(x_j) \end{aligned} $$

which depends on $g$ only through its values $g(x_j)$, not on the particular $\beta_i$ and $y_i$. So the value does not depend on how we represent $g$, and the inner product is well defined. This property makes much of the machinery around Mercer kernels work: in the kernel trick, you can take an algorithm and kernelize it.
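A quick numerical check of the derivation, using the same illustrative kernel and coefficients as above: the double sum over coefficients and the expression $\sum_j \alpha_j g(x_j)$ agree.

```python
import numpy as np

def rbf(x, y, h=0.5):
    return np.exp(-0.5 * ((x - y) / h) ** 2)

x_centers, alpha = np.array([-1.0, 0.0, 2.0]), np.array([1.5, -0.7, 0.3])   # f
y_centers, beta = np.array([0.5, 1.0]), np.array([0.4, -1.2])               # g

# Double sum over the coefficients of f and g.
double_sum = alpha @ rbf(x_centers[:, None], y_centers[None, :]) @ beta

# sum_j alpha_j g(x_j): depends on g only through its values at the x_j.
g_at_xj = rbf(x_centers[:, None], y_centers[None, :]) @ beta
via_reproducing = alpha @ g_at_xj

print(np.isclose(double_sum, via_reproducing))   # True
```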

Representer Theorem: Makes regression over an RKHS work

The representer theorem concerns the following infinite-dimensional regression problem:

$$ \hat{f}=\arg \min_f ||y-f(x)||^2+\lambda||f||_K^2 $$

Here $||y-f(x)||^2$ is shorthand for $\sum_i (y_i-f(x_i))^2$ over the training data. The representer theorem says that this reduces to a finite-dimensional optimization: the minimizer has the form

$$ \hat{f} = \sum_{i=1}^{n} \alpha_i K(x_i,.) \quad \text{for some } \alpha \in \mathbb{R}^n $$

We obtain the coefficients by solving:

$$ \hat{\alpha} = \arg \min_{\alpha \in \mathbb{R}^n} || Y -K \alpha ||^2 + \lambda \alpha^TK\alpha $$

where $K =[K(x_i,x_j)]_{n\times n}$ is the kernel matrix.
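Here is a minimal sketch of this finite-dimensional problem as kernel ridge regression; setting the gradient to zero gives the closed form $\hat{\alpha} = (K + \lambda I)^{-1} Y$. The data, kernel bandwidth, and $\lambda$ below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, size=40))
y = np.sin(x) + 0.1 * rng.normal(size=40)   # noisy training data (made up)

def rbf(a, b, h=0.7):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / h) ** 2)

lam = 0.1
K = rbf(x, x)                                             # K = [K(x_i, x_j)]_{n x n}
alpha_hat = np.linalg.solve(K + lam * np.eye(len(x)), y)  # (K + lam I)^{-1} Y

def f_hat(t):
    # f_hat(t) = sum_i alpha_hat_i K(x_i, t)
    return rbf(np.atleast_1d(t), x) @ alpha_hat

print(f_hat(0.0))
```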

Why is this true?

Let’s define:

$$ V = \text{span } \Big(K(x_1,.), K(x_2,.), ..,K(x_n,.)\Big) $$

Let’s decompose the reproducing kernel Hilbert space:

$$ \mathcal{H}_K = V \oplus V^{\perp} $$

The space $\mathcal{H}_K$ is infinite dimensional, and $V^{\perp}$ is the set of functions that are orthogonal to each one of $K(x_1,.), K(x_2,.), .., K(x_n,.)$.

$$ \begin{aligned} f \in \mathcal{H}_K : f = & \; v+g \\ & v \in V \\ & g \in V^{\perp} \end{aligned} $$

Then we can write:

$$ \begin{aligned} f(x_i) = & <K(x_i,.),f> \\ = & <K(x_i,.),v+g> \\ = & <K(x_i,.),v> + <K(x_i,.),g> \end{aligned} $$

By definition, $g$ is perpendicular to all of these functions. Then:

$$ <K(x_i,.),g>=0 $$

Remark: when we evaluate $f$ at a data point $x_i$, the value does not depend on the orthogonal component $g$.

Now let’s do the same thing for the norm:

$$ \begin{aligned} ||f||_K^2 = & ||v||_K^2 + ||g||_K^2 \\ =& \alpha^T K \alpha + ||g||_K^2 \end{aligned} $$

We showed that the squared error does not depend on $g$. Now consider the penalty. The squared error is controlled entirely by $\alpha$, and for any given fit we can always reduce the penalty by shrinking $g$, since $g$ contributes $||g||_K^2$ to the penalty and nothing to the fit. Therefore the minimizer always has $g=0$.

Remark: we showed that $\arg \min ||y-f(x)||^2+\lambda||f||_K^2$ reduces to finite-dimensional ridge regression.

Now let’s get back to matrices. Let $M$ be a symmetric, positive semi-definite matrix (i.e., the $\lambda_j$ are real and $\lambda_j\geq 0$):

$$ M = U^T \Lambda U $$

where $U$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix.

$$ \Lambda = \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix} $$

Then we can write: $$ \begin{aligned} M =& U^T \Lambda^{1/2} \Lambda^{1/2} U \\ =& (\sqrt{\Lambda} U)^T(\sqrt{\Lambda} U) \\ = & \Phi^T \Phi \end{aligned} $$

with factor $\Phi = \sqrt{\Lambda}\, U \in \mathbb{R}^{n\times n}$.
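A minimal sketch of this factorization, using `numpy.linalg.eigh` (which returns $M = V \Lambda V^T$, so $U = V^T$); the small Gaussian kernel matrix below is an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=6)
M = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)    # a PSD kernel matrix

w, V = np.linalg.eigh(M)                             # M = V diag(w) V^T, w >= 0
Phi = np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T  # Phi = Lambda^{1/2} U with U = V^T

print(np.allclose(Phi.T @ Phi, M))                   # True: M = Phi^T Phi
```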

Feature representation for Mercer kernels

These factors are important in machine learning because we can take them as features: the $i$-th column $\Phi_i \in \mathbb{R}^n$ is a feature vector for the $i$-th data point. Indeed, features are the dual form of kernels. Conversely, we can take a data matrix $X$ and form $G=XX^T$; this is a Mercer kernel because it is positive semi-definite. If we take the eigenfunctions of $K$:

$$ K \phi_j = \lambda_j \phi_j, \quad \lambda_j \geq 0 $$

then we can define the feature mapping:

$$ x \mapsto \Phi(x) = \big(\sqrt{\lambda_1}\, \phi_1(x),\sqrt{\lambda_2}\,\phi_2(x),..\big) \quad \text{(infinite dimensional)} $$

Indeed, Mercer kernels give you another way of manipulating features.
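As a rough illustration, a Nyström-style approximation (an illustrative construction, not from the lecture) truncates this feature map to the leading eigenpairs of the kernel matrix; inner products of the resulting finite feature vectors approximate the kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)

def rbf(a, b, h=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / h) ** 2)

K = rbf(x, x)
w, V = np.linalg.eigh(K)
w, V = w[::-1][:10], V[:, ::-1][:, :10]   # keep the 10 leading eigenpairs

def features(t):
    # Approximates (sqrt(lambda_1) phi_1(t), ..., sqrt(lambda_10) phi_10(t)).
    return rbf(np.atleast_1d(t), x) @ V / np.sqrt(w)

a, b = np.array([0.3]), np.array([-1.2])
print((features(a) @ features(b).T)[0, 0], rbf(a, b)[0, 0])   # approximately equal
```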

This post covers material from the intermediate machine learning course SDS 365/565 at Yale University, taught by John Lafferty, where I was a TF.