Representer Theorem
In the previous post, we talked about Mercer’s theorem and defined a Mercer kernel as one satisfying $\int f(x) f(y) K(x,y) dx dy \geq 0$ for any function $f$. We call a matrix $M$ positive semi-definite if $f^TMf\geq 0$ for any $f\in \mathbb{R}^n$; its eigenvalues must then be non-negative:
$$ \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n \geq 0 $$
Reminder: Eigenvectors and eigenfunctions
The eigenvectors of a matrix $M$ are obtained from: $$ Mv_j = \lambda_j v_j $$
Similarly, we can define eigenfunctions when the operator $K$ acts on a function space:
$$ Kf_j = \lambda_j f_j $$
where by definition:
$$ (Kf_j)(x)= \int K(x,y) f_j(y)dy $$
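As a concrete illustration (a minimal sketch, not from the original notes), we can approximate the eigenfunctions of such an operator numerically by discretizing it on a grid; the RBF kernel, the grid, and its spacing below are assumptions.

```python
import numpy as np

# Sketch: approximate the eigenfunctions of (Kf)(x) = \int K(x, y) f(y) dy
# by discretizing on a grid. The kernel and grid are illustrative assumptions.
grid = np.linspace(-3, 3, 200)                    # discretization of the domain
dx = grid[1] - grid[0]                            # grid spacing (quadrature weight)
K = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2)   # K(x, y) = exp(-(x-y)^2 / 2)

# The discretized operator is K * dx; it is symmetric, so use eigh.
eigvals, eigvecs = np.linalg.eigh(K * dx)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]        # sort descending

print(eigvals[:5])      # leading eigenvalues lambda_j (non-negative up to round-off)
# eigvecs[:, j] is a grid approximation of the eigenfunction f_j
```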
Let’s suppose that $K$ has the special property of depending only on the difference of its arguments, $K(x,y)=K(x-y)$. Then the integral
$$ (Kf)(x) = \int K(x-y) f(y) dy $$
has a name: convolution. We are taking a kernel and convolving it with the function $f$.
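For instance (a sketch with an assumed Gaussian kernel and an arbitrary signal), the discrete analogue of this integral is exactly what `np.convolve` computes:

```python
import numpy as np

# Sketch: discrete analogue of \int K(x - y) f(y) dy with an assumed Gaussian kernel.
x = np.linspace(-3, 3, 301)
f = np.sign(np.sin(2 * x))                       # an arbitrary example signal f(y)
kernel = np.exp(-0.5 * (x / 0.2) ** 2)           # K(u) = exp(-u^2 / (2 * 0.2^2))
kernel /= kernel.sum()                           # normalize so smoothing preserves scale

smoothed = np.convolve(f, kernel, mode="same")   # (K * f)(x): f smoothed by the kernel
```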
Hilbert Space
We define a Hilbert space of functions and then equip it with an inner product and a norm. The space consists of functions of the form $$ \Big\{ f(.) = \sum_j \alpha_j K(x_j,.) \Big\} $$
Reminder
In kernel density estimation, we place kernels centered at the data points $x_i$: $\frac{1}{h}K\big(\frac{x_i-.}{h}\big)$. In the Hilbert space above, however, there are no data points yet; the centers $x_j$ are arbitrary parameters. Furthermore, the coefficients $\alpha_j$ are arbitrary and can be positive or negative, whereas in kernel density estimation the weights are uniform, equal to $1/n$.
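To make the contrast concrete, here is a minimal kernel density estimation sketch (the sample, the bandwidth, and the Gaussian kernel are assumptions): the centers are the data and the weights are fixed at $1/n$.

```python
import numpy as np

# Sketch of kernel density estimation: centers are the data points x_i,
# weights are fixed at 1/n, and the bandwidth h is an assumed tuning parameter.
rng = np.random.default_rng(0)
data = rng.normal(size=50)          # hypothetical sample x_1, ..., x_n
h = 0.4                             # bandwidth

def kde(x, data=data, h=h):
    # (1/n) * sum_i (1/h) K((x_i - x) / h) with a Gaussian kernel K
    u = (data[:, None] - np.atleast_1d(x)[None, :]) / h
    return np.mean(np.exp(-0.5 * u ** 2) / (h * np.sqrt(2 * np.pi)), axis=0)

print(kde(np.array([0.0, 1.0])))    # estimated density at two query points
```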
Now let’s define the norm and inner product in this space:
$$ ||f||_K^2 = <f,f>_K $$
Let’s define functions:
$$ \begin{aligned} f(.) = \sum_j \alpha_j K(x_j,.) \\ g(.) = \sum_i \beta_i K(y_i,.) \end{aligned} $$
Then the inner product is defined as: $$ <f,g>_K = \sum_{i,j} \alpha_j \beta_i K(x_j,y_i) $$
Then we can define the norm:
$$ ||f||_K^2 = \sum_{i,j} \alpha_i \alpha_j K(x_i,x_j) $$
(note: the dimension of this space is infinite).
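In code (a sketch, where the RBF kernel, the centers, and the coefficients are all illustrative assumptions), the inner product and norm are just bilinear forms in the coefficient vectors:

```python
import numpy as np

# Sketch: <f, g>_K for f = sum_j alpha_j K(x_j, .) and g = sum_i beta_i K(y_i, .).
# The RBF kernel, centers, and coefficients are illustrative assumptions.
def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

x = np.array([0.0, 1.0, 2.0])       # centers x_j defining f
alpha = np.array([0.5, -1.0, 0.3])  # coefficients alpha_j
y = np.array([-1.0, 0.5])           # centers y_i defining g
beta = np.array([2.0, 1.0])         # coefficients beta_i

inner_fg = alpha @ rbf(x, y) @ beta     # <f, g>_K = sum_{i,j} alpha_j beta_i K(x_j, y_i)
norm_f_sq = alpha @ rbf(x, x) @ alpha   # ||f||_K^2 = alpha^T K alpha
```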
Is $<f,g>$ well defined?
If we can write the same function in two different ways, we might have a problem: if we pick another representation for $f$ or another one for $g$, does the inner product
$$ \sum_{i,j} \alpha_j \beta_i K(x_j,y_i) $$
change? If it does, the definition is not valid.
Remember that the kernel space has the reproducing property:
$$ <K(x,.),f>_K = f(x) $$
Then:
$$ \begin{aligned} \sum_{i,j} \alpha_j \beta_i <K(x_j,.),K(y_i,.)>_K &= \sum_j \alpha_j <K(x_j,.),\sum_i \beta_i K(y_i,.)>_K \\ &= \sum_j \alpha_j <K(x_j,.),g(.)>_K \\ &= \sum_j \alpha_j g(x_j) \end{aligned} $$
which does not depend on the particular $\beta_i$ and $y_i$, only on the values $g(x_j)$. So the inner product does not depend on how we represent $g$ (and, by symmetry, $f$), and it is well defined. This property is what makes much of the Mercer kernel machinery work. In the kernel trick, you take an algorithm and kernelize it.
Representer Theorem: Makes regression over an RKHS work
The representer theorem concerns the following infinite-dimensional regression problem:
$$ \hat{f}=\arg \min_{f \in \mathcal{H}_K} ||y-f(x)||^2+\lambda||f||_K^2 $$
The first term is just the squared error $\sum_i (y_i-f(x_i))^2$ over the training data. The representer theorem says that this reduces to a finite-dimensional optimization: the minimizer has the form
$$ \hat{f} = \sum_{i=1}^{n} \alpha_i K(x_i,.) \quad \text{for some } \alpha \in \mathbb{R}^n $$
The coefficients are obtained by solving:
$$ \hat{\alpha} = \arg \min_{\alpha \in \mathbb{R}^n} || Y -K \alpha ||^2 + \lambda \alpha^TK\alpha $$
where $K =[K(x_i,x_j)]_{n\times n} $
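Here is a minimal sketch of this finite-dimensional problem (the data, the RBF kernel, and $\lambda$ below are assumptions); since the objective is quadratic in $\alpha$, the minimizer is $\hat{\alpha} = (K + \lambda I)^{-1} Y$ when $K$ is invertible.

```python
import numpy as np

# Sketch of kernel ridge regression, reduced to n coefficients by the
# representer theorem. Data, kernel width, and lambda are assumptions.
def rbf(a, b, gamma=2.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=30))         # training inputs x_i
Y = np.sin(X) + 0.1 * rng.normal(size=30)        # noisy training targets y_i

lam = 0.1
K = rbf(X, X)                                    # Gram matrix [K(x_i, x_j)]
# Minimizer of ||Y - K a||^2 + lam * a^T K a (assuming K is invertible)
alpha_hat = np.linalg.solve(K + lam * np.eye(len(X)), Y)

f_hat = lambda x: rbf(np.atleast_1d(x), X) @ alpha_hat   # f_hat(x) = sum_i alpha_i K(x_i, x)
print(f_hat(0.5))
```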
Why is this true?
Let’s define:
$$ V = \text{span } \Big(K(x_1,.), K(x_2,.), ..,K(x_n,.)\Big) $$
The reproducing kernel Hilbert space $\mathcal{H}_K$, which is infinite dimensional, decomposes as
$$ \mathcal{H}_K = V \oplus V^{\perp} $$
where $V^{\perp}$ is the set of functions that are orthogonal to each one of $K(x_1,.), K(x_2,.), .., K(x_n,.)$.
$$ \begin{aligned} f \in \mathcal{H}_K : f = & v+g \\ & v \in V \\ & g \in V^{\perp} \end{aligned} $$
Then we can write:
$$ \begin{aligned} f(x_i) = & <K(x_i,.),f> \\ = & <K(x_i,.),v+g> \\ = & <K(x_i,.),v> + <K(x_i,.),g> \end{aligned} $$
By definition, $g$ is perpendicular to all of these functions. Then:
$$ <K(x_i,.),g>=0 $$
Remark: When we evaluate $f$ at a data point $x_i$, the value does not depend on the orthogonal component $g$.
Now let’s do the same thing for the norm:
$$ \begin{aligned} ||f||_K^2 = & ||v||_K^2 + ||g||_K^2 \\ =& \alpha^T K \alpha + ||g||_K^2 \end{aligned} $$
We showed that the squared error does not depend on $g$. Now consider the penalty. The squared error is controlled entirely by $\alpha$, and for any fixed squared error we can always reduce the penalty by shrinking $g$. Hence the minimizer always has $g=0$.
Remark: We showed that $\arg \min ||y-f(x)||^2+\lambda||f||_K^2$ reduces to a finite-dimensional ridge regression.
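To see why, set the gradient of the finite-dimensional objective to zero (assuming $K$ is invertible):
$$ \nabla_\alpha \Big( ||Y - K\alpha||^2 + \lambda \alpha^T K \alpha \Big) = -2K(Y-K\alpha) + 2\lambda K \alpha = 0 \implies \hat{\alpha} = (K+\lambda I)^{-1} Y $$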
Now let’s get back to matrices. Let $M$ be a symmetric, positive semi-definite matrix (i.e., the $\lambda_j$ are real and $\lambda_j\geq 0$):
$$ M = U^T \Lambda U $$
where $U$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix.
$$ \Lambda = \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix} $$
Then we can write: $$ \begin{aligned} M =& U^T \Lambda^{1/2} \Lambda^{1/2} U \\ =& (\sqrt{\Lambda} U)^T(\sqrt{\Lambda} U) \\ = & \Phi^T \Phi \end{aligned} $$
with factor $\Phi = \sqrt{\Lambda}\, U \in \mathbb{R}^{n\times n}$.
Feature representation for Mercer kernels
These factors are important in machine learning because we can use them as features: the $i$-th column of $\Phi$ is a feature vector $\Phi_i \in \mathbb{R}^n$ for the $i$-th data point. Indeed, features are the dual form of kernels. For example, we can take a data matrix $X$ and form $G=XX^T$; this is a Mercer kernel because it is positive semi-definite. If we take the eigenfunctions of $K$:
$$ K \Phi_j = \lambda_j \Phi_j, \quad \lambda_j \geq 0 $$
Then we can define the feature mapping:
$$ x \rightarrow \Phi(x) = (\sqrt{\lambda_1} \Phi_1(x),\sqrt{\lambda_2}\Phi_2(x),..) \quad \text{(infinite dimensional)} $$
Indeed, Mercer’s kernel gives you another way of manipulating features.
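As a sketch of this feature view (the data and the RBF kernel below are assumptions), we can eigendecompose a Gram matrix and take the scaled eigenvectors as feature vectors; their inner products recover the kernel values.

```python
import numpy as np

# Sketch: recover feature vectors Phi_i from a Gram matrix G = [K(x_i, x_j)]
# so that <Phi_i, Phi_j> = K(x_i, x_j). Data and kernel are assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=20)
G = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)   # Mercer (RBF) Gram matrix

eigvals, U = np.linalg.eigh(G)                      # G = U diag(eigvals) U^T
eigvals = np.clip(eigvals, 0, None)                 # guard against tiny negative round-off
Phi = np.sqrt(eigvals)[:, None] * U.T               # Phi = Lambda^{1/2} U^T; column i is Phi_i

print(np.allclose(Phi.T @ Phi, G))                  # Phi^T Phi reconstructs G
```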
This post is based on material from the intermediate machine learning course SDS 365/565 at Yale University, taught by John Lafferty, where I was a TF.