Representer Theorem
The representer theorem explains a key miracle in kernel methods: an optimization problem over an infinite-dimensional function space can collapse to a finite problem in terms of training points.
We will build that claim step by step, starting from Mercer kernels and RKHS geometry, then showing exactly why the solution lives in the span of kernel evaluations at the data.
From Mercer’s theorem, a kernel satisfies $\int f(x) f(y) k(x,y) dx dy \geq 0$ for any function $f$. We call a matrix $M$ positive semi-definite if $f^TMf\geq 0$ for any $f\in \mathbb{R}^n$. Then eigenvalues must be non-negative:
$$ \lambda_1 \geq \lambda_2 \geq .. \geq \lambda_n \geq 0 $$
Reminder: Eigenvectors and eigenfunctions
The eigenvector of a matrix $M$ is obtained by: $$ Mv_j = \lambda v_j $$
Similarly, we can define eigenfunction when $M$ is limited in function space:
$$ Kf = \lambda_j f $$
Where by definition:
$$ (Kf)(x)= \int K(x,y) f_j(y)dy $$
Let’s suppose that $K$ has this magical property:
$$ \int K(x-y) f(y) dy $$
Then there is a name for this integral: convolution. We are taking a kernel and then convolving that kernel concerning function $f$.
Hilbert Space
We define Hilbert’s space of functions and then determine the inner product and norm in that space of functions. $$ { \sum \alpha_j K(x_j,.) =f(.) } $$
Reminder
In kernel density estimation, we define kernels centered at data points $x_i$: $1/hK(\frac{x_i-.}{h})$. But, in Hilbert space, there are no data points yet (i.e., $x_i$), and we instead have parameter $x_j$. Furthermore, those $\alpha_j$ are arbitrary and could be either positive or negative. On the other hand, in kernel density estimation, those $\alpha_j$ are uniformly distributed like $1/n$.
Now let’s define norms and inner products in this space:
$$ ||f||_K = <f,f>_K $$
Let’s define functions:
$$ \begin{aligned} f(.) = \sum_j \alpha_j K(x_j,.) \\ g(.) = \sum_i \beta_i K(x_i,.) \end{aligned} $$
Then inner product is defined: $$ <f,g> = \sum_{i,j} \alpha_j \beta_i K(x_j,y_i) $$
Then we can define the norm:
$$ ||f||_K^2 = \sum \alpha_i \alpha_j K(x_i,x_j) $$
(note: the dimension of this space is infinite).
Is $<f,g>$ well defined?
If we define the same function in two different ways, we will have a problem. If we have another representation for $f$ and another one for $g$, does the inner product change? If it does, we will have a problem.
$$ \sum_{i,j} \alpha_j \beta_i K(x_j,x_i) $$
Remember that kernel space has the reproducing property:
$$ <K(x_i,.),f> = f(x) $$
Then:
$$ \begin{aligned} & \sum_{i,j} \alpha_j \beta_i <K(x_j,.),K(x_i,.)> \\ &= \sum_j \beta_j <K(x_i,.),\sum_i \beta_i K(y_i,.)>_K \\ &= \sum_j \beta_j <K(x_j,.),g(.)>_K \\ &= \sum_j \beta_j g(x_j) \end{aligned} $$
which doesn’t depend on $\beta_i$ and $y_i$. Then it doesn’t depend on how I represent $g$. Then this is well defined. This property does a lot of stuff work with the Mercer kernel. In kernel trick, you can take an algorithm and then Kernelize it.
Representer Theorem: Makes regression over RKHS work
What the representer theorem says the following is infinite dimensional regression:
$$ \begin{aligned} \hat{f}=\arg \min ||y-f(x)||^2+\lambda||f||_K^2 \\ \end{aligned} $$
This is basically minimizing $\sum (y_i-f(x_i))^2$ over training data. Then representer theorem also says that the following is a finite dimensional optimization:
$$ \hat{f} = \sum_i^{n} \alpha_i K(x_i,.) \quad \text{for some } \alpha_i \in \mathbb{R}^n $$
By solving this:
$$ \hat{\alpha} = \arg \min_{\alpha \in \mathbb{R}^n} || Y -K \alpha ||^2 + \alpha^TK\alpha $$
where $K =[K(x_i,x_j)]_{n\times n} $
Why is this true?
Let’s define:
$$ V = \text{span } \Big(K(x_1,.), K(x_2,.), ..,K(x_n,.)\Big) $$
Let’s define reproducing kernel Hilbert space:
$$ \mathcal{H}_K = V \oplus V^{\perp} $$
is infinite-dimensional. $ V^{\perp}$ is the set of functions that are orthogonal to each one of $K(x_1,.), K(x_2,.), .., K(x_n,.)$.
$$ \begin{aligned} f \in \mathcal{H} : f = & v+g \\ & v \in V \\ & g \in V^{\perp} \end{aligned} $$
Then we can write:
$$ \begin{aligned} f(x_i) = & <K(x_i,.),f> \\ = & <K(x_i,.),v+g> \\ = & <K(x_i,.),v> + <K(x_i,.),g> \end{aligned} $$
By definition, $g$ is perpendicular to all of these functions. Then:
$$ <K(x_i,.),g>=0 $$
Remark: When I take a datapoint $x_i$ and evaluate it on $f$, it doesn’t depend on what is on orthogonal complement $g$.
Now let’s do the same thing for the norm:
$$ \begin{aligned} ||f||_K^2 = & ||v||_K^2 + ||g||_K^2 \\ =& \sum \alpha^T K \alpha \end{aligned} $$
We showed that squared error doesn’t depend on $g$. Now it’s time for the penalty part. To minimize squared error, we can always adjust $\alpha$. But I can always reduce the penalty by taking $g$ small and smaller for a given squared error. Then minimizer has always $g=0$
Remark: We showed that $\arg \min ||y-f(x)||^2+\lambda||f||_K^2$ is finite-dimensional ridge regression.
Now let’s get back to matrices. Let’s define $M$ is symmetric and positive definite matrix (i.e., $\lambda_j$s are real and $\lambda_j\geq 0$):
$$ M = U^T \Lambda U $$
where $U$ is a orthogonal matrix and $\Lambda$ is a diagonal matrix.
$$ \Lambda = \begin{bmatrix} \lambda_1 & .. & \\ \vdots &\ddots& \vdots \\ & .. &\lambda_n \end{bmatrix} $$
Then we can write: $$ \begin{aligned} M =& U^T \Lambda^{1/2} \Lambda^{1/2} U \\ =& (\sqrt{\Lambda} U)^T(\sqrt{\Lambda} U) \\ = & \Phi^T \Phi \end{aligned} $$
factors $\Phi \in \mathbb{R}^n$.
Feature representation for Mercer kernels
These factors are critical in machine learning because we can take them as features. For example, feature vector $\Phi_i \in \mathbb{R}^n$ for $i-$th data point. Indeed, features are dual forms for kernels. For example, we can take $X$ and make $G=XX^T$. We can show that it’s a Mercer’s kernel because it’s positive and semi-definite. If I take Eigen functions of k:
$$ K \Phi_j = \lambda_j \Phi_j, \quad \lambda_j \geq 0 $$
Then we can define feature mapping:
$$ X \rightarrow \Phi(x) : (\sqrt{\lambda} \Phi_1,\sqrt{\lambda_2}\Phi_2,..) \quad \text{infinite-dimensional} $$
Indeed, Mercer’s kernel gives you another way of manipulating features.
We covered this post in the intermediate machine learning SDS 365/565, Yale University, John Lafferty, where I was TF.