Let's touch base on kernel methods

· 12 min read

Many classical models are linear in parameters, which makes optimization easy but often limits expressiveness. Kernel methods give a way to keep convex training while modeling nonlinear patterns.

The spine of this post is: ERM in linear form, feature maps for nonlinearity, and the kernel trick that avoids explicit high-dimensional computations.

Suppose data points are $(x_1,y_1)$, $(x_2,y_2)$, .., $(x_n, y_n)$. A standard ERM objective is:

$$ \min_{w\in \mathbb{R}^d} \sum_{i=1}^n L(y_i,w.x_i)+\lambda \text{Reg}(w) $$

You can optimize this with least squares and (stochastic) gradient descent. The challenge appears when the relation between $x$ and $y$ is nonlinear: how do we keep optimization manageable without giving up expressive models?

How to extend this to nonlinear models?

We can do regression (predict $y$ from data $x\in \mathbb{R}^d$) using a nonlinear function $f(x)$ of the form:

$$ f(x) = w. \phi (x) $$

for a feature map $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^M$.

This captures polynomials. Then, for one dimensional $x$ we can write $f(x)$ as:

$$ d=1 : \begin{align} f(x) = & c_0+c_1x+c_2x^2 + .. + c_px^p \\\ =& \begin{bmatrix} c_0 \\\ c_1 \\\ \vdots \\\ c_p \end{bmatrix}^T. \begin{bmatrix} 1 \\\ x \\\ \vdots \\\ x^p \end{bmatrix} \\\ =& w.\phi(x) \qquad x \in \mathbb{R} , \phi(x) \in \mathbb{R}^d \end{align} $$

For high dimensional $x$:

$$ d>1 : \\\ f(x) = c_0+\sum_{i=1}^d c_ix_i+\sum_{i,j}c_{i,j}x_ix_j + .. + \sum_{i_1, .. i_p}c_{i}x_{i_1}x_{i_2}..x_{i_p} $$

$$ \begin{align} \phi(x)=& \begin{bmatrix} 1 \\\ x_i \\\ x_ix_j \\\ \vdots \\\ x_{i_1}x_{i_2}..x_{i_p} \end{bmatrix} \in \begin{bmatrix} \mathbb{R}^1 \\\ \mathbb{R}^d \\\ \mathbb{R}^{d^2} \\\ \vdots \\\ \mathbb{R}^{d^p} \end{bmatrix} \end{align} $$

This is the key trick: the model is nonlinear in $x$ but remains linear in parameters $w$, so optimization still looks like linear regression.

Let’s go over some examples:

$$ d=1: f(x) = e^x = 1+ x + \frac{x^2}{2} + \frac{x^3}{3!} + .. = \sum_{m=0}^{\inf} \frac{x^m}{m!} $$

$$ \phi(x)= \begin{bmatrix} 1 \\\ x \\\ \vdots \\\ x^p \end{bmatrix} $$

$$ d>1: f(x) = e^{w.x} = 1+w.x + \frac{(w.x)^2}{2} + .. = \sum_{m=0}^{\inf} \frac{(w.x)^m}{m!} $$

Which is derived from Taylor expansion for the exponential function.

Can we learn $w$ for $f(x)=w.\phi(x)$ without explicitly constructing high-dimensional $\phi(x)$?

Yes. We only need inner products between feature vectors. That quantity is the kernel function:

$$ k(x,x’) = \phi(x).\phi(x’) $$

Now apply nonlinear regularized least squares with $f(x) = w.\phi(x)$ on training pairs $(x_i,y_i)$:

$$ \min_{w\in \mathbb{R}^M} \sum_{i=1}^n (y_i-w.\phi(x))^2+\frac{\lambda}{2} ||w||^2 $$

Note that this is a quadratic function in $w$:

$$ L(w) = \sum_{i=1}^n (y_i-w.\phi(x_i))^2+\frac{\lambda}{2} ||w||^2 $$

To minimize that solve: $\nabla L (w^*)=0$:

$$ \nabla L(w) = \sum_{i=1}^n (y_i-w.\phi(x_i))(-\phi (x_i))+ \lambda||w|| $$

Then we have:

$$ w^* = \frac{1}{\lambda} \sum_{i=1}^n (y_i-w^*.\phi(x_i))\phi (x_i) $$

let’s call $\alpha_i = y_i-w^*.\phi(x_i)$ a residual $\in \mathbb{R}$:

$$ w^* = \frac{1}{\lambda} \sum_{i=1}^n \alpha_i \phi (x_i) $$

So the solution $w^*$ lies in the span of training feature vectors $\phi(x_1),\phi(x_2),..,\phi(x_n)$.

How to find residuals?

$$ \begin{align} \alpha_i = & y_i - w^*.\phi(x_i) \\ =& y_i- \frac{1}{\lambda}(\sum_{j=1}^n \alpha_j \phi(x_j) ). \phi(x_i)\\ =&y_i-\frac{1}{\lambda} \sum_{j=1}^n \alpha_j \phi(x_j) . \phi(x_i) \end{align} $$

we know that $K(x_i,x_j) = K(x_j,x_i)$ :

$$ \begin{align} \alpha_i = y_i-\frac{1}{\lambda} \sum_{j=1}^n \alpha_j K(x_i,x_j) \end{align} $$

Hence:

$$ \begin{bmatrix} \alpha_1 \\\ \vdots \\\ \alpha_n \end{bmatrix} = \begin{bmatrix} y_1 \\\ \vdots \\\ y_n \end{bmatrix} - \frac{1}{\lambda} \begin{bmatrix} K(x_1,x_1) & \cdots K(x_1,x_n)\\\ \vdots & \vdots \\\ K(x_n,x_1) & \cdots K(x_n,x_n) \end{bmatrix} \begin{bmatrix} \alpha_1 \\\ \vdots \\\ \alpha_n \end{bmatrix} $$

And in a matrix form:

$$ \alpha = y - \frac{1}{\lambda} K \alpha $$ $$ \rightarrow \alpha = (\lambda I + k )^{-1}y. $$

Finally, our solution is:

$$ \begin{align} w^* = & \frac{1}{\lambda} \sum_{i=1}^n \alpha_i \phi (x_i)\\\ =&\frac{1}{\lambda} [\phi(x_1),\phi(x_2), .., \phi(x_n)]\begin{bmatrix} \alpha_1 \\\ \vdots \\\ \alpha_n \end{bmatrix} \qquad \Phi \in \mathbb{R}^{d\times n} \end{align} $$

How to predict (compute $f(x)$)

$$ \begin{align} f(x) = & w^* . \phi(x)\\\ =&\frac{1}{\lambda} (\sum_i \alpha_i \phi (x_i)). \phi(x) \\\ =&\frac{1}{\lambda} \sum_i \alpha_i K(x_i,x) \end{align} $$

This final expression is the practical payoff: prediction depends on kernel evaluations against training points, not on constructing $\phi(x)$ explicitly.

This material was covered in the introduction to machine learning CPCS 481/581, Yale University, Andre Wibisono where I (joint with Siddharth Mitra) was TF.