Neural Tangent Kernels

Let’s start with a simple regression problem. Assume we have a dataset of $n$ points $\{(x_i,y_i)\}_{i=1}^n$ where $y_i \in \mathbb{R}$ and $x_i \in \mathbb{R}^d$:

$$ f(W,x) = W^T x $$

is a simple linear model that you can define for your data points. Then a simple loss function could be:

$$ \arg \min_W \mathcal{L}(W) = \frac{1}{2} \sum_{i=1}^n (y_i - f(W,x_i))^2 $$

where $\hat{y}_i = f(W,x_i)$ is our prediction. This is a simple convex optimization problem, and each gradient descent step is:

$$ \begin{aligned} W(t+1) = & W(t) - \eta_t \nabla_W \mathcal{L}(W(t)) \\ = & W(t) + \eta_t \sum_{i=1}^n (y_i-f(W(t),x_i))\nabla_W f(W(t),x_i) \\ =& W(t) + \eta_t \sum_{i=1}^n (y_i-f(W(t),x_i))\,x_i \end{aligned} $$
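As a sanity check, here is a minimal NumPy sketch of this update rule on synthetic data; the data, the ground-truth weights `w_true`, and the step size `eta` are arbitrary choices made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))           # rows are the data points x_i
w_true = rng.normal(size=d)           # hypothetical ground-truth weights
y = X @ w_true                        # noiseless targets y_i = w_true^T x_i

W = np.zeros(d)                       # initialization
eta = 0.005                           # step size, chosen by hand
for t in range(1000):
    resid = y - X @ W                 # y_i - f(W, x_i)
    W = W + eta * X.T @ resid         # W(t+1) = W(t) + eta * sum_i (y_i - f(W, x_i)) x_i

print(np.linalg.norm(W - w_true))     # close to 0: gradient descent recovers w_true
```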

Kernel Method:

We previously introduced kernel methods here. The idea: instead of representing our data points as $x_i \in \mathbb{R}^d$, we represent them in a nonlinear transformation space, possibly of much larger dimension, $\phi(x_i) \in \mathbb{R}^D$ with $D \gg d$.

Example: Polynomial kernels

Assume: $$ \phi(x)= \begin{bmatrix} 1 \\ x_i \\ x_i x_j \\ \vdots \\ x_{i_1}x_{i_2}\cdots x_{i_p} \end{bmatrix} \in \begin{bmatrix} \mathbb{R} \\ \mathbb{R}^d \\ \mathbb{R}^{d^2} \\ \vdots \\ \mathbb{R}^{d^p} \end{bmatrix} $$

Then, our model would be:

$$ f(W,x) = W^T \phi(x) $$

  1. Question: Is this model linear in $W$? answer: Yes
  2. Question: Is this model linear in $x$? answer: No

$$ \arg \min_W \mathcal{L}(W) = \frac{1}{2} \sum_{i=1}^n (y_i - W^T\phi(x_i))^2 $$

But here we have two issues: 1. $\phi(\cdot)$ is fixed, and 2. we suffer from the curse of dimensionality, since $\phi(x) \in \mathbb{R}^D$ with $D \gg d$, e.g. $D = \mathcal{O}(d^p)$ for a polynomial of order $p$.

Kernel Trick:

In most cases we don’t actually need to compute $\phi(x)$ explicitly. Instead, we only need inner products:

$$ K(x_i,x_j) = \langle\phi(x_i),\phi(x_j)\rangle $$

Then the kernel matrix $K\in \mathbb{R}^{n\times n}$ represents a measure of similarity between data points. The kernel matrix is symmetric and positive semi-definite. In many cases we can compute $K(x_i,x_j)$ without ever computing $\phi(x_i)$ explicitly.
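To make this concrete, here is a small sketch for a degree-2 polynomial feature map (all pairwise products $x_j x_k$, one illustrative choice of $\phi$): the $D = d^2$-dimensional inner product $\langle\phi(x),\phi(x')\rangle$ equals $(x\cdot x')^2$, which needs only a $d$-dimensional inner product.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
x, xp = rng.normal(size=d), rng.normal(size=d)

# Explicit degree-2 feature map: all products x_j * x_k, so D = d^2 features.
phi  = np.outer(x, x).ravel()
phip = np.outer(xp, xp).ravel()

lhs = phi @ phip              # <phi(x), phi(x')>, computed in R^{d^2}
rhs = (x @ xp) ** 2           # kernel trick: only a d-dimensional inner product
print(np.allclose(lhs, rhs))  # True
```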

Neural Networks:

Let’s define a simple neural network with two layers:

$$ y = f(W,x) = \textcolor{red}{\frac{1}{\sqrt{m}}}\sum_{i=1}^m b_i \sigma(a_i'x) $$

The scaling factor $\textcolor{red}{\frac{1}{\sqrt{m}}}$ will turn out to have some almost magical consequences that we come back to later.

$$ \mathcal{L}(W) = \frac{1}{2} \sum_{i=1}^n ( f(W,x_i)-y_i)^2 $$
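A minimal NumPy sketch of this model and loss, with ReLU as $\sigma$ and the $\textcolor{red}{1/\sqrt{m}}$ scaling; the width, data, and labels here are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 50, 3, 1000
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# W = (A, b): the a_i are the rows of A, the b_i are the entries of b, all N(0, 1).
A = rng.normal(size=(m, d))
b = rng.normal(size=m)

relu = lambda z: np.maximum(z, 0.0)

def f(A, b, X):
    # f(W, x) = (1 / sqrt(m)) * sum_i b_i * sigma(a_i' x), for every row x of X
    return relu(X @ A.T) @ b / np.sqrt(m)

def loss(A, b, X, y):
    return 0.5 * np.sum((f(A, b, X) - y) ** 2)

print(loss(A, b, X, y))
```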

$$ \begin{aligned} W(t+1) = & W(t) - \eta_t \nabla_W \mathcal{L}(W(t)) \\ = & W(t) - \eta_t \sum_{i=1}^n (f(W(t),x_i)-y_i)\nabla_W f(W(t),x_i) \end{aligned} $$

In the linear model, $\nabla_W f(W(t),x_i)=x_i$ was static: it did not change during training. For the neural network, let’s initialize the weights with $\mathcal{N}(0,1)$ entries and run gradient descent:

$$ W(0) \rightarrow W(1) \rightarrow W(2) \rightarrow \cdots $$

Empirical observation: when $m$ (the width of the network) is large, the gradients $\nabla_W f(W(t),x_i)$ are almost static along the trajectory of gradient descent; the weights barely move from their initialization.
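One way to see this empirically (a sketch, not a careful experiment): train the two-layer network above with gradient descent at several widths, using the per-parameter gradients of $f_m$ written out further below together with the chain rule, and track how far the weights move from their initialization. The widths, step size, and number of steps are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
relu  = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z > 0).astype(float)

def relative_weight_movement(m, steps=500, eta=0.01):
    A0, b0 = rng.normal(size=(m, d)), rng.normal(size=m)
    A, b = A0.copy(), b0.copy()
    for _ in range(steps):
        pre  = X @ A.T                                    # pre[j, i] = a_i' x_j
        yhat = relu(pre) @ b / np.sqrt(m)                 # f(W, x_j)
        r    = yhat - y                                   # residuals f(W, x_j) - y_j
        gA   = ((drelu(pre) * b).T * r) @ X / np.sqrt(m)  # dL/da_i, stacked as rows
        gb   = relu(pre).T @ r / np.sqrt(m)               # dL/db_i
        A -= eta * gA
        b -= eta * gb
    return np.linalg.norm(A - A0) / np.linalg.norm(A0)

for m in [10, 100, 1000, 10000]:
    print(m, relative_weight_movement(m))                 # shrinks as m grows
```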


In machine learning this phenomenon is called “Lazy Training”.

First-Order Taylor Approximation:

Because the changes in $W(t)$ are small, we can take a first-order Taylor approximation around the initialization $W_0$:

$$ f(W,x) \approx f(W_0,x) + \nabla_W f(W_0,x)^T(W-W_0) $$

  1. Question: Is this approximation linear in $W$? answer: Yes
  2. Question: Is this approximation linear in $x$? answer: No

Let’s call:

$$ \phi(x) = \nabla_W f(W_0,x) $$

a nonlinear transformation of $x$ (instead of the polynomial transformation). Then,

$$ K(x_i,x_j) = \langle\phi(x_i),\phi(x_j)\rangle $$

is also called the Neural Tangent Kernel (NTK). Here the transformation comes from the gradient of the neural network evaluated at $W_0$, so it is a fixed transformation that does not change along the gradient descent trajectory. Everything is consistent with the kernel setting above: the optimization is convex in $W$ and $\phi(\cdot)$ is fixed. Let’s compute this kernel for our two-layer network:

$$ f_m(W,x) = \textcolor{red}{\frac{1}{\sqrt{m}}}\sum_{i=1}^m b_i \sigma(a_i'x) $$

Then: $$ \begin{align} \nabla_{a_i}f_m(W,x) = \textcolor{red}{\frac{1}{\sqrt{m}}}b_i \sigma'(a_i'x)\,x \\ \nabla_{b_i}f_m(W,x) = \textcolor{red}{\frac{1}{\sqrt{m}}}\sigma (a_i'x) \end{align} $$

Therefore:

$$ K_m(x,x') = K_m^a(x,x') + K_m^b(x,x') $$

where:

$$ \begin{align} K_m^a(x,x') = \textcolor{red}{\frac{1}{m}}\sum_{i=1}^m b_i^2 \sigma'(a_i'x)\sigma'(a_i'x')\,(x\cdot x')\\ K_m^b(x,x') = \textcolor{red}{\frac{1}{m}}\sum_{i=1}^m\sigma (a_i'x) \sigma (a_i'x') \end{align} $$

where $x\cdot x'$ is the inner product between the two data points.
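A sketch that evaluates this empirical kernel $K_m = K_m^a + K_m^b$ directly from the two sums above, with ReLU as $\sigma$ and Gaussian initialization (the width and the data points are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 3, 5000
x, xp = rng.normal(size=d), rng.normal(size=d)

a = rng.normal(size=(m, d))   # rows are a_i ~ N(0, I_d)
b = rng.normal(size=m)        # b_i ~ N(0, 1)

relu  = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z > 0).astype(float)

def K_m(x, xp):
    s, sp = a @ x, a @ xp                                 # a_i' x and a_i' x'
    Ka = np.mean(b**2 * drelu(s) * drelu(sp)) * (x @ xp)  # (1/m) sum_i b_i^2 sigma'(a_i'x) sigma'(a_i'x') (x.x')
    Kb = np.mean(relu(s) * relu(sp))                      # (1/m) sum_i sigma(a_i'x) sigma(a_i'x')
    return Ka + Kb

print(K_m(x, xp))
```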

Law of large numbers:

If the $a_i$s and $b_i$s are independent samples at initialization, then by the law of large numbers, as $m\rightarrow \infty$:

$$ \begin{align} K_m^a(x,x') \xrightarrow{m\rightarrow \infty} K^a(x,x') = \mathbb{E}\big[b^2\sigma'(a'x)\sigma'(a'x')(x\cdot x')\big] \\ K_m^b(x,x') \xrightarrow{m\rightarrow \infty} K^b(x,x') = \mathbb{E}\big[\sigma(a'x)\sigma(a'x')\big] \end{align} $$

The interesting point is that these kernels $K^a(x,x')$ and $K^b(x,x')$ make up the NTK in the infinite-width limit.

If $\sigma(\cdot)$ is ReLU and the distribution of the $a_i$ is rotation invariant, e.g. Gaussian, these expectations have closed forms:

$$ K^a(x,x') = \frac{(x\cdot x')\,\mathbb{E}[b^2]}{2\pi}(\pi -\theta(x,x')) $$

where $\theta(x,x') \in [0,\pi]$ is the angle between the two data points $x$ and $x'$. Similarly: $$ K^b(x,x') = \frac{\|x\|\,\|x'\|\,\mathbb{E}[\|a\|^2]}{2\pi d}\Big((\pi -\theta(x,x'))\cos \theta + \sin \theta\Big) $$
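These closed forms can be checked against a wide Monte Carlo estimate; a sketch assuming $a_i \sim \mathcal{N}(0, I_d)$ and $b_i \sim \mathcal{N}(0,1)$, so that $\mathbb{E}[b^2] = 1$ and $\mathbb{E}[\|a\|^2] = d$:

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 3, 200_000
x, xp = rng.normal(size=d), rng.normal(size=d)
theta = np.arccos(x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp)))

# Closed forms (with E[b^2] = 1 and E[||a||^2] = d for this initialization).
Ka = (x @ xp) / (2 * np.pi) * (np.pi - theta)
Kb = (np.linalg.norm(x) * np.linalg.norm(xp) / (2 * np.pi)
      * ((np.pi - theta) * np.cos(theta) + np.sin(theta)))

# Monte Carlo estimates of K^a and K^b using a width-m layer.
a, b = rng.normal(size=(m, d)), rng.normal(size=m)
s, sp = a @ x, a @ xp
Ka_mc = np.mean(b**2 * (s > 0) * (sp > 0)) * (x @ xp)
Kb_mc = np.mean(np.maximum(s, 0) * np.maximum(sp, 0))

print(Ka, Ka_mc)   # should roughly agree
print(Kb, Kb_mc)
```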

  1. Question: When is the Taylor expansion a good approximation? answer: When the Hessian has bounded eigenvalues (the Hessian is nearly constant). We showed this for the one-hidden-layer neural network.

Analyzing Gradient Descent:

When $\eta \rightarrow 0$, gradient descent becomes a gradient flow. Start from the update:

$$ W(t+1) = W(t) - \eta \nabla_W\mathcal{L}(W(t)) $$

Then:

$$ \frac{W(t+1) - W(t)}{\eta} = - \nabla_W\mathcal{L}(W(t)) $$

and as $\eta \rightarrow 0$ this becomes the gradient flow: $$ \frac{\partial W(t)}{\partial t} = - \nabla_W\mathcal{L}(W(t)) $$

Recall that:

$$ \mathcal{L}(W) = \frac{1}{2} \|\hat{y}(W)-y\|^2 , \qquad y\in \mathbb{R}^n , \ \hat{y}(W) \in \mathbb{R}^n $$ where $\hat{y}(W)$ stacks the predictions $f(W,x_i)$ and $y$ stacks the labels $y_i$.

Then: $$ \nabla_W\mathcal{L}(W) = \nabla \hat{y}(W)\, (\hat{y}(W)-y) $$ where $\nabla \hat{y}(W)$ is the matrix whose columns are the gradients $\nabla_W f(W,x_i)$.

Therefore: $$ \frac{\partial W(t)}{\partial t} = - \nabla \hat{y}(W) (\hat{y}(W)-y) $$

This describes the dynamics of the weights in parameter space. What about the dynamics of the output?

$$ \begin{align} \frac{\partial \hat{y}(W(t))}{\partial t} = & \nabla \hat{y}(W)^T \frac{\partial W(t)}{\partial t} \\ = & - \nabla \hat{y}(W)^T \nabla \hat{y}(W) (\hat{y}(W)-y) \\ \approx & - K(W_0) (\hat{y}(W)-y) \end{align} $$ where the last step uses lazy training: $\nabla \hat{y}(W(t)) \approx \nabla \hat{y}(W_0)$, so the Gram matrix $\nabla \hat{y}(W)^T \nabla \hat{y}(W)$ is approximately the NTK matrix $K(W_0)$ evaluated at initialization.

To simplify notation, let $u = \hat{y}-y$. Then:

$$ \frac{\partial u}{\partial t} = - K(W_0) u $$

What is the solution to this differential equation?

$$ u(t) = \exp(-K(W_0)t)\, u(0) $$

In the overparameterized case, $K(W_0)$ is positive definite and has the eigendecomposition:

$$ K(W_0) = \sum_{i=1}^n \lambda_i v_i v_i^T , \qquad 0 < \lambda_1< \cdots <\lambda_n $$

Then:

$$ u(t) = \exp(-K(W_0)t)\,u(0) = \sum_{i=1}^n e^{-\lambda_i t}\, (v_i^T u(0))\, v_i $$

Therefore, the minimum eigenvalue $\lambda_1$ determines the rate of convergence: the component of the error along $v_1$ decays the slowest.
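A small sketch of this spectral picture, using an arbitrary positive semi-definite matrix in place of a specific NTK: each eigencomponent of $u(0)$ decays at its own rate $e^{-\lambda_i t}$, so the norm of $u(t)$ is eventually governed by the smallest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20
G = rng.normal(size=(n, 2 * n))
K = G @ G.T / (2 * n)                 # an arbitrary PSD stand-in for K(W_0)
u0 = rng.normal(size=n)               # initial residual u(0) = y_hat(0) - y

lam, V = np.linalg.eigh(K)            # K = sum_i lam_i v_i v_i^T

def u(t):
    # u(t) = exp(-K t) u(0) = sum_i exp(-lam_i t) (v_i^T u(0)) v_i
    return V @ (np.exp(-lam * t) * (V.T @ u0))

for t in [0.0, 1.0, 10.0, 100.0]:
    print(t, np.linalg.norm(u(t)))    # decays at a rate set by the smallest lam_i
```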

This post is based on material we covered in the intermediate machine learning course SDS 365/565 at Yale University, taught by John Lafferty, where I was a TF. Some of the notation is also borrowed from Soheil Feizi’s course on neural tangent kernels.