# Neural Tangent Kernels

Let’s start with a simple regression method. Assume we have a dataset of $n$ points $\{(x_i,y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Then

$$ f(W,x) = W^T x $$

is a simple linear model for these data points, and a simple loss function is:

$$ \arg \min_W \mathcal{L}(W) = \frac{1}{2} \sum_{i=1}^n (y_i - f(W,x_i))^2 $$

where $\hat{y}_i = f(W,x_i)$ is our prediction. This is a simple convex optimization problem, where each gradient descent step is defined by:

$$ \begin{aligned} W(t+1) = & W(t) - \eta_t \nabla \mathcal{L}(W(t)) \\ = & W(t) - \eta_t \sum_{i=1}^n (f(W(t),x_i)-y_i)\nabla_W f(W(t),x_i) \\ = & W(t) - \eta_t \sum_{i=1}^n (f(W(t),x_i)-y_i)\,x_i \end{aligned} $$
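The update above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data; the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

# Full-batch gradient descent on the squared loss
# L(W) = 1/2 * sum_i (y_i - W^T x_i)^2, with synthetic data.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
W_true = rng.normal(size=d)
y = X @ W_true                      # noiseless targets for illustration

W = np.zeros(d)
eta = 0.01                          # step size (arbitrary, small enough to converge)
for _ in range(2000):
    residual = X @ W - y            # f(W, x_i) - y_i for all i
    W -= eta * X.T @ residual       # gradient step: grad L = X^T (X W - y)

print(np.allclose(W, W_true, atol=1e-4))
```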

## Kernel Method:

We previously introduced kernel methods here. Instead of representing our data points as $x_i \in \mathbb{R}^d$, we represent them in a nonlinear transformation space, possibly of much larger dimension: $\phi(x_i) \in \mathbb{R}^D$ with $D \gg d$.

### Example: Polynomial kernels

Assume: $$ \phi(x)= \begin{bmatrix} 1 \\ x_i \\ x_ix_j \\ \vdots \\ x_{i_1}x_{i_2}\cdots x_{i_p} \end{bmatrix} \in \begin{bmatrix} \mathbb{R}^1 \\ \mathbb{R}^d \\ \mathbb{R}^{d^2} \\ \vdots \\ \mathbb{R}^{d^p} \end{bmatrix} $$

Then, our model would be:

$$ f(W,x) = W^T \phi(x) $$

**Question:** Is this model linear in $W$? **Answer:** Yes. **Question:** Is this model linear in $x$? **Answer:** No.

$$ \arg \min_w \mathcal{L}(W) = \frac{1}{2} \sum_{i=1}^n (y_i - W^T\phi(x_i))^2 $$

But here we face two issues: 1. $\phi(\cdot)$ is fixed, and 2. we suffer from the curse of dimensionality, since $\phi(x) \in \mathbb{R}^D$ with $D = \mathcal{O}(d^p)$ for a polynomial of order $p$.

### Kernel Trick:

In most cases we don’t actually need to compute $\phi(x)$ explicitly. Instead, we only need the inner products:

$$ K(x_i,x_j) = \langle \phi(x_i),\phi(x_j)\rangle $$

Then the kernel matrix $K\in \mathbb{R}^{n\times n}$ represents a measure of similarity between data points. The kernel matrix is symmetric and positive semidefinite. In many cases, $K(x_i,x_j)$ can be computed without ever forming $\phi(x_i)$ explicitly.
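As a concrete instance of the kernel trick, consider the degree-2 polynomial kernel in $d=2$. With $\sqrt{2}$ weights on the cross terms of the feature map (chosen so the identity works out exactly), the inner product of explicit features matches $(1 + x \cdot z)^2$ computed directly, with no $D$-dimensional features formed:

```python
import numpy as np

# Explicit degree-2 polynomial feature map in d=2. The sqrt(2) weights
# make <phi(x), phi(z)> = (1 + x.z)^2 hold exactly.
def phi(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, z):
    # Kernel evaluated WITHOUT forming the feature vectors.
    return (1.0 + x @ z) ** 2

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])
print(np.isclose(phi(x) @ phi(z), k(x, z)))   # prints True
```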

## Neural Networks:

Let’s define a simple neural network with two layers:

$$ y = f(W,x) = \textcolor{red}{\frac{1}{\sqrt{m}}}\sum_{i=1}^m b_i \sigma(a_i^T x) $$

The scaling factor $\textcolor{red}{\frac{1}{\sqrt{m}}}$ has some almost magical consequences that we will return to later.

$$ \mathcal{L}(W) = \frac{1}{2} \sum_{i=1}^n ( f(W,x_i)-y_i)^2 $$

$$ \begin{aligned} W(t+1) = & W(t) - \eta_t \nabla \mathcal{L}(W(t)) \\ = & W(t) - \eta_t \sum_{i=1}^n (f(W(t),x_i)-y_i)\nabla_W f(W(t),x_i) \end{aligned} $$

In the linear model, $\nabla_W f(W(t),x_i)=x_i$ was static: it did not change during training. Now let’s initialize the weights with $\mathcal{N}(0,1)$ and run gradient descent:

$$ W(0) \rightarrow W(1) \rightarrow W(2) .. $$

**Empirical observation**: When $m$ (the width of the network) is large, the gradients $\nabla_W f(W(t),x_i)$ along the trajectory of gradient descent are almost static. In machine learning this phenomenon is called “lazy training”.
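A rough numerical illustration of lazy training, as a sketch with arbitrary widths, step size, and synthetic data (only the first layer is trained, to keep the gradient computation short):

```python
import numpy as np

def relative_change(m, steps=1000, eta=0.005, seed=0):
    """Train the first layer of f(W,x) = (1/sqrt(m)) sum_i b_i relu(a_i.x)
    by full-batch gradient descent; return ||A - A0|| / ||A0||."""
    rng = np.random.default_rng(seed)
    n, d = 10, 5
    X = rng.normal(size=(n, d))        # synthetic inputs
    y = rng.normal(size=n)             # synthetic targets
    A = rng.normal(size=(m, d))        # first-layer weights a_i (rows)
    b = rng.normal(size=m)             # second-layer weights b_i (frozen)
    A0 = A.copy()
    for _ in range(steps):
        pre = A @ X.T                                  # a_i . x_j, shape (m, n)
        out = (b @ np.maximum(pre, 0)) / np.sqrt(m)    # network outputs f(W, x_j)
        r = out - y                                    # residuals
        # gradient of the squared loss w.r.t. A
        grad_A = ((b[:, None] * (pre > 0)) * r[None, :]) @ X / np.sqrt(m)
        A -= eta * grad_A
    return np.linalg.norm(A - A0) / np.linalg.norm(A0)

# The wider network moves relatively less from its initialization:
print(relative_change(10) > relative_change(1000))
```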

## First-Order Taylor Approximation:

Because the changes in $W(t)$ are small, we can use a first-order Taylor approximation around the initialization $W_0$:

$$ f(W,x) \approx f(W_0,x) + \nabla_W f(W_0,x)^T(W-W_0) $$

**Question:** Is this approximation linear in $W$? **Answer:** Yes. **Question:** Is it linear in $x$? **Answer:** No.

Let’s call

$$ \phi(x) = \nabla_W f(W_0,x) $$

a nonlinear transformation of $x$ (in place of the polynomial transformation). Then,

$$ K(x_i,x_j) = \langle \phi(x_i),\phi(x_j)\rangle $$

is called the **Neural Tangent Kernel** (NTK).
Here the transformation comes from the neural network evaluated at the initialization $W_0$,
so it is a fixed transformation that does not change along the gradient descent trajectory.
Everything is therefore consistent with the kernel setting above: the optimization is convex and $\phi(\cdot)$ is fixed.

$$ f_m(W,x) = \textcolor{red}{\frac{1}{\sqrt{m}}}\sum_{i=1}^m b_i \sigma(a_i^T x) $$

Then: $$ \begin{align} \nabla_{a_i}f_m(W,x) = & \textcolor{red}{\frac{1}{\sqrt{m}}}b_i \sigma'(a_i^T x)\,x \\ \nabla_{b_i}f_m(W,x) = & \textcolor{red}{\frac{1}{\sqrt{m}}}\sigma(a_i^T x) \end{align} $$

Therefore:

$$ K_m(x,x') = K_m^a(x,x') + K_m^b(x,x') $$

where:

$$ \begin{align} K_m^a(x,x') = & \textcolor{red}{\frac{1}{m}}\sum_{i=1}^m b_i^2\, \sigma'(a_i^T x)\,\sigma'(a_i^T x')\,(x \cdot x') \\ K_m^b(x,x') = & \textcolor{red}{\frac{1}{m}}\sum_{i=1}^m \sigma(a_i^T x)\, \sigma(a_i^T x') \end{align} $$

where $x \cdot x'$ is the inner product between the two data points.
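These gradient features can be formed explicitly for a finite-width network. Here is a minimal sketch, taking $\sigma$ to be ReLU for concreteness and choosing sizes and initialization arbitrarily: we build $\phi(x) = \nabla_W f(W_0,x)$ from the two gradient blocks above and check that the resulting Gram matrix decomposes as $K_m = K_m^a + K_m^b$ and is a valid (symmetric, positive semidefinite) kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n = 200, 4, 6
A = rng.normal(size=(m, d))            # a_i, rows of A
b = rng.normal(size=m)                 # b_i
X = rng.normal(size=(n, d))            # n data points

def grads(x):
    """Gradients of f_m at initialization: (grad wrt A flattened, grad wrt b)."""
    pre = A @ x                                         # a_i . x
    gA = (b * (pre > 0))[:, None] * x / np.sqrt(m)      # (1/sqrt(m)) b_i sigma'(a_i.x) x
    gb = np.maximum(pre, 0) / np.sqrt(m)                # (1/sqrt(m)) sigma(a_i.x)
    return gA.ravel(), gb

Phi_a = np.stack([grads(x)[0] for x in X])
Phi_b = np.stack([grads(x)[1] for x in X])
Ka, Kb = Phi_a @ Phi_a.T, Phi_b @ Phi_b.T               # K_m^a and K_m^b
Phi = np.concatenate([Phi_a, Phi_b], axis=1)            # full features phi(x)
K = Phi @ Phi.T                                         # finite-width NTK Gram matrix

print(np.allclose(K, Ka + Kb), np.linalg.eigvalsh(K).min() >= -1e-10)
```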

## Law of large numbers:

If the $a_i$s and $b_i$s are independently sampled at initialization, then by the law of large numbers, as $m\rightarrow \infty$:

$$ \begin{align} K_m^a(x,x') \xrightarrow{m\rightarrow \infty} K^a(x,x') = \mathbb{E}\big[b^2\sigma'(a^T x)\sigma'(a^T x')(x \cdot x')\big] \\ K_m^b(x,x') \xrightarrow{m\rightarrow \infty} K^b(x,x') = \mathbb{E}\big[\sigma(a^T x)\sigma(a^T x')\big] \end{align} $$

The interesting point is that these limiting kernels $K^a(x,x')$ and $K^b(x,x')$ are the NTK components arising from the infinite-width network.

### If $\sigma(\cdot)$ is ReLU and the distribution of the $a_i$ is rotation invariant (e.g., Gaussian):

$$ K^a(x,x') = \frac{(x \cdot x')\,\mathbb{E}[b^2]}{2\pi}\,(\pi -\theta(x,x')) $$

where $\theta(x,x') \in [0,\pi]$ is the angle between the two data points $x$ and $x'$. Similarly: $$ K^b(x,x') = \frac{\|x\|\,\|x'\|\,\mathbb{E}[\|a\|^2]}{2\pi d}\Big((\pi -\theta(x,x'))\cos \theta + \sin \theta\Big) $$
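These closed forms can be sanity-checked by Monte Carlo. The sketch below assumes $a \sim \mathcal{N}(0, I_d)$ and $b \sim \mathcal{N}(0,1)$, so $\mathbb{E}[b^2]=1$ and $\mathbb{E}[\|a\|^2]=d$; the test points and sample size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 500_000
x  = np.array([1.0, 0.0, 0.0])            # unit-norm test points
xp = np.array([0.6, 0.8, 0.0])
theta = np.arccos(x @ xp)                  # angle between x and x'

A = rng.normal(size=(m, d))                # a ~ N(0, I_d)
b = rng.normal(size=m)                     # b ~ N(0, 1)
u, v = A @ x, A @ xp

# Monte-Carlo estimates of the finite-width kernels (ReLU: sigma'(z) = 1[z>0])
Ka_mc = np.mean(b**2 * (u > 0) * (v > 0)) * (x @ xp)
Kb_mc = np.mean(np.maximum(u, 0) * np.maximum(v, 0))

# Closed forms with E[b^2] = 1, E[||a||^2] = d, ||x|| = ||x'|| = 1
Ka = (x @ xp) / (2 * np.pi) * (np.pi - theta)
Kb = 1.0 / (2 * np.pi) * ((np.pi - theta) * np.cos(theta) + np.sin(theta))

print(np.isclose(Ka_mc, Ka, atol=1e-2), np.isclose(Kb_mc, Kb, atol=1e-2))
```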

**Question:** When is the Taylor expansion good? **Answer:** When the Hessian has bounded eigenvalues (a nearly constant Hessian). We showed this for the one-hidden-layer network.

## Analyze Gradient Descent:

As $\eta \rightarrow 0$, gradient descent becomes a gradient flow. Each gradient descent step is:

$$ W(t+1) = W(t) - \eta \nabla_W \mathcal{L}(W(t)) $$

Then:

$$ \frac{W(t+1) - W(t)}{\eta} = - \nabla_W \mathcal{L}(W(t)) $$

and letting $\eta \rightarrow 0$: $$ \frac{\partial W(t)}{\partial t} = - \nabla_W\mathcal{L}(W(t)) $$

Recall that:

$$ \mathcal{L}(W(t)) = \frac{1}{2} \|\hat{y}(W)-y\|^2 , \quad y\in \mathbb{R}^n , \ \hat{y}(W) \in \mathbb{R}^n $$

Then: $$ \nabla_W\mathcal{L}(W) = \nabla \hat{y}(W) (\hat{y}(W)-y) $$

Therefore: $$ \frac{\partial W(t)}{\partial t} = - \nabla \hat{y}(W) (\hat{y}(W)-y) $$

This equation describes the dynamics of the weights in parameter space. What about the dynamics of the outputs?

$$ \begin{align} \frac{\partial \hat{y}(W(t))}{\partial t} = & \nabla \hat{y}(W)^T \frac{\partial W(t)}{\partial t} \\ = & - \nabla \hat{y}(W)^T \nabla \hat{y}(W) (\hat{y}(W)-y) \\ \approx & - K(W_0) (\hat{y}(W)-y) \end{align} $$

where the last step uses the lazy-training approximation $\nabla \hat{y}(W)^T \nabla \hat{y}(W) \approx K(W_0)$, the NTK at initialization. To simplify the notation, let $u = \hat{y}-y$. Then:

$$ \frac{\partial u}{\partial t} = - K(W_0) u $$

What is the solution of this linear ODE?

$$ u(t) = \exp(-K(W_0)\,t)\, u(0) $$

In the overparameterized case, $K(W_0)$ is positive definite, with eigendecomposition:

$$ K(W_0) = \sum_{i=1}^n \lambda_i v_i v_i^T , \quad 0 < \lambda_1< \dots <\lambda_n $$

Then:

$$ u(t) = \sum_{i=1}^n e^{-\lambda_i t}\, (v_i^T u(0))\, v_i $$

Therefore, the minimum eigenvalue $\lambda_1$ determines the rate of convergence: the component of the error along $v_1$ decays the slowest.
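The convergence behavior can be illustrated on a toy kernel matrix: solving $\partial u/\partial t = -Ku$ through the eigendecomposition and comparing against a small-step Euler discretization of the gradient flow (all sizes and step counts here are arbitrary):

```python
import numpy as np

# Toy positive-definite "kernel matrix" K and initial error u(0).
rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
K = B @ B.T + 0.1 * np.eye(3)
u0 = rng.normal(size=3)
t = 2.0

# Exact solution via eigendecomposition: u(t) = sum_i e^{-lambda_i t} (v_i^T u0) v_i
lam, V = np.linalg.eigh(K)
u_exact = V @ (np.exp(-lam * t) * (V.T @ u0))

# Euler discretization of du/dt = -K u with a small step
steps = 20_000
dt = t / steps
u = u0.copy()
for _ in range(steps):
    u -= dt * K @ u

print(np.allclose(u, u_exact, atol=1e-3))
```

The exact solution makes the role of the eigenvalues explicit: each error component $v_i^T u(0)$ decays at its own rate $e^{-\lambda_i t}$, so after a long time only the $\lambda_1$ (smallest-eigenvalue) direction remains.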

We covered this material in Intermediate Machine Learning (S&DS 365/565) at Yale University, taught by John Lafferty, where I was a teaching fellow. Some of the notation is also borrowed from Soheil Feizi’s course on neural tangent kernels.