Neural Tangent Kernels
Let's start with a simple regression method. Assume we have a dataset of $n$ points $\{(x_i,y_i)\}_{i=1}^n$ where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$:
$$ f(W,x) = W^T x $$
is a simple linear model for the data points. A natural objective is then to minimize the squared-error loss:
$$ \min_W \mathcal{L}(W) = \frac{1}{2} \sum_{i=1}^n \big(y_i - f(W,x_i)\big)^2 $$
where $\hat{y}_i = f(W,x_i)$ is our prediction. This is a convex optimization problem, and each gradient descent step is defined by:
$$ \begin{aligned} W(t+1) = & W(t) - \eta_t \nabla_W \mathcal{L}(W(t)) \\ = & W(t) + \eta_t \sum_{i=1}^n \big(y_i-f(W(t),x_i)\big)\,\nabla_W f(W(t),x_i) \\ = & W(t) + \eta_t \sum_{i=1}^n \big(y_i-f(W(t),x_i)\big)\, x_i \end{aligned} $$
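As a quick sanity check, here is a minimal NumPy sketch of this update on synthetic data (the data sizes, step size $\eta$, and number of steps are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5                        # number of points, input dimension
X = rng.normal(size=(n, d))         # rows are the x_i
w_true = rng.normal(size=d)
y = X @ w_true                      # noiseless targets, just for illustration

W = np.zeros(d)                     # W(0)
eta = 0.01                          # step size (hypothetical choice)
for t in range(1000):
    residual = y - X @ W            # y_i - f(W(t), x_i)
    W = W + eta * X.T @ residual    # W(t+1) = W(t) + eta * sum_i (y_i - f(W(t), x_i)) x_i

print(np.linalg.norm(W - w_true))   # should be close to 0 on this noiseless problem
```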
Kernel Method:
We previously introduced kernel methods here. Instead of representing our data points as $x_i \in \mathbb{R}^d$, we represent them in a nonlinear transformation space, possibly of much larger dimension: $\phi(x_i) \in \mathbb{R}^D$ with $D \gg d$.
Example: Polynomial kernels
Assume: $$ \phi(x)= \begin{bmatrix} 1 \\ x_i \\ x_ix_j \\ \vdots \\ x_{i_1}x_{i_2}\cdots x_{i_p} \end{bmatrix} \in \begin{bmatrix} \mathbb{R}^1 \\ \mathbb{R}^d \\ \mathbb{R}^{d^2} \\ \vdots \\ \mathbb{R}^{d^p} \end{bmatrix} $$
Then, our model would be:
$$ f(W,x) = W^T \phi(x) $$
- Question: Is this model linear in $W$? answer: Yes
- Question: Is this model linear in $X$? answer: No
$$ \min_W \mathcal{L}(W) = \frac{1}{2} \sum_{i=1}^n \big(y_i - W^T\phi(x_i)\big)^2 $$
But here we face two issues: 1. $\phi(\cdot)$ is fixed, and 2. we suffer from the curse of dimensionality, since $\phi(x) \in \mathbb{R}^D$ with $D \gg d$, namely $D = \mathcal{O}(d^p)$ for a polynomial of order $p$.
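To get a feel for this blow-up, the number of monomials of degree at most $p$ in $d$ variables is $\binom{d+p}{p}$, which grows like $d^p$; a tiny sketch:

```python
from math import comb

# Number of monomials of degree <= p in d variables is C(d + p, p) = O(d^p)
d = 100
for p in [1, 2, 3, 5]:
    print(p, comb(d + p, p))
```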
Kernel Trick:
In most cases we don't actually need to compute $\phi(x_i)$ at all. Instead, we work with the kernel
$$ K(x_i,x_j) = \langle\phi(x_i),\phi(x_j)\rangle $$
The kernel matrix $K\in \mathbb{R}^{n\times n}$ represents a measure of similarity between data points; it is symmetric and positive semi-definite. In many cases $K(x_i,x_j)$ can be computed without ever constructing $\phi(x_i)$ explicitly.
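For example, the polynomial kernel $K(x_i,x_j) = (1 + x_i\cdot x_j)^p$ corresponds (up to a reweighting of the monomials) to the feature map above, yet costs only $\mathcal{O}(d)$ per pair to evaluate. A minimal sketch on synthetic data:

```python
import numpy as np

def poly_kernel(X, p=3):
    """Polynomial kernel K(x_i, x_j) = (1 + x_i . x_j)^p, computed without
    ever forming the O(d^p)-dimensional feature map phi(x)."""
    return (1.0 + X @ X.T) ** p

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                  # n = 50 points in d = 10 dimensions
K = poly_kernel(X, p=3)

print(np.allclose(K, K.T))                     # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-6)    # positive semi-definite (up to numerics)
```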
Neural Networks:
Let’s define a simple neural network with two layers:
$$ y = f(W,x) = \textcolor{red}{\frac{1}{\sqrt{m}}}\sum_{i=1}^m b_i \sigma(a_i'x) $$
The scaling factor $\textcolor{red}{\frac{1}{\sqrt{m}}}$ has some almost magical consequences that we will come back to later.
$$ \mathcal{L}(W) = \frac{1}{2} \sum_{i=1}^n ( f(W,x_i)-y_i)^2 $$
$$ \begin{aligned} W(t+1) = & W(t) - \eta_t \nabla_W \mathcal{L}(W(t)) \\ = & W(t) - \eta_t \sum_{i=1}^n \big(f(W(t),x_i)-y_i\big)\,\nabla_W f(W(t),x_i) \end{aligned} $$
In the linear model, $\nabla_W f(W,x_i)=x_i$ is static: it does not change during training. For the neural network, let's initialize the weights with $\mathcal{N}(0,1)$ and run gradient descent:
$$ W(0) \rightarrow W(1) \rightarrow W(2) \rightarrow \cdots $$
Empirical observation: when $m$ (the width of the network) is large, the gradients $\nabla_W f(W(t),x_i)$ along the trajectory of gradient descent are almost static. In machine learning this phenomenon is called “lazy training”.
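A minimal sketch of this observation: train the two-layer network above with full-batch gradient descent and measure how far the weights move from initialization as the width $m$ grows (the data sizes, step size, and number of steps are arbitrary illustrative choices):

```python
import numpy as np

def relative_weight_movement(m, steps=500, eta=0.02, seed=0):
    """Train f(W,x) = (1/sqrt(m)) * sum_i b_i * relu(a_i' x) with full-batch
    gradient descent and report how far the hidden weights move from init."""
    rng = np.random.default_rng(seed)
    n, d = 20, 5
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)                  # arbitrary targets for the demo

    A = rng.normal(size=(m, d))             # entries of a_i ~ N(0, 1), stacked as rows of A
    b = rng.normal(size=m)                  # b_i ~ N(0, 1)
    A0 = A.copy()

    for _ in range(steps):
        Z = X @ A.T                         # pre-activations a_i' x, shape (n, m)
        H = np.maximum(Z, 0.0)              # relu
        f = H @ b / np.sqrt(m)              # network outputs
        r = f - y                           # residuals f(W, x_i) - y_i
        grad_b = H.T @ r / np.sqrt(m)
        grad_A = (((Z > 0) * b) * r[:, None]).T @ X / np.sqrt(m)
        A -= eta * grad_A
        b -= eta * grad_b

    return np.linalg.norm(A - A0) / np.linalg.norm(A0)

for m in [10, 100, 1000, 10000]:
    print(m, relative_weight_movement(m))   # relative movement shrinks as m grows
```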
First-Order Taylor Approximation:
Because the changes in $W(t)$ are small, we can use a first-order Taylor approximation around the initialization $W_0$:
$$ f(W,x) \approx f(W_0,x) + \nabla_W f(W_0,x)^T(W-W_0) $$
- Question: Is this approximation linear in $W$? answer: Yes
- Question: Is this approximation linear in $x$? answer: No
Let’s call:
$$ \phi(x) = \nabla_W f(W_0,x) $$
the nonlinear transformation of $x$ (instead of the polynomial transformation). Then,
$$ K(x_i,x_j) = \langle\phi(x_i),\phi(x_j)\rangle $$
is called the Neural Tangent Kernel (NTK). Here the transformation comes from the neural network evaluated at the initialization $W_0$, so it is a fixed transformation that does not change along the gradient descent trajectory. Everything is therefore consistent with the kernel setup above: the optimization problem is convex and $\phi(\cdot)$ is fixed. For our two-layer network,
$$ f_m(W,x) = \textcolor{red}{\frac{1}{\sqrt{m}}}\sum_{i=1}^m b_i \sigma(a_i'x) $$
Then: $$ \begin{aligned} \nabla_{a_i}f_m(W,x) = & \textcolor{red}{\frac{1}{\sqrt{m}}}\,b_i\, \sigma'(a_i'x)\,x \\ \nabla_{b_i}f_m(W,x) = & \textcolor{red}{\frac{1}{\sqrt{m}}}\,\sigma(a_i'x) \end{aligned} $$
Therefore:
$$ K_m(x,x') = K_m^a(x,x') + K_m^b(x,x') $$
where:
$$ \begin{aligned} K_m^a(x,x') = & \textcolor{red}{\frac{1}{m}}\sum_{i=1}^m b_i^2\, \sigma'(a_i'x)\,\sigma'(a_i'x')\,(x\cdot x')\\ K_m^b(x,x') = & \textcolor{red}{\frac{1}{m}}\sum_{i=1}^m \sigma(a_i'x)\, \sigma(a_i'x') \end{aligned} $$
where $x\cdot x'$ denotes the inner product between the two data points.
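Here is a minimal NumPy sketch that builds the finite-width kernel $K_m = K_m^a + K_m^b$ for the ReLU activation directly from these formulas (the data is synthetic and the function name is mine):

```python
import numpy as np

def finite_width_ntk(X, m=5000, seed=0):
    """Finite-width NTK K_m = K_m^a + K_m^b of the two-layer ReLU network
    f(W,x) = (1/sqrt(m)) * sum_i b_i * relu(a_i' x), at random initialization."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.normal(size=(m, d))               # rows are a_i ~ N(0, I_d)
    b = rng.normal(size=m)                    # b_i ~ N(0, 1)

    Z = X @ A.T                               # a_i' x, shape (n, m)
    S = (Z > 0).astype(float)                 # sigma'(a_i' x) for relu
    H = np.maximum(Z, 0.0)                    # sigma(a_i' x)

    K_a = ((S * b**2) @ S.T) * (X @ X.T) / m  # (1/m) sum_i b_i^2 s'(a_i'x) s'(a_i'x') (x.x')
    K_b = H @ H.T / m                         # (1/m) sum_i s(a_i'x) s(a_i'x')
    return K_a + K_b

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))                   # 5 points in 3 dimensions
print(finite_width_ntk(X).round(3))           # symmetric PSD 5 x 5 kernel matrix
```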
Law of large numbers:
If the $a_i$'s and $b_i$'s are independent samples at initialization, then by the law of large numbers, as $m\rightarrow \infty$:
$$ \begin{aligned} K_m^a(x,x') \xrightarrow{m\rightarrow \infty} K^a(x,x') = & \ \mathbb{E}\big[b^2\,\sigma'(a'x)\,\sigma'(a'x')\,(x\cdot x')\big] \\ K_m^b(x,x') \xrightarrow{m\rightarrow \infty} K^b(x,x') = & \ \mathbb{E}\big[\sigma(a'x)\,\sigma(a'x')\big] \end{aligned} $$
The interesting point is that these kernels $K^a(x,x')$ and $K^b(x,x')$ together make up the NTK in the infinite-width limit.
If $\sigma(\cdot)$ is the ReLU and the distribution of the $a_i$'s is rotation invariant (e.g., Gaussian), then:
$$ K^a(x,x') = \frac{(x\cdot x')\,\mathbb{E}[b^2]}{2\pi}\big(\pi -\theta(x,x')\big) $$
where $\theta(x,x') \in [0,\pi]$ is the angle between the two data points $x$ and $x'$. Similarly: $$ K^b(x,x') = \frac{\|x\|\,\|x'\|\,\mathbb{E}[\|a\|^2]}{2\pi d}\Big(\big(\pi -\theta(x,x')\big)\cos \theta + \sin \theta\Big) $$
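As a sanity check, here is a Monte-Carlo sketch comparing the finite-width kernel at large $m$ with these closed forms, assuming $a \sim \mathcal{N}(0, I_d)$ and $b \sim \mathcal{N}(0,1)$ (so that $\mathbb{E}[b^2] = 1$ and $\mathbb{E}[\|a\|^2] = d$):

```python
import numpy as np

def relu_ntk_closed_form(x, xp):
    """Infinite-width NTK K^a + K^b of the two-layer ReLU network,
    assuming a ~ N(0, I_d) and b ~ N(0, 1)."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    cos_t = np.clip(x @ xp / (nx * nxp), -1.0, 1.0)
    theta = np.arccos(cos_t)
    K_a = (x @ xp) / (2 * np.pi) * (np.pi - theta)
    K_b = nx * nxp / (2 * np.pi) * ((np.pi - theta) * cos_t + np.sin(theta))
    return K_a + K_b

# Monte-Carlo estimate of K_m at large m, for one pair of points
rng = np.random.default_rng(0)
d, m = 3, 200_000
x, xp = rng.normal(size=d), rng.normal(size=d)
A, b = rng.normal(size=(m, d)), rng.normal(size=m)
za, zb = A @ x, A @ xp
K_m = np.mean(b**2 * (za > 0) * (zb > 0)) * (x @ xp) + np.mean(np.maximum(za, 0) * np.maximum(zb, 0))

print(K_m, relu_ntk_closed_form(x, xp))   # the two values should be close for large m
```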
- Question: When is the Taylor expansion good? answer: When the Hessian has bounded eigenvalues (i.e., the Hessian is nearly constant). We showed this for the one-layer neural network.
Analyzing Gradient Descent:
When $\eta \rightarrow 0$, gradient descent becomes gradient flow. Start from the update:
$$ W(t+1) = W(t) - \eta \nabla_W \mathcal{L}(W(t)) $$
Then:
$$ \frac{W(t+1) - W(t)}{\eta} = - \nabla_W \mathcal{L}(W(t)) $$
Taking $\eta \rightarrow 0$: $$ \frac{\partial W(t)}{\partial t} = - \nabla_W\mathcal{L}(W(t)) $$
Recall that:
$$ \mathcal{L}(W(t)) = \frac{1}{2} \|\hat{y}(W)-y\|^2 , \qquad y\in \mathbb{R}^n ,\ \hat{y}(W) \in \mathbb{R}^n $$
Then: $$ \nabla_W\mathcal{L}(W) = \nabla \hat{y}(W)\, (\hat{y}(W)-y) $$
Therefore: $$ \frac{\partial W(t)}{\partial t} = - \nabla \hat{y}(W)\, (\hat{y}(W)-y) $$
This describes the dynamics of the weights in parameter space. What about the dynamics of the output?
$$ \begin{aligned} \frac{\partial \hat{y}(W(t))}{\partial t} = & \nabla \hat{y}(W)^T \frac{\partial W(t)}{\partial t} \\ = & - \nabla \hat{y}(W)^T \nabla \hat{y}(W)\, (\hat{y}(W)-y) \\ \approx & - K(W_0)\, (\hat{y}(W)-y) \end{aligned} $$
In the last step we used lazy training: the gradients stay close to their values at initialization, so $\nabla \hat{y}(W(t))^T \nabla \hat{y}(W(t)) \approx \nabla \hat{y}(W_0)^T \nabla \hat{y}(W_0) = K(W_0)$.
To simplify the notation, let $u = \hat{y}-y$. Then:
$$ \frac{\partial u}{\partial t} = - K(W_0) u $$
What is the solution of this linear ODE?
$$ u(t) = e^{-K(W_0)\,t}\, u(0) $$
In the overparameterized case, $K(W_0)$ is positive definite, with eigendecomposition:
$$ K(W_0) = \sum_{i=1}^n \lambda_i v_i v_i^T , \qquad 0 < \lambda_1 \le \cdots \le \lambda_n $$
Then:
$$ u(t) = \sum_{i=1}^n e^{-\lambda_i t}\, \big(v_i^T u(0)\big)\, v_i $$
Therefore, the minimum eigenvalue $\lambda_1$ determines the rate of convergence: the component of the error along $v_1$ decays the slowest, at rate $e^{-\lambda_1 t}$.
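A minimal NumPy sketch of these dynamics, using a hypothetical PSD matrix in place of $K(W_0)$: the residual $u(t)$ is computed in the eigenbasis, and the component along the smallest eigenvalue is the slowest to vanish:

```python
import numpy as np

# A hypothetical fixed kernel standing in for K(W_0): any symmetric PSD matrix works here
rng = np.random.default_rng(0)
G = rng.normal(size=(5, 8))
K = G @ G.T / 8                               # n x n, symmetric positive semi-definite
u0 = rng.normal(size=5)                       # u(0) = y_hat(0) - y

lam, V = np.linalg.eigh(K)                    # K = sum_i lam_i v_i v_i^T, lam in ascending order

def u(t):
    # u(t) = exp(-K t) u(0) = sum_i exp(-lam_i t) (v_i' u(0)) v_i
    return V @ (np.exp(-lam * t) * (V.T @ u0))

for t in [0.0, 1.0, 10.0, 100.0]:
    coeffs = np.exp(-lam * t) * (V.T @ u0)    # components along each eigendirection
    print(t, np.linalg.norm(u(t)), coeffs.round(4))
# the component along v_1 (smallest eigenvalue) decays the slowest
```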
This material was covered in Intermediate Machine Learning (SDS 365/565) at Yale University, taught by John Lafferty, where I was a TF. Some of the notation is also borrowed from Soheil Feizi's course on the neural tangent kernel.