Learning Optimal Transport from Samples
Optimal transport seeks a mapping that minimizes the total cost of transporting mass from $\alpha$ to $\beta$, as initially formulated by Monge (1781). Shape matching, data assimilation, and active brain transportation are among the many research areas driving the search for appropriate mappings between $\alpha$ and $\beta$ for their applications.
Throughout this blog post, $\alpha$ and $\beta$ denote locations in the source and target, respectively, while $a$ and $b$ denote the corresponding distributions.
The motivation for this post comes from a recent paper by Makkuva et al. (ICML 2020), who proposed a novel approach to learning optimal transport maps from samples.
It is worth mentioning that optimal transport was originally motivated by problems such as allocating resources between warehouses, handling traffic, and assigning soldiers to different camps.
Here, let's first discuss the Monge problem itself and then explain Makkuva et al.'s sampling-based approach to learning the mappings.
Monge Problem
Let's define some resources $x_1,\dots,x_n$ in $\alpha$ and some resources $y_1,\dots,y_m$ in $\beta$. We then specify weight vectors $a$ and $b$ over these resources and define a cost matrix $C$ measuring the pairwise distances between points $x_i \in \alpha$ and their images $T(x_i)$ in $\beta$. The Monge problem aims to solve the following optimization problem:
$$ \min_{ T} \\\{ \sum_i C(x_i, T (x_i)) : T_{\sharp} \alpha = \beta \\\} $$
where the push-forward operator $\sharp$ indicates the direction of the transportation plan: $T$ transports the mass of $\alpha$ onto $\beta$.
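To make this concrete, here is a minimal numerical sketch. It assumes equal numbers of points and uniform weights, so a Monge map reduces to a permutation; the brute-force search and the helper names (`monge_cost`, `brute_force_monge`) are purely illustrative, not part of the original formulation.

```python
import itertools
import numpy as np

def monge_cost(X, Y, perm):
    """Total squared-distance cost of sending x_i to y_{perm(i)}."""
    return sum(np.sum((X[i] - Y[j]) ** 2) for i, j in enumerate(perm))

def brute_force_monge(X, Y):
    """Enumerate all permutations (only viable for tiny n) and keep the cheapest map T."""
    n = len(X)
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(n)):
        cost = monge_cost(X, Y, perm)
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

# Toy example: 4 source and 4 target points in the plane.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
perm, cost = brute_force_monge(X, Y)
print("Monge map (as a permutation):", perm, "with cost", cost)
```

Brute-force enumeration is factorial in $n$ and only meant to illustrate the definition; the relaxation and linear-programming view below scale much better.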
Kantorovich Relaxation
Kantorovich proposed a relaxation of the Monge problem in which mass from a single source bin can be dispatched to multiple locations in the target. We define the set of feasible solutions for this problem, $U(a,b)$, as:
$$ U(a,b) = \\\{{T} \in \mathrm{R}^{n\times m}_+ : {T} \mathrm{1}_m =a , {T}^T\mathrm{1}_n=b\\\}, $$
Here $\mathrm{1}_n$ and $\mathrm{1}_m$ are all-ones vectors, so these constraints ensure that no mass is lost during the transportation plan $T$: its rows sum to $a$ and its columns sum to $b$. An optimal solution is obtained by solving the following problem for a given ground cost matrix $C \in \mathbb{R}^{n\times m}$: $$ L_c(a,b) = \min_{{T} \in U(a,b)} \langle C,{T} \rangle = \sum_{i,j} C_{i,j} {T}_{i,j}. $$
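As a quick numerical illustration of $U(a,b)$ and $\langle C, T\rangle$, the sketch below uses the POT (Python Optimal Transport) library's exact solver; relying on `ot.dist` and `ot.emd` is my assumption about tooling, not something from the original post.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

# Discrete measures: n source points with weights a, m target points with weights b.
n, m = 5, 7
rng = np.random.default_rng(0)
x, y = rng.normal(size=(n, 2)), rng.normal(size=(m, 2))
a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)

C = ot.dist(x, y)      # ground-cost matrix (squared Euclidean by default)
T = ot.emd(a, b, C)    # an optimal coupling in U(a, b)

# The marginal constraints T 1_m = a and T^T 1_n = b hold up to numerical error,
# and <C, T> is the optimal transport cost L_c(a, b).
assert np.allclose(T.sum(axis=1), a) and np.allclose(T.sum(axis=0), b)
print("L_c(a, b) =", np.sum(C * T))
```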
Linear Programming
With a little linear algebra, we can formulate the Kantorovich relaxation as a linear program:
$$ L_c(a,b) = \min_{t \geq 0} \; c^{\top} t \quad \textbf{s.t. } \; A t = \begin{bmatrix} a \\\ b \end{bmatrix} $$
where $t = \mathrm{vec}({T})$ and $c = \mathrm{vec}(C)$ stack the entries of ${T}$ and $C$ into vectors, and $A$ is the constraint matrix encoding the row-sum and column-sum (marginal) constraints of $U(a,b)$.
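Here is a hedged sketch of that linear program, building $A$ from Kronecker products and solving with `scipy.optimize.linprog`; the helper `kantorovich_lp` is my own name for it.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_lp(a, b, C):
    """Solve min <C, T> over U(a, b) as a linear program in t = vec(T) (row-major)."""
    n, m = C.shape
    # Row-sum constraints T 1_m = a and column-sum constraints T^T 1_n = b.
    A_eq = np.vstack([
        np.kron(np.eye(n), np.ones((1, m))),   # selects each row of T
        np.kron(np.ones((1, n)), np.eye(m)),   # selects each column of T
    ])
    b_eq = np.concatenate([a, b])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.x.reshape(n, m), res.fun

# Tiny sanity check with uniform marginals.
rng = np.random.default_rng(1)
C = rng.random((3, 4))
a, b = np.full(3, 1 / 3), np.full(4, 1 / 4)
T, cost = kantorovich_lp(a, b, C)
print("optimal cost L_c(a, b) =", cost)
```

Note that the transport plan has $n \times m$ variables, which is exactly why this formulation struggles in the high-dimensional settings discussed next.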
Algorithmic solutions to this linear program are well established in the literature. In high-dimensional (continuous and semi-discrete) settings, however, these algorithms do not extend well: they demand substantial computation and become intractable. A few lines of research have suggested quantizing the space to handle the curse of dimensionality. For example, a Gradual, Semi-Discrete Approach was proposed at ICML 2019 (by the UIUC group) to learn mappings in two steps:
- If the source randomness of the network is a distribution, then optimal transport is achieved by a deterministic evaluation.
- Let's say we have an optimal transport mapping between the generator network and the target distribution. Then we can decrease the Wasserstein distance via a regression between the generated data and the mapped target points, using dual variables $\Phi_1,\Phi_2,\dots,\Phi_N$ attached to the target points (a rough numerical sketch of this objective follows the list): $$ L(x) = \frac{1}{B} \sum_{j=1}^B \min_i (c(x_j,y_i)-\Phi_i) + \frac{1}{N} \sum_{i=1}^N \Phi_i $$
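Here is the rough numerical sketch promised above: it is my own reading of the objective $L(x)$, with a mini-batch $x_1,\dots,x_B$ of generated samples, target points $y_1,\dots,y_N$, dual variables $\Phi$, and the squared Euclidean distance as the cost $c$.

```python
import numpy as np

def semi_discrete_objective(x_batch, y_points, phi):
    """Monte-Carlo estimate of
    L(x) = (1/B) sum_j min_i [c(x_j, y_i) - phi_i] + (1/N) sum_i phi_i,
    with c taken as the squared Euclidean distance."""
    # Pairwise costs c(x_j, y_i), shape (B, N).
    cost = np.sum((x_batch[:, None, :] - y_points[None, :, :]) ** 2, axis=-1)
    return np.mean(np.min(cost - phi[None, :], axis=1)) + np.mean(phi)

rng = np.random.default_rng(0)
B, N, d = 8, 16, 2
x_batch = rng.normal(size=(B, d))     # samples from the generator
y_points = rng.normal(size=(N, d))    # discrete target points
phi = np.zeros(N)                     # dual variables Phi_1, ..., Phi_N
print("L(x) =", semi_discrete_objective(x_batch, y_points, phi))
```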
However, this approach requires evaluating every single point. A heuristic fix is to sample a subset of the constraints and add them as regularizers; yet this can introduce bias, so the resulting distribution may not be accurate. Makkuva et al. instead proposed learning optimal transport from samples by searching over the set of all convex functions.
They establish their method using the dual form of Kantorovich's relaxation:
$$ W(a,b) = \sup_{(f,g) \in \phi_c} \mathrm{E}_a[f(x)]+\mathrm{E}_b[g(y)] $$
where $\phi_c$ is the set of constrained functions defined by: $f(x)+g(y) \leq \frac{1}{2} ||x-y||^2$.
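As a toy sanity check of this dual (my own example, not from the paper): for one-dimensional Gaussians $a=\mathcal{N}(0,1)$ and $b=\mathcal{N}(\mu,1)$, the pair $f(x)=-\mu x$, $g(y)=\mu y - \mu^2/2$ lies in $\phi_c$ and attains the known value $W(a,b)=\mu^2/2$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 2.0, 200_000
x = rng.normal(0.0, 1.0, n)         # samples from a = N(0, 1)
y = rng.normal(mu, 1.0, n)          # samples from b = N(mu, 1)

f = lambda x: -mu * x               # candidate potential f
g = lambda y: mu * y - 0.5 * mu**2  # candidate potential g

# Feasibility f(x) + g(y) <= 0.5 * (x - y)^2 holds pointwise for this pair,
# and the dual objective E_a[f] + E_b[g] approaches the optimal value mu^2 / 2.
dual_value = f(x).mean() + g(y).mean()
print(dual_value, "~", 0.5 * mu**2)
```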
Long story short, the authors take advantage of Input-convex Neural Networks (ICNNs) and search for the functions $f$ and $g$ within the set of convex functions.
Input-convex Neural Networks
Input-convex Neural Networks are a class of deep neural networks whose outputs are convex with respect to their inputs. The output of the network is defined recursively as follows:
$$ f(x;\theta) = h_L \quad \text{where} \quad h_{l+1} = \sigma_l (W_l h_l+b_l+A_l x), \quad l=0,1,\dots,L-1, $$ with the weights $W_l$ constrained to be non-negative and the activations $\sigma_l$ convex and non-decreasing, which is what guarantees convexity in $x$ (by convention $W_0 = 0$, so the first layer sees only $x$).
Equivalently, the architecture can be summarized schematically. [Figure: ICNN architecture schematic.]
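Below is a minimal PyTorch sketch of such a network; it is an illustrative implementation rather than the authors' code, and it keeps $f$ convex in $x$ by clamping the $W_l$ weights to be non-negative and using the convex, non-decreasing softplus activation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Input-convex network: h_{l+1} = softplus(W_l h_l + A_l x + b_l), f(x) = h_L,
    with W_l kept non-negative so that f is convex in x (no activation on the last layer)."""

    def __init__(self, dim_x, hidden=(64, 64)):
        super().__init__()
        widths = list(hidden) + [1]                                   # scalar output h_L
        self.A = nn.ModuleList([nn.Linear(dim_x, w) for w in widths])  # passthrough A_l x + b_l
        self.W = nn.ModuleList([nn.Linear(widths[l], widths[l + 1], bias=False)
                                for l in range(len(widths) - 1)])      # W_l h_l (clamped >= 0)

    def forward(self, x):
        h = F.softplus(self.A[0](x))                    # h_1 = sigma_0(A_0 x + b_0)
        for i, (A_l, W_l) in enumerate(zip(self.A[1:], self.W)):
            # Clamp W_l >= 0 so each layer is a non-negative combination of convex terms.
            z = F.linear(h, W_l.weight.clamp(min=0)) + A_l(x)
            h = z if i == len(self.W) - 1 else F.softplus(z)
        return h

f = ICNN(dim_x=2)
x = torch.randn(5, 2)
print(f(x).shape)   # (5, 1): one convex value per input point
```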
Indeed, Makkuva et al. solve the Kantorovich dual via ICNN functions without introducing any new regularization:
$$ W(a,b) = C_{a,b} -\inf_{(f,g) \in \phi_c} \mathrm{E}_a[f(x)]+\mathrm{E}_b[g(y)] $$
in which $C_{a,b} = \frac{1}{2} \mathrm{E}_a[||x||^2] + \frac{1}{2} \mathrm{E}_b[||y||^2]$ comes from expanding $\frac{1}{2}||x-y||^2 = \frac{1}{2}||x||^2 + \frac{1}{2}||y||^2 - \langle x,y \rangle$ inside the constraint $f(x)+g(y) \leq \frac{1}{2}||x-y||^2$. Moreover, following Villani (2013), we can reduce this to learning only one function:
$$ W(a,b) = C_{a,b} -\inf_{f \in \text{icnn}(a)} \mathrm{E}_a[f(x)]+\mathrm{E}_b[f^*(y)] $$
where $f^*(y) = \sup_x \langle x,y \rangle - f(x)$ is the convex conjugate of $f(\cdot)$.
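To build intuition for $f^*$, here is a small numerical sketch (my own illustration): for a quadratic $f(x)=\frac{1}{2}x^\top Q x$ with $Q \succ 0$, the conjugate has the closed form $f^*(y)=\frac{1}{2}y^\top Q^{-1}y$, and the inner $\sup_x \langle x,y\rangle - f(x)$ can also be approximated by gradient ascent, mirroring the definition.

```python
import numpy as np

def conjugate_by_gradient_ascent(f, grad_f, y, steps=500, lr=0.1):
    """Approximate f*(y) = sup_x <x, y> - f(x) by gradient ascent over x."""
    x = np.zeros_like(y)
    for _ in range(steps):
        x += lr * (y - grad_f(x))          # gradient of <x, y> - f(x) w.r.t. x
    return x @ y - f(x)

# Quadratic test case: f(x) = 0.5 x^T Q x  =>  f*(y) = 0.5 y^T Q^{-1} y.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x
y = np.array([1.0, -2.0])

print(conjugate_by_gradient_ascent(f, grad_f, y))   # numerical estimate
print(0.5 * y @ np.linalg.solve(Q, y))              # closed-form value
```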
Finally, we can define the optimal transport map as $\nabla f^*$:
$$ \mathrm{E}_b\, ||\nabla f^*(y)- y||^2 = \inf_{T: T_{\sharp} b=a} \mathrm{E}_b\, ||T(y)-y||^2 $$
It is derived from Brenier’s Theorem.
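Once a convex potential is available, $\nabla f^*$ can be evaluated with automatic differentiation. The sketch below is illustrative only: it uses a hand-picked quadratic $f^*$ whose gradient is a simple translation, in place of a trained ICNN.

```python
import torch

def transport_map(f_star, y):
    """Evaluate T(y) = grad f*(y) with autograd (one gradient per sample in the batch)."""
    y = y.clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(f_star(y).sum(), y)
    return grad

# Toy convex potential: f*(y) = 0.5 ||y||^2 + <mu, y>, whose gradient is the translation y + mu.
mu = torch.tensor([3.0, -1.0])
f_star = lambda y: 0.5 * (y ** 2).sum(dim=1) + y @ mu

y = torch.randn(4, 2)                  # samples "from b"
print(transport_map(f_star, y) - y)    # every row is approximately mu
```

In Makkuva et al.'s approach, the conjugate is handled with a second ICNN trained jointly with $f$; the quadratic here is only a stand-in to show the autograd mechanics.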
Brenier’s Theorem
This is based on Brenier's theorem, which states that if $b$ admits a density with respect to the Lebesgue measure on $\mathbb{R}^d$, then there is a unique optimal coupling $\pi$ for the Kantorovich problem above:
$$ d \pi (x,y) = d b(y)\, \delta_{x=\nabla f^*(y)}. $$