
gradiator

We want to do backpropagation on $f:ℝ^m→ℝ^n$, so we write the following PyTorch function:

    import torch

    def gradiator(fun, grad_fun):
        class GradiatorFunction(torch.autograd.Function):
            @staticmethod
            def forward(ctx, x):
                ctx.save_for_backward(x)  # save input for the backward pass
                return fun(x)

            @staticmethod
            def backward(ctx, grad_output):
                if not ctx.needs_input_grad[0]:
                    return None  # if the gradient is not required, don't compute it
                x, = ctx.saved_tensors  # restore the input from the context
                grad_input = grad_fun(x, grad_output)
                return grad_input

        return GradiatorFunction.apply

Here fun implements $f$, and grad_fun(x, grad_out) returns the vector-Jacobian product $grad_{out}^T ⋅ J_f(x) = grad_{in}^T$.
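For example, here is a minimal sketch of how gradiator could be wrapped around a known function (torch.sin and its hand-written VJP are my own toy choice, not something fixed by the setup above):

    import torch

    # toy example: f(x) = sin(x) elementwise, so J_f(x) = diag(cos(x))
    # and grad_out^T · J_f(x) = grad_out * cos(x)
    my_sin = gradiator(torch.sin, lambda x, grad_out: grad_out * torch.cos(x))

    x = torch.randn(5, requires_grad=True)
    my_sin(x).sum().backward()

    # matches the gradient PyTorch itself computes for torch.sin
    print(torch.allclose(x.grad, torch.cos(x.detach())))

With an exact grad_fun this just reproduces ordinary autograd; the question below is what to do when grad_fun has to be learned instead.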

How should we train grad_fun so that it behaves properly? $$\al{ grad\_fun(x,grad_{out}) &= grad_{out}^T ⋅ J_f(x) \\ ⇒ grad\_fun(x,grad_{out}) ⋅ dx &= grad_{out}^T ⋅ J_f(x) ⋅ dx \\ ⇒ grad\_fun(x,grad_{out}) ⋅ dx &= grad_{out}^T ⋅ df \\ }$$
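As a sanity check of the last line, a small numerical sketch (the test function, the sizes and the step size are arbitrary choices of mine): for a small perturbation $dx$ we can take $df ≈ f(x+dx) - f(x)$ and compare both sides.

    import torch

    def f(x):                          # an arbitrary test function ℝ^3 → ℝ^2
        return torch.stack([x[0] * x[1], x[2].sin()])

    def grad_fun(x, grad_out):         # its exact VJP: grad_out^T · J_f(x)
        J = torch.tensor([[x[1].item(), x[0].item(), 0.0],
                          [0.0, 0.0, torch.cos(x[2]).item()]])
        return grad_out @ J

    x = torch.randn(3)
    dx = 1e-4 * torch.randn(3)
    grad_out = torch.randn(2)

    df = f(x + dx) - f(x)                  # ≈ J_f(x) · dx to first order
    lhs = grad_fun(x, grad_out) @ dx       # grad_fun(x, grad_out) · dx
    rhs = grad_out @ df                    # grad_out^T · df
    print(lhs.item(), rhs.item())          # agree up to O(|dx|^2) terms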

And now I would really like to use the Moore-Penrose one-sided inverses $$\ca{ dx^{-1}_L := (dx^T⋅dx)^{-1}dx^T &⇐& dx^{-1}_L⋅dx = Id \\ dx^{-1}_R := dx^T(dx⋅dx^T)^{-1} &⇐& dx⋅dx^{-1}_R = Id }$$ But I can't: the one I need, $dx^{-1}_R$, requires inverting $dx⋅dx^T$, which only has rank $1$ when $dx$ is a single sample...
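A quick illustration of the obstruction, with dimensions I made up: for a single sample $dx$ is one column, so $dx⋅dx^T$ has rank 1 and only $dx^{-1}_L$ exists, while it is $dx^{-1}_R$ that would be needed to cancel $dx$ on the right of $grad\_fun ⋅ dx$.

    import torch

    m = 4
    dx = torch.randn(m, 1)                   # a single sample: one column

    outer = dx @ dx.T                        # m×m but only rank 1
    print(torch.linalg.matrix_rank(outer))   # -> 1, so `outer` is singular for m > 1

    gram = dx.T @ dx                         # 1×1, invertible: dx^{-1}_L exists...
    left_inv = torch.linalg.inv(gram) @ dx.T
    print(left_inv @ dx)                     # -> [[1.]], but it cannot cancel dx on the right of grad_fun·dx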

So let's give it as many samples as there are dimensions: with $m$ linearly independent samples stacked as the columns of $dx$, the matrix $dx⋅dx^T$ becomes invertible and we can do: $$\al{ grad\_fun(x,grad_{out}) ⋅ dx ⋅ dx^{-1}_R &= grad_{out}^T ⋅ df ⋅ dx^{-1}_R \\ ⇒ grad\_fun(x,grad_{out}) ⋅ dx ⋅ dx^T(dx⋅dx^T)^{-1} &= grad_{out}^T ⋅ df ⋅ dx^T(dx⋅dx^T)^{-1} \\ ⇒ grad\_fun(x,grad_{out}) &= grad_{out}^T ⋅ df ⋅ dx^T(dx⋅dx^T)^{-1} \\ }$$
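A hedged sketch of this recovery on a random linear map (names, sizes and the stand-in Jacobian are my own illustration): with $m$ independent probes the formula returns the exact row $grad_{out}^T ⋅ J_f(x)$.

    import torch

    m, n = 4, 3
    J = torch.randn(n, m)                  # stand-in Jacobian of f at x (f taken linear, so df = J·dx)
    grad_out_T = torch.randn(1, n)         # grad_out^T written directly as a row

    dx = torch.randn(m, m)                 # m probes stacked as columns -> full rank (almost surely)
    df = J @ dx                            # the matching outputs, one column per probe

    # grad_fun = grad_out^T · df · dx^T · (dx · dx^T)^{-1}
    recovered = grad_out_T @ df @ dx.T @ torch.linalg.inv(dx @ dx.T)
    print(torch.allclose(recovered, grad_out_T @ J, atol=1e-4))   # -> True: the exact VJP row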

In general we won't have that many samples at once, so let's instead optimize $grad\_fun$ to minimize $|grad\_fun ⋅ dx - grad_{out}^T ⋅ df|_2$.

We'll assume that we have a few samples of $dx, df, grad_{out}$ at a time.
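With only $k < m$ probes the minimizer of that objective is not unique; one hedged sketch (the sizes, the stand-in Jacobian and the choice of torch.linalg.pinv are mine) is to take the minimum-norm least-squares solution, which is again given by the Moore-Penrose pseudoinverse:

    import torch

    m, n, k = 5, 3, 2                      # k < m: fewer probes than input dimensions
    J = torch.randn(n, m)                  # stand-in Jacobian (f taken linear, so df = J·dx)
    grad_out_T = torch.randn(1, n)         # grad_out^T as a row

    dx = torch.randn(m, k)                 # k probe directions as columns
    c = grad_out_T @ (J @ dx)              # targets grad_out^T · df, shape (1, k)

    # minimum-norm row g minimizing |g·dx - c|_2
    g = c @ torch.linalg.pinv(dx)

    print(torch.allclose(g @ dx, c, atol=1e-5))          # the k constraints hold exactly
    print(torch.allclose(g, grad_out_T @ J, atol=1e-5))  # but g is generally not the true VJP yet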

For simplicity, let's rename everything:
$A:=grad\_fun$
$B:=dx$
$C:=grad_{out}^T ⋅ df$

Now we want to find $A = \argmin_A |A⋅B - C|_2$, where $|⋅|_2$ is the operator norm: $$\al{ |A⋅B - C|_2&:= \sup_{||x||_2=1} ||(A⋅B - C)⋅x||_2 \\ }$$
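That $\sup$ is the operator (spectral) norm, which PyTorch exposes as torch.linalg.matrix_norm(·, ord=2). A rough sketch of minimizing it over $A$ by plain gradient descent, with sizes and learning rate picked arbitrarily:

    import torch

    m, k = 4, 6
    B = torch.randn(m, k)
    C = torch.randn(1, k)                  # grad_fun is a single row here, so A is 1×m

    A = torch.zeros(1, m, requires_grad=True)
    opt = torch.optim.SGD([A], lr=0.05)

    for _ in range(2000):
        opt.zero_grad()
        # |A·B - C|_2 = sup_{||x||_2 = 1} ||(A·B - C)·x||_2, i.e. the largest singular value
        loss = torch.linalg.matrix_norm(A @ B - C, ord=2)
        loss.backward()
        opt.step()

    print(loss.item())                     # residual operator norm after optimization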

An interpretation of $grad\_fun ⋅ dx = grad_{out}^T ⋅ df$ is that $grad\_fun$ has a known amplitude along $dx$, while we know nothing about the other axes. $$\al{ &grad\_fun ⋅ dx &= grad_{out}^T ⋅ df \\ ⇒ &grad\_fun ⋅ \frac{dx}{|dx|} &= \frac{grad_{out}^T ⋅ df}{|dx|} \\ }$$ So $$grad\_fun = \frac{dx^T}{|dx|} \frac{grad_{out}^T ⋅ df}{|dx|} + λ⋅(dx^{\perp})^T$$ where by $λ⋅(dx^{\perp})^T$ we mean "some unknown contribution on the orthogonal axes".
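A hedged numerical check of this reading (the stand-in Jacobian and the sizes are mine): the single-sample estimate reproduces the true VJP's component along $dx$, while the orthogonal part $λ⋅(dx^{\perp})^T$ is simply not recoverable from one probe.

    import torch

    m, n = 5, 3
    J = torch.randn(n, m)                      # stand-in Jacobian at x (f taken linear)
    grad_out_T = torch.randn(1, n)             # grad_out^T as a row
    true_vjp = grad_out_T @ J                  # what grad_fun should ideally return

    dx = torch.randn(m, 1)                     # one probe
    df = J @ dx
    norm = torch.linalg.vector_norm(dx)

    # rank-one estimate: amplitude (grad_out^T·df)/|dx| along the unit direction dx^T/|dx|
    estimate = (grad_out_T @ df / norm) * (dx.T / norm)

    print(torch.allclose(estimate @ dx, true_vjp @ dx, atol=1e-5))  # component along dx matches
    print(torch.allclose(estimate, true_vjp, atol=1e-5))            # the full row does not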

Now if $dx$ includes multiple samples, how can we improve $grad\_fun$? $$\ca{ grad\_fun\_1 = \frac{(dx^1)^T}{|dx^1|} \frac{grad_{out}^T ⋅ df^1}{|dx^1|} \\ grad\_fun\_2 = \frac{(dx^2)^T}{|dx^2|} \frac{grad_{out}^T ⋅ df^2}{|dx^2|} \\ }$$