gradiator
We want to do backpropagation through a function $f:ℝ^m→ℝ^n$, so we write the following PyTorch function:
import torch

def gradiator(fun, grad_fun):
    class GradiatorFunction(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)  # save input for backward pass
            return fun(x)

        @staticmethod
        def backward(ctx, grad_output):
            if not ctx.needs_input_grad[0]:
                return None  # if grad not required, don't compute
            x, = ctx.saved_tensors  # restore input from context
            grad_input = grad_fun(x, grad_output)
            return grad_input

    return GradiatorFunction.apply
Where $fun(x)$ computes the forward pass and $grad\_fun(x, grad_{out})$ should return the gradient of the loss with respect to the input $x$ (the vector-Jacobian product).
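As a quick sanity check, here is a minimal usage sketch of the helper above (the square function and its hand-written backward are only illustrative):

square = gradiator(
    lambda x: x ** 2,                      # fun: forward pass
    lambda x, grad_out: 2 * x * grad_out,  # grad_fun: hand-written vector-Jacobian product
)

x = torch.randn(4, requires_grad=True)
square(x).sum().backward()
print(torch.allclose(x.grad, 2 * x))       # True: matches the analytic gradient of x²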
How should we train $grad\_fun$?
And now I really want to use the Moore-Penrose inverse $$\ca{ dx^{-1}_L := (dx^T⋅dx)^{-1}dx^T &⇐& dx^{-1}_L⋅dx = Id \\ dx^{-1}_R := dx^T(dx⋅dx^T)^{-1} &⇐& dx⋅dx^{-1}_R = Id }$$ But I can't, because the one I need (the right inverse $dx^{-1}_R$) requires inverting $dx⋅dx^T$, which has rank $1$ for a single sample...
Give it as many samples as there are dimensions: if $dx$ stacks one sample per column, $dx⋅dx^T$ becomes full rank and we can do: $$\al{ grad\_fun(x,grad_{out}) ⋅ dx ⋅ dx^{-1}_R &= grad_{out}^T ⋅ df ⋅ dx^{-1}_R \\ ⇒grad\_fun(x,grad_{out}) ⋅ dx ⋅ dx^T(dx⋅dx^T)^{-1} &= grad_{out}^T ⋅ df ⋅ dx^T(dx⋅dx^T)^{-1} \\ ⇒grad\_fun(x,grad_{out}) &= grad_{out}^T ⋅ df ⋅ dx^T(dx⋅dx^T)^{-1} \\ }$$
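As a sanity check of that formula, here is a small numpy sketch; the names and shapes are illustrative, and $f$ is taken linear so that $df = W⋅dx$ holds exactly:

import numpy as np

m, n = 5, 3                       # input / output dimensions
W = np.random.randn(n, m)         # Jacobian of a linear f(x) = W⋅x
grad_out = np.random.randn(n)     # upstream gradient

dx = np.random.randn(m, m)        # m samples, one per column: dx⋅dx^T is full rank
df = W @ dx                       # matching output perturbations

grad_fun = grad_out @ df @ dx.T @ np.linalg.inv(dx @ dx.T)  # grad_out^T ⋅ df ⋅ dx^T (dx⋅dx^T)⁻¹
print(np.allclose(grad_fun, grad_out @ W))                  # True: we recover grad_out^T ⋅ J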
What about adding slight noise to make the thing invertible? Is that a bad idea? Let $ε$ be a noise matrix. $$ dx^{-1}_R := dx^T(dx⋅dx^T + ε)^{-1} $$ Then let's check how badly this can fail as an inverse. $$\al{ dx ⋅ dx^{-1}_R &= dx ⋅ dx^T (dx⋅dx^T + ε)^{-1} \\ ⇒ dx ⋅ dx^{-1}_R + ε(dx⋅dx^T + ε)^{-1} &= (dx ⋅ dx^T+ε) (dx⋅dx^T + ε)^{-1} \\ ⇒ dx ⋅ dx^{-1}_R + ε(dx⋅dx^T + ε)^{-1} &= Id \\ ⇒ dx ⋅ dx^{-1}_R &= Id - ε(dx⋅dx^T + ε)^{-1}\\ }$$
Let's try this out
import numpy as np

D = 10
x = np.random.randn(D)             # x, a single sample
xx = np.outer(x, x)                # x⋅x.T (rank 1)
ε = np.random.randn(D, D) * 0.01   # noise
xxε = np.linalg.inv(xx + ε)        # (x⋅x.T + ε)⁻¹
xR = np.matmul(x, xxε)             # x.T⋅(x⋅x.T + ε)⁻¹, candidate right inverse
approxId = np.outer(x, xR)         # x⋅xR, hopefully close to identity
shouldbe1 = np.identity(D) - (ε @ xxε)   # Id - ε(x⋅x.T + ε)⁻¹, what x⋅xR equals exactly
One quickly realizes this is stupid: indeed, $dx⋅dx^{-1}_R$ is the outer product of two vectors, so it has rank $1$ and there is no way it is going to look like the identity.
My intuition says that when we go to higher dimensions, this kind of problem will persist on the "left-over" dimensions... is that true though? Let's try it numerically first.
D = 10
N = D // 2                          # fewer samples than dimensions
x = np.random.randn(D, N)           # x, one sample per column
xx = x @ x.T                        # x⋅x.T (rank N)
ε = np.random.randn(D, D) * 0.01    # noise
xxε = np.linalg.inv(xx + ε)         # (x⋅x.T + ε)⁻¹
xR = x.T @ xxε                      # x.T⋅(x⋅x.T + ε)⁻¹, candidate right inverse
shouldbeId = x @ xR                 # does xR work as an inverse?
shouldbeId2 = np.identity(D) - (ε @ xxε)   # should be equivalent to shouldbeId
errorfromId = ε @ xxε               # how bad is it?
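Here is a quick check of the "left-over dimensions" intuition, reusing the variables from the block above: the product $x⋅xR$ cannot reach the $D-N$ directions orthogonal to the columns of $x$, so on those directions the error is the whole identity.

U, _, _ = np.linalg.svd(x, full_matrices=True)
P_perp = np.identity(D) - U[:, :N] @ U[:, :N].T       # projector onto the D-N left-over directions

print(np.linalg.norm(P_perp @ shouldbeId))            # ≈ 0: x⋅xR vanishes on those directions
print(np.linalg.norm(P_perp @ errorfromId - P_perp))  # ≈ 0: there the error is exactly the identity

So the intuition holds: whatever directions the samples do not span stay completely wrong, noise or no noise.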
Let's optimize $grad\_fun$ to minimize $|grad\_fun ⋅ dx - grad_{out}^T ⋅ df|_2$
We'll assume that we have a few samples of $dx, df, grad_{out}$ at a time.
For simplicity, let's rename everything:
$A:=grad\_fun$
$B:=dx$
$C:=grad_{out}^T ⋅ df$
Now we want to find $A = \argmin_A |A⋅B - C|_2$, where $$\al{ |A⋅B - C|_2&:= \sup_{||x||_2=1} ||(A⋅B - C)⋅x||_2 \\ }$$
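If we settle for the Frobenius norm instead of the operator norm above, this becomes an ordinary least-squares problem with a closed-form answer. A minimal sketch, with illustrative shapes (transposed so that np.linalg.lstsq can solve it):

import numpy as np

m, k = 6, 4                              # input dimension, number of samples (k < m: underdetermined)
B = np.random.randn(m, k)                # B := dx, one sample per column
C = np.random.randn(1, k)                # C := grad_out^T ⋅ df, one scalar per sample

A_T, *_ = np.linalg.lstsq(B.T, C.T, rcond=None)  # solves B^T⋅A^T ≈ C^T in the least-squares sense
A = A_T.T                                        # A := grad_fun, shape (1, m)

print(np.linalg.norm(A @ B - C))         # residual of the fit, ≈ 0 here since k < m

Note that with fewer samples than dimensions, lstsq returns the minimum-norm $A$, silently setting to zero every direction the samples do not constrain.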
An interpretation of $grad\_fun ⋅ dx = grad_{out}^T ⋅ df$ is that $grad\_fun$ has a certain amplitude along $dx$ while we don't know anything about the other axes. $$\al{ &grad\_fun ⋅ dx &= grad_{out}^T ⋅ df \\ ⇒ &grad\_fun ⋅ \frac{dx}{|dx|} &= \frac{grad_{out}^T ⋅ df}{|dx|} \\ }$$ So $$grad\_fun = \frac{dx}{|dx|} \frac{grad_{out}^T ⋅ df}{|dx|} + λ⋅dx^{\perp}$$ Where by $λ⋅dx^{\perp}$ we mean "some unknown contribution on orthogonal axes".
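A tiny numerical illustration of that decomposition (names are illustrative): the component of $grad\_fun$ along $dx$ is pinned down, while anything orthogonal to $dx$ can be added freely without breaking the constraint.

import numpy as np

dx = np.random.randn(5)
df = np.random.randn(3)
grad_out = np.random.randn(3)

base = (grad_out @ df) / (dx @ dx) * dx         # (dx/|dx|)⋅(grad_out^T⋅df/|dx|), i.e. λ = 0
v = np.random.randn(5)
ortho = v - (v @ dx) / (dx @ dx) * dx           # some λ⋅dx^⊥
for est in (base, base + ortho):
    print(np.isclose(est @ dx, grad_out @ df))  # True both times: dx^⊥ is unconstrained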
Now if $dx$ includes multiple samples, how can we improve $grad\_fun$ ? $$\ca{ grad\_fun\_1 = \frac{dx^1}{|dx^1|} \frac{grad_{out}^T ⋅ df^1}{|dx^1|} \\ grad\_fun\_2 = \frac{dx^2}{|dx^2|} \frac{grad_{out}^T ⋅ df^2}{|dx^2|} \\ }$$