restoreJ on $f(x)=Ax$ : deep fully connected net VS Log-Lin-Exp
One can argue that the Jacobian is only a good heuristic for learning : it does not need to be modeled in detail for learning to happen.
This page proposes an approach that allows backprop through a given $f$ of unknown Jacobian, for cheaper than modeling the full Jacobian.
$∇_{λ}$ stands for the gradient of the loss with respect to $λ$
I will diverge a little from the usual convention, for readability :
| standard | me | abbreviation |
| --- | --- | --- |
| train | learn | L |
| test | test | T |
| validation | validation | V |
The function $f$ will be referred to as the blackbox.
I'd like to call the backprop model for $f$ the backbox, but that would become confusing, so let's call it ___________
The motivation for this project is to use a physical process as a source of computation, an idea known as reservoir computing. There are many approaches to training physical devices, such as :
- evolutionary selection / annealing : make random variations of the best candidates.
- differentiable surrogate : build a differentiable approximation and differentiate that instead.
The intuition for why this is worth investigating is that machine learning has shown how qualitative computing can give rise to very powerful computations. In contrast to scientific computing, many applications do not require high-precision number representations, so the in-silico emulation of arithmetic could be ditched. This could allow much simpler and more fault-tolerant computing devices.
If such a device can be built from computation performed by light propagation, one could hope for very low power consumption and high computing speed.
In our everyday life, light propagation has a very nice property : the principle of superposition.
But for an ML application, this property is terrible : we want to be able to lose the property of linearity.
$$ \text{model}(αx+βy) = α⋅\text{model}(x)+β⋅\text{model}(y) $$
However, when we consider very high-energy light sources, this principle of superposition disappears. Indeed, when the field $E$ is strong, charged particles in the material get displaced and interact with other particles. This causes non-trivial light-matter interaction, leading to non-linear behavior in the light propagation. This is excellent news for what we wish to build.
crash course on NLO (non-linear optics)
First of all, here is a visualization of what this is all about : one way to interpret machine learning is as "sculpting" a manifold parametrized by inputs to approximate a point cloud (given by a dataset).
A function with a large number of parameters can sculpt a certain family of surfaces, and the goal is to find the parameters that fit our data best. The function must also be designed so that many different surfaces are representable, and so that it is trainable in an efficient manner.
When designing a neural net, it is not easy to tell which activation to use, as we don't know in advance how the training process is going to make use of each part of the net. Empirical observations tend to show that some activations are generally better than others.
Here are some properties that are empirically tested, with some mathematical intuition, but no proofs or guarantees.
In this section we will consider $σ:ℝ→ℝ$ (or maybe $σ:ℂ→ℂ$) a $1d$ activation function that is vectorized as such
$$ σ(x)_i = σ(x_i) $$
with $x∈ℝ^N$ (or $x∈ℂ^N$)
This is what composition looks like.
$$\al{
σ(A⋅σ(x))_k
= σ\left( ∑_{i,j} A_{ij} ⋅ σ(x_j) ⋅ e_i \right)_k = σ\left( ∑_j A_{kj} ⋅ σ(x_j) \right) \\
}$$
For the intuition, I think it is good to keep in mind how components are mixed before feeding to the next layer.
$$σ(x)=∑_{k=0}^{N} λ_k x^k $$
Polynomials are famous for being bad activation functions.
Here is a list of reasons why that could be :
- compositions of polynomials are polynomials : what about non-analytic functions ?
Note that the degree increases exponentially with the number of compositions, $\deg(σ∘...∘σ) = \deg(σ^{∘k}) = \deg(σ)^k$ (maybe this could cause numerical instability as well).
- any function can be approximated with a sufficiently large fully connected net, except if the activation function is polynomial (https://doi.org/10.1016/S0893-6080(05)80131-5)
- the derivative is not bounded : the gradient could explode.
Because we use backpropagation, the activation would seem to need to be differentiable. However, that is not necessarily the case. The Jacobian is only a direction indicator : by smoothing the function for the backward pass, we can make training possible and the activation usable, or even improve learning capability.
An example would be to use a threshold function for the forward pass, but use sigmoid for the backward pass.
Another example is to use the Cantor function as an activation function : its Jacobian is $0$ almost everywhere, but if one uses the identity as a Jacobian replacement, we get better performance than with the identity activation function (tested on simple tasks such as MNIST classification and the XOR problem).
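As a minimal PyTorch sketch of the first example (hard threshold forward, sigmoid-derivative backward ; the class name and the steepness k below are my own placeholders) :
import torch

class ThresholdWithSigmoidGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, k=10.0):
        ctx.save_for_backward(x)
        ctx.k = k
        return (x > 0).float()                       # hard threshold : true Jacobian is 0 almost everywhere
    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        s = torch.sigmoid(ctx.k * x)                 # smooth stand-in for the threshold
        return grad_out * ctx.k * s * (1 - s), None  # the sigmoid's derivative replaces the true Jacobian
Usage : y = ThresholdWithSigmoidGrad.apply(x).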
One argument for very complex activation functions is that they take a tremendous amount of computation to re-simulate, which is to say that we could hope to exploit the source of computation for our own objectives. However, we should note that there is no guarantee that we can exploit that complexity, nor that the complexity is in itself useful. As an example, we could build a deterministic noise function or an arbitrary gigantic lookup table. So what are the properties we want (or want to avoid) in an activation function ?
The "jeff" activation function visualizes this problem. It looks like having a too complex activation function "constrains" us to a too specific subset of functions.
Empirical experiments using fixed Perlin noise as an activation function showed poor performance compared even to ReLU.
Monotonicity does not seem to be required. Sine looks pretty efficient.
ref for the information-storage result
the mega plot with all the activation functions
We will consider $x,y∈ℝ^{N×n}$, meaning they are lists of $N$ vectors of dimension $n$, so the scalar product applies in a vectorized way $N$ times.
For any loss we wish to minimize down to $0$ (and that cannot output negative values), we could minimize $\log(loss)$ instead.
Is this a good idea ? Contrary to my initial intuition, I think not : if we interpret the learning process as solving the ODE defined by the loss gradient field, then we would overshoot our particles through the optimal region instead of "calmly" converging toward it.
This is shown in the following illustrations.
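As a quick sanity check (taking the squared error $|x-y|^2$ as the loss), the gradient of the log-loss blows up near the optimum instead of vanishing, which is what causes the overshoot :
$$ ∇_x \log(|x-y|^2) = \frac{∇_x |x-y|^2}{|x-y|^2} = \frac{2(x-y)}{|x-y|^2} $$
whose norm $\frac{2}{|x-y|}$ diverges as $x→y$, whereas the plain gradient $2(x-y)$ calmly vanishes.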
show how that changes the learning process
In the following illustration we render the gradient of some $loss(x,y)$ with respect to $x$, where
⋅ $x∈ℝ²$ is the value to be optimized,
⋅ $y∈ℝ²$ is the target value (the target is controlled by the mouse).
[interactive visualization controls : grid size, amplitude, noise amplitude, noise frequency (space/time), display mode, motion length, motion speed, color (phase=|loss|), color speed]
$loss(x,y) = -∑_d \max(0,1-sgn(y_d⋅x_d))$
$loss(x,y) = ∑_d RELU(-y_d⋅x_d)$
The Mean Squared Error loss is defined as $MSE(x,y) = mean( ⟨x-y|x-y⟩ )$. If we fix $y$, so that the loss is only a function of $x$, $MSE$ can be interpreted as a spherically symmetric loss around the target $y$. Whatever the direction, if $x$ gets farther away from $y$, following the gradient will direct $x$ straight toward $y$.
mouse controls $y$, gradient (wrt $x$) for all $x$
same but this is $log(MSE(x,y))$
$CosSim(x,y) = \frac{⟨x|y⟩}{|x||y|}$ can be interpreted as "disregard the vector lengths and compute the angle between them". This can be useful to align $x$ and $y$ by maximizing $CosSim(x,y)$, which never goes above $1$.
We could minimize $1-CosSim(x,y)$.
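A minimal sketch of this loss in PyTorch, following the $ℝ^{N×n}$ convention above (the function name and the eps guard are mine ; torch.nn.functional.cosine_similarity would do the same) :
import torch

def cos_sim_loss(x, y, eps=1e-8):
    # 1 - CosSim(x, y), computed row-wise for x, y of shape (N, n)
    dot = (x * y).sum(dim=1)                               # ⟨x|y⟩ for each of the N pairs
    return 1 - dot / (x.norm(dim=1) * y.norm(dim=1) + eps)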
mouse controls $y$, gradient (wrt $x$) for all $x$
same but this is $log(1-CosSim(x,y))$
The gradient field is independent from $|y|$.
The gradient explodes when $x$ gets close to $0$.
The trajectory we get by following the gradient is of a circular form : depending on the learning algorithm, we could imagine this keeps the norm of $x$ intact.
Also, when $x$ is perfectly aligned with $y$ but pointing in the wrong direction, we get a negligible gradient.
However, in a high-dimensional setting this shouldn't be something to worry about, as the scenario becomes very unlikely
(this case covers only a 1-dimensional subspace).
$DotSim(x,y) = \frac{⟨x|y⟩}{|y||y|} = \frac{⟨x|y⟩}{⟨y|y⟩}$ can be interpreted as a cosine similarity that keeps track of length. We can see it as the mismatch of $x$ in the referential of $y$. We would like it to become $1$.
We could minimize $1-DotSim(x,y)$ but increasing the length mismatch would improve the loss : this is a problem.
Minimizing $MSE(1,DotSim(x,y))$ instead could prevent this issue.
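A matching sketch for this loss (same conventions and placeholder names as the previous snippet) :
def dot_sim(x, y, eps=1e-8):
    # DotSim(x, y) = ⟨x|y⟩ / ⟨y|y⟩, row-wise for x, y of shape (N, n)
    return (x * y).sum(dim=1) / ((y * y).sum(dim=1) + eps)

def dot_sim_loss(x, y):
    # MSE(1, DotSim(x, y)) : penalizes the mismatch of x in the referential of y
    return (1 - dot_sim(x, y)).pow(2).mean()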
mouse controls $y$, gradient (wrt $x$) for all $x$
mouse controls $x$, gradient (wrt $y$) for all $y$
I did not anticipate that the orthogonal part of $x$ with respect to $y$ contributes nothing to the loss.
Also, the way the gradient explodes when $|y|→0$ and vanishes when $|y|→∞$ isn't satisfying.
TBD : what about $DotSim(x,y)=\frac{⟨x|y⟩}{|y|}$ and $MSE(|y|,DotSim(x,y))$
$GenSim(x,y) = \frac{⟨x|y⟩}{|x|^α|y|^{2-α}}$ with $α∈[0,2]$ generalizes $DotSim$ and $CosSim$
Let's look at the gradient of $MSE(GenSim(x,y)_α,GenSim(y,y)_α)$
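A sketch of $GenSim$ and of this loss (note that $GenSim(y,y)_α = 1$ for every $α$, so the target is simply $1$ ; names are mine) :
def gen_sim(x, y, α, eps=1e-8):
    # GenSim(x, y) = ⟨x|y⟩ / (|x|^α |y|^(2-α)), row-wise for x, y of shape (N, n)
    dot = (x * y).sum(dim=1)
    return dot / (x.norm(dim=1).pow(α) * y.norm(dim=1).pow(2 - α) + eps)

def gen_sim_loss(x, y, α):
    # MSE(GenSim(x,y)_α, GenSim(y,y)_α), with GenSim(y,y)_α = 1
    return (gen_sim(x, y, α) - 1).pow(2).mean()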
[interactive slider : α]
mouse controls $y$, gradient (wrt $x$) for all $x$
mouse controls $x$, gradient (wrt $y$) for all $y$
mouse controls $y$, gradient (wrt $x$) for all $x$
mouse controls $x$, gradient (wrt $y$) for all $y$
$CosSim$ had as its minima all vectors that are aligned with $y$.
$DotSim$ had as its minima the correct solution plus the whole subspace orthogonal to $y$.
By summing both losses, we get a loss whose gradient flows into a single point.
$1 - CosSim(x,y) + MSE(x,y)$
$1 - CosSim(x,y) + MSE(1.,DotSim(x,y))$
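With the helpers sketched above, the second combined loss is just a sum (a sketch, not the exact code behind the plots) :
def combined_loss(x, y):
    # 1 - CosSim(x,y) + MSE(1, DotSim(x,y)) : the minimum is the single point x = y
    return (cos_sim_loss(x, y) + (1 - dot_sim(x, y)).pow(2)).mean()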
Here are some features that make an activation better or worse :
- Jacobian does not get too large (exploding gradient) (what about clamping ?)
- Jacobian is not 0 too often
- "bluring" the Jacobiancan can help (derivative computed from behaviour in a neighborhood)
- oscillating activations are good
- too complex activation function can reduce performance
The spirit of the problem is that we have a model made of a composition of many simple functions with lots of optimizable parameters :
$$ \blue{model(x,θ) = f_n(...f_2(f_1(x,θ_1),θ_2),...,θ_n)}$$
And we wish to optimize $θ$ to minimize a certain
$$loss(\blue{model(dataset\_in,θ)},dataset\_out)$$
To make the notation clearer, it is better to interpret this as simply minimizing one big function of $α=(x,θ)$
$$\al{
f(α) := f(x,θ) &:= loss(\blue{model(dataset\_in,θ)},dataset\_out) \\
&:= f_n(...f_2(f_1(x,θ_1),θ_2)..., θ_n)
}$$
Pretending $f$ is smooth enough and reasonably convex, a gradient-descent-like algorithm can be used to find $θ$
$$ \al{ θ^{(k+1)}_m = θ^{(k)}_m - ε \blue{\frac{d}{dθ_m} f(x;θ^{(k)})} && ∀m}$$
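A minimal sketch of this update step (model, loss_fn and the parameter list θs are placeholders ; the gradient here comes from autograd, which the next section explains) :
import torch

def gradient_step(θs, x, y, model, loss_fn, ε=1e-2):
    # one step of θ_m ← θ_m - ε ⋅ d/dθ_m loss(model(x,θ), y), for every parameter θ_m
    loss = loss_fn(model(x, θs), y)
    grads = torch.autograd.grad(loss, θs)    # all the d/dθ_m at once
    with torch.no_grad():
        for θ, g in zip(θs, grads):
            θ -= ε * g                       # plain gradient descent update
    return loss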
How do we compute the gradient in an efficient manner ? Backpropagation is a solution to that.
Let $f:ℝ^a → ℝ^b$ be a differentiable function with named argument $f(α)$
The Jacobian of $f$ at $α$ is the linear operator $J_f(α) : ℝ^a → ℝ^b$ defined as
$$ J_f(α) = \mat{\frac{∂f_i}{∂α_j}(α)}_{i,j} $$
It maps input variation to output variation
$$ J_f(α) ⋅ dα ≈ df $$
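A quick numerical check of this relation (the toy $f$ below is an arbitrary choice) :
import torch
from torch.autograd.functional import jacobian

f  = lambda α: torch.stack([α[0] * α[1], torch.sin(α[2]), α.sum()])  # toy f : ℝ³ → ℝ³
α  = torch.randn(3)
dα = 1e-4 * torch.randn(3)
J  = jacobian(f, α)                                          # matrix of partial derivatives ∂f_i/∂α_j
print(torch.allclose(J @ dα, f(α + dα) - f(α), atol=1e-6))   # J_f(α)⋅dα ≈ df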
Let's name all intermediate values in the computation of $f$
$$\al{
α_1 &= (x,θ_1) && \text{input}\\
\blue{α_{m+1}} &\blue{= (f_m(α_m),θ_{m+1})} && \text{intermediate} \\
\red{f(x,θ) = f(α) = α_{n+1} } &\red{= (f_n(α_n),∅)} && \text{output (loss) ∈ ℝ} \\
}$$
Our goal is to compute the derivative of the final output $\red{α_{n+1}}$ relative to $\blue{α_{m},∀m}$
$$\al{
\blue{\frac{d}{dα_{m}}}\red{α_{n+1}}
&= J_{α_n}^{f_n}(α_n) ⋅
J_{α_{n-1}}^{f_{n-1}}(α_{n-1}) \cdots
J_{α_{m+1}}^{f_{m+1}}(α_{m+1}) ⋅
J_{α_{m}}^{f_{m}}(α_{m}) \\
}$$
One can tell there is a lot of computation recycling to be done here !
$$\al{
\blue{\frac{d}{dα_{m}}}\red{α_{n+1}}
&= \blue{\frac{d}{dα_{m+1}}}\red{α_{n+1}} ⋅ J_{α_{m}}^{f_{m}}(α_{m})
}$$
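This recycling is exactly what reverse accumulation with vector-Jacobian products does. A toy sketch, with arbitrary layer functions standing in for the $f_m$ :
import torch
from torch.autograd.functional import vjp

fs = [torch.sin, torch.tanh, lambda a: (a ** 2).sum()]   # toy f_1, f_2, f_3 (the last outputs a scalar)

α = [torch.randn(4)]                                     # α_1
for f in fs:                                             # forward pass, storing every intermediate α_m
    α.append(f(α[-1]))

v = torch.ones_like(α[-1])                               # d α_{n+1} / d α_{n+1} = 1
for f, a in zip(reversed(fs), reversed(α[:-1])):         # backward pass : v ← v ⋅ J_{f_m}(α_m)
    _, v = vjp(f, a, v)
print(v)                                                 # equals d α_{n+1} / d α_1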
In our true use case we only need to optimize $θ$, but we still need the gradient with respect to $x$ because of backprop.
(Note that the accuracy of the gradient with respect to $x$ is more important than the one with respect to $θ$, as shown in the destroy-the-Jacobian experiment.)
$$\al{
\blue{\frac{d}{dθ_{m}} x_{n+1}}
&=
J_{\orange{x_{n }}}^{f_{n }}(x_{n },θ_{n }) ⋅
J_{\orange{x_{n-1}}}^{f_{n-1}}(x_{n-1},θ_{n-1}) \cdots
J_{\orange{x_{m+1}}}^{f_{m+1}}(x_{m+1},θ_{m+1}) ⋅
J_{\purple{θ_{m }}}^{f_{m }}(x_{m },θ_{m }) \\
}$$
where
$$J_{\orange{x_n}}^{f_n}(x,θ) = \mat{\frac{∂f_{n,i}}{∂\orange{x_{n}}_j}(x,θ)}_{i,j} $$
Indeed, the process could be rewritten as
$$\al{
x_1 &= x && \text{input}\\
\green{x_{l+1}} &\green{= f_l(x_l, θ_l)} && \text{intermediate} \\
\red{f(x;θ)} &\red{= x_{n+1}} && \text{output} \\
}$$
One can then compute
$$\al{
\blue{\frac{d}{dθ_{m}} x_{n+1}}
&= \frac{d}{dθ_{m}} \green{f_n(x_n;θ_n)} \\
\text{(chain rule)}
&= J_{x_n}^{f_n}(x_n;θ_n) \blue{\frac{d}{dθ_{m}} x_n}
+ J_{θ_n}^{f_n}(x_n;θ_n) \comment{\frac{d}{dθ_{m}} θ_n}{δ_{m,n}}\\
\text{(recurse)}
&= J_{x_n}^{f_n}(x_n;θ_n) ⋅
\blue{\left(J_{x_{n-1}}^{f_{n-1}}(x_{n-1};θ_{n-1}) {\color{cyan}\frac{d}{dθ_{m}} x_{n-1}} + J_{θ_{n-1}}^{f_{n-1}}(x_{n-1};θ_{n-1}) \comment{\frac{d}{dθ_{m}} θ_{n-1}}{δ_{m,{n-1}}} \right)} \\
\text{(recurse)}
&= J_{x_n}^{f_n}(x_n;θ_n) ⋅
J_{x_{n-1}}^{f_{n-1}}(x_{n-1};θ_{n-1}) \cdots
J_{x_{m+1}}^{f_{m+1}}(x_{m+1};θ_{m+1}) ⋅
\left(
J_{x_m}^{f_m}(x_m;θ_m) \comment{\frac{d}{dθ_m} x_m}{0} +
J_{θ_{m}}^{f_{m}}(x_{m};θ_{m})
\right) \\
}$$
import torch

class MyFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, θ):
        ctx.save_for_backward(x, θ)        # save inputs for the backward pass
        return my_function(x, θ)           # actual evaluation

    @staticmethod
    def backward(ctx, grad_out):
        if not any(ctx.needs_input_grad):  # if no grad is required, don't compute
            return None, None
        x, θ = ctx.saved_tensors           # restore inputs from the context
        _, grads = torch.autograd.functional.vjp(my_function, (x, θ), grad_out)  # grads = grad_out ⋅ J(x,θ)
        return grads                       # gradients of the loss with respect to (x, θ)
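Usage : y = MyFunc.apply(x, θ) ; autograd will then call backward with the incoming $∇_{f(x,θ)}$ as grad_out.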
In the target application, we expect the backward pass to not properly model the Jacobian of the blackbox.
To see how an imperfect Jacobian affects the learning process, here is the destroy-the-Jacobian experiment. In this experiment, the XOR classification problem is solved by a fully connected neural net made of linear layers followed by GELU. The Linear layer is modified such that we can distort the gradient computation during backpropagation.
The dataset is a Gaussian-distributed point cloud with some random misclassification. Note that as the model gets deeper, even without perturbation, we get instability.
$∇_x$, $∇_θ$ are obtained from the Jacobian combined with $x$ and $θ$. This experiment shows how the learning process is affected when
- one randomly flips the sign of $∇_x$, $∇_θ$ and/or $θ$.
- one keeps only the sign of $∇_x$, $∇_θ$ and/or $θ$.
Keeping only the sign of $∇_θ$ seems to work fine ; however, this is not true for $∇_x$.
This is probably because $∇_θ$ is not backpropagated further and acts only "locally" in the pipeline.
try intermediate values for sign keeping
The learning process looks quite robust to random sign flipping of random components of $∇_x$ and $∇_θ$. Even at a flipping probability of $p=0.45$ ($p=0.5$ being completely random and $p=1$ a constant flip), we are able to learn.
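As a rough sketch of what "distorting the gradient" can look like (the class name and the per-component flip probability p are my own placeholders, not the exact code of the experiment) :
import torch

class SignFlipLinear(torch.autograd.Function):
    # linear layer y = x @ θ whose backward pass randomly flips gradient signs
    @staticmethod
    def forward(ctx, x, θ, p):
        ctx.save_for_backward(x, θ)
        ctx.p = p                                   # probability of flipping each gradient component
        return x @ θ
    @staticmethod
    def backward(ctx, grad_out):
        x, θ = ctx.saved_tensors
        flip = lambda g: g * (1 - 2 * (torch.rand_like(g) < ctx.p).float())
        return flip(grad_out @ θ.T), flip(x.T @ grad_out), None   # distorted ∇_x, ∇_θ (None for p)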
Given a function $f(x,θ)$ we train a model $g$ such that $g(∇_{f(x,θ)},x,θ)=(∇_x,∇_θ)$.
For simplicity let's call $(x,θ)=α$
Then this becomes a function $f(α)$ with a model $g(∇_{f(α)},α)=∇_α$.
We can combine the chain rule
$$ ∇_α = ∇_{f(α)} ⋅ J_f(α) $$
with the definition of $J$
$$ J⋅dα = df $$
to obtain the following scalar relationship
$$ \comment{∇_{f(α)}⋅J}{∇_α}⋅dα = ∇_{f(α)}⋅\comment{J⋅dα}{df} $$
By sampling multiple $dα$, $α$ and $∇_{f(α)}$, we optimize our model toward $g(∇_{f(α)},α)≈∇_α$ by enforcing
$$g(∇_{f(α)},α)⋅dα ≈ ∇_{f(α)}⋅df$$
We could use the MSE loss, but cosine similarity also looks like a good candidate, though it would be unable to recover the proper gradient scale. A mix of both could be considered as well, which is discussed later.
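A minimal sketch of this training loop (the blackbox f, the surrogate g and every hyper-parameter below are placeholders, not the exact experimental setup) :
import torch

def train_g(f, g, α_dim, out_dim, dim_sample=8, steps=1000, eps=1e-3, lr=1e-3):
    # train g(∇_{f(α)}, α) ≈ ∇_α from forward evaluations of the blackbox f only
    opt = torch.optim.Adam(g.parameters(), lr=lr)
    for _ in range(steps):
        α  = torch.randn(α_dim)
        v  = torch.randn(out_dim)                         # a sampled ∇_{f(α)} (grad_out)
        dα = eps * torch.randn(dim_sample, α_dim)         # a mini-batch of perturbations
        with torch.no_grad():
            df = torch.stack([f(α + d) - f(α) for d in dα])  # finite-difference estimate of J⋅dα
        lhs = (g(v, α) * dα).sum(dim=1)                   # g(∇_{f(α)},α)⋅dα
        rhs = (v * df).sum(dim=1)                         # ∇_{f(α)}⋅df
        loss = torch.nn.functional.mse_loss(lhs, rhs)     # could also mix in 1 - cosine similarity
        opt.zero_grad(); loss.backward(); opt.step()
    return g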
The restore-the-Jacobian experiment shows how the algorithm above is able to restore the Jacobian of the simple function $f(x,θ)=θ⋅x$, where $θ$ is a matrix of shape (dim_i, dim_o). Instead of learning a fully generic $g$, we "help" the algorithm a bit by providing the structure of the expected Jacobian, which is also useful to visualize how well the algorithm is doing (in terms of representing the Jacobian, which in a way is not the real objective here).
All runs have been normalized such that
- dim_sample determines the number of dimensions sampled during a mini-batch.
- the epoch count is computed in proportion to dim_i*dim_o and dim_sample.
Given that cosine similarity is blind to scale, an "initial_messup" parameter has been introduced so that the random parameters of $J$ can be scaled at initialization.
The behavior at lower and higher dimensions looks a bit different.
It looks like, for the same dim_sample (in proportion), the random subspace sampling improvement works better in higher dimensions, which is good news for how we want to use this.
It is also worth mentioning a strange situation where the loss goes up while the sign of every component of the Jacobian improves, meaning there is a chance the estimated gradient could at least point in the right direction.