restoreJ on $f(x)=Ax$ : deep fully connected net VS Log-Lin-Exp
One can argue that the Jacobian is only a good heuristic for learning : it does not need to be modeled in detail for learning to happen.
This page proposes an approach that allows backprop through a given $f$ of unknown Jacobian, for cheaper than modeling the full Jacobian.
$∇_{λ}$ stands for the gradient of the loss with respect to $λ$
I will diverge a little from the usual convention, for readability :
| standard | me | abbreviation |
| --- | --- | --- |
| train | learn | L |
| test | test | T |
| validation | validation | V |
The function $f$ will be referred to as the blackbox.
I'd like to call the backprop model for $f$ the backbox, but that would become confusing, so let's call it ___________
The motivation for this project is to use a physical process as a source of computation, an idea known as reservoir computing. There are many approaches to training physical devices, such as :
- evolutionary selection / annealing : make random variations of the best candidates.
- differentiable surrogate : build a differentiable approximation and differentiate that instead.
The intuition for why this is worth investigating is that machine learning has shown how qualitative computing can give rise to very powerful computations. In contrast to scientific computing, many applications do not require high-precision number representations, so the in-silico emulation of arithmetic could be ditched. This could allow much simpler and more fault-tolerant computing devices.
If such a device can be built from computation performed by light propagation, one could hope for very low power consumption and high computing speed.
In our everyday life, light propagation has a very nice property : the principle of superposition.
But for an ML application, this property is terrible : we want to be able to lose the property of linearity.
$$ \text{model}(αx+βy) = α⋅\text{model}(x)+β⋅\text{model}(y) $$
However, when we consider very high-energy light sources, this principle of superposition disappears. Indeed, when the field $E$ is strong, charged particles in the material get displaced and interact with other particles. This causes non-trivial light-matter interaction, leading to non-linear behavior in the light propagation. This is excellent news for what we wish to build.
crash course on NLO (non-linear optics)
First of all, here is a visualization of what this is all about : one way to interpret machine learning is as "sculpting" a manifold parametrized by inputs to approximate a point cloud (given by a dataset).
A function with a large number of parameters can sculpt a certain family of surfaces, and the goal is to find the parameters that fit our data best. The function must also be designed so that many different surfaces are representable, and so that it is trainable in an efficient manner.
When designing a neural net, it is not easy to tell which activation to use, as we don't know in advance how the training process is going to make use of each part of the net. Empirical observations tend to show that some activations are generally better than others.
Here are some properties that are empirically tested, with some mathematical intuition, but no proofs or guarantees.
In this section we will consider $σ:ℝ→ℝ$ (or maybe $σ:ℂ→ℂ$) a $1d$ activation function that is vectorized as such
$$ σ(x)_i = σ(x_i) $$
with $x∈ℝ^N$ (or $x∈ℂ^N$)
This is what composition looks like.
$$\al{
σ(A⋅σ(x))_k
= σ\left( ∑_{i,j} A_{ij} ⋅ σ(x_j) ⋅ e_i \right)_k = σ\left( ∑_j A_{kj} ⋅ σ(x_j) \right) \\
}$$
For the intuition, I think it is good to keep in mind how components are mixed before feeding to the next layer.
$$σ(x)=∑_{k=0}^{N} λ_k x^k $$
Polynomials are famous for being bad activation functions.
Here is a list of reasons why that could be :
- compositions of polynomials are polynomials : what about non-analytic functions ?
Note that the degree increases exponentially with the number of compositions, $\deg(σ∘...∘σ) = \deg(σ^{∘k}) = \deg(σ)^k$ (maybe this could cause numerical instability as well).
- any function can be approximated with a sufficiently large fully connected net, except if the activation function is polynomial (https://doi.org/10.1016/S0893-6080(05)80131-5)
- the derivative is not bounded : the gradient could explode.
Because we use backpropagation, the activation would seem to need to be differentiable. However, that is not necessarily the case. The Jacobian is only a direction indicator : by smoothing the function for the backward pass, we can make training possible and the activation usable, or even improve learning capability.
An example would be to use a threshold function for the forward pass, but use sigmoid for the backward pass.
Another example is to use the Cantor function as an activation function : its Jacobian is $0$ almost everywhere, but if one uses the identity as a Jacobian replacement, we get better performance than with the identity activation function (tested on simple tasks such as MNIST classification and the XOR problem).
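As a minimal PyTorch sketch of the first example (hard threshold forward, sigmoid-derivative backward ; the class name and the steepness k below are my own placeholders) :
import torch

class ThresholdWithSigmoidGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, k=10.0):
        ctx.save_for_backward(x)
        ctx.k = k
        return (x > 0).float()                       # hard threshold : true Jacobian is 0 almost everywhere
    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        s = torch.sigmoid(ctx.k * x)                 # smooth stand-in for the threshold
        return grad_out * ctx.k * s * (1 - s), None  # the sigmoid's derivative replaces the true Jacobian
Usage : y = ThresholdWithSigmoidGrad.apply(x).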
One argument for very complex activation functions is that they take a tremendous amount of computation to re-simulate, which is to say that we could hope to exploit the source of computation for our own objectives. However, we should note that there is no guarantee that we can exploit that complexity, nor that the complexity is in itself useful. As an example, we could build a deterministic noise function or an arbitrary gigantic lookup table. So what are the properties we want (or want to avoid) in an activation function ?
The "jeff" activation function visualizes this problem. It looks like having a too complex activation function "constrains" us to a too specific subset of functions.
Empirical experiments using fixed Perlin noise as an activation function showed poor performance compared even to ReLU.
Monotonicity does not seem to be required. Sine looks pretty efficient.
ref for the information-storage result
the mega plot with all the activation functions
We will consider $x,y∈ℝ^{N×n}$, meaning they are lists of $N$ vectors of dimension $n$, so the scalar product applies in a vectorized way $N$ times.
For any loss we wish to minimize down to $0$ (and that cannot output negative values), we could minimize $\log(loss)$ instead.
Is this a good idea ? Contrary to my initial intuition, I think not : if we interpret the learning process as solving the ODE defined by the loss gradient field, then we would overshoot our particles through the optimal region instead of "calmly" converging toward it.
This is shown in the following illustrations.
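As a quick sanity check (taking the squared error $|x-y|^2$ as the loss), the gradient of the log-loss blows up near the optimum instead of vanishing, which is what causes the overshoot :
$$ ∇_x \log(|x-y|^2) = \frac{∇_x |x-y|^2}{|x-y|^2} = \frac{2(x-y)}{|x-y|^2} $$
whose norm $\frac{2}{|x-y|}$ diverges as $x→y$, whereas the plain gradient $2(x-y)$ calmly vanishes.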
show how that changes the learning process
In the following illustration we render the gradient of some $loss(x,y)$ with respect to $x$, where
⋅ $x∈ℝ²$ is the value to be optimized,
⋅ $y∈ℝ²$ is the target value (the target is controlled by the mouse).
[interactive visualization controls : grid size, amplitude, noise amplitude, noise frequency (space/time), display mode, motion length, motion speed, color (phase=|loss|), color speed]
$loss(x,y) = -∑_d \max(0,1-sgn(y_d⋅x_d))$
$loss(x,y) = ∑_d RELU(-y_d⋅x_d)$
The Mean Squared Error loss is defined as $MSE(x,y) = mean( ⟨x-y|x-y⟩ )$. If we fix $y$, so that the loss is only a function of $x$, $MSE$ can be interpreted as a spherically symmetric loss around the target $y$. Whatever the direction, if $x$ gets farther away from $y$, following the gradient will direct $x$ straight toward $y$.
mouse controls $y$, gradient (wrt $x$) for all $x$
same but this is $log(MSE(x,y))$
$CosSim(x,y) = \frac{⟨x|y⟩}{|x||y|}$ can be interpreted as "disregard the vector lengths and compute the angle between them". This can be useful to align $x$ and $y$ by maximizing $CosSim(x,y)$, which never goes above $1$.
We could minimize $1-CosSim(x,y)$.
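A minimal sketch of this loss in PyTorch, following the $ℝ^{N×n}$ convention above (the function name and the eps guard are mine ; torch.nn.functional.cosine_similarity would do the same) :
import torch

def cos_sim_loss(x, y, eps=1e-8):
    # 1 - CosSim(x, y), computed row-wise for x, y of shape (N, n)
    dot = (x * y).sum(dim=1)                               # ⟨x|y⟩ for each of the N pairs
    return 1 - dot / (x.norm(dim=1) * y.norm(dim=1) + eps)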
mouse controls $y$, gradient (wrt $x$) for all $x$
same but this is $log(1-CosSim(x,y))$
The gradient field is independent from $|y|$.
The gradient explodes when $x$ gets close to $0$.
The trajectory we get by following the gradient is of a circular form : depending on the learning algorithm, we could imagine this keeps the norm of $x$ intact.
Also, when $x$ is perfectly aligned with $y$ but pointing in the wrong direction, we get a negligible gradient.
However, in a high-dimensional setting this shouldn't be something to worry about, as the scenario becomes very unlikely
(this case covers only a 1-dimensional subspace).
$DotSim(x,y) = \frac{⟨x|y⟩}{|y||y|} = \frac{⟨x|y⟩}{⟨y|y⟩}$ can be interpreted as a cosine similarity that keeps track of length. We can see it as the mismatch of $x$ in the referential of $y$. We would like it to become $1$.
We could minimize $1-DotSim(x,y)$ but increasing the length mismatch would improve the loss : this is a problem.
Minimizing $MSE(1,DotSim(x,y))$ instead could prevent this issue.
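A matching sketch for this loss (same conventions and placeholder names as the previous snippet) :
def dot_sim(x, y, eps=1e-8):
    # DotSim(x, y) = ⟨x|y⟩ / ⟨y|y⟩, row-wise for x, y of shape (N, n)
    return (x * y).sum(dim=1) / ((y * y).sum(dim=1) + eps)

def dot_sim_loss(x, y):
    # MSE(1, DotSim(x, y)) : penalizes the mismatch of x in the referential of y
    return (1 - dot_sim(x, y)).pow(2).mean()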
mouse controls $y$, gradient (wrt $x$) for all $x$
mouse controls $x$, gradient (wrt $y$) for all $y$
I did not anticipate that the orthogonal part of $x$ with respect to $y$ contributes nothing to the loss.
Also, the way the gradient explodes when $|y|→0$ and vanishes when $|y|→∞$ isn't satisfying.
TBD : what about $DotSim(x,y)=\frac{⟨x|y⟩}{|y|}$ and $MSE(|y|,DotSim(x,y))$
$GenSim(x,y) = \frac{⟨x|y⟩}{|x|^α|y|^{2-α}}$ with $α∈[0,2]$ generalizes $DotSim$ and $CosSim$
Let's look at the gradient of $MSE(GenSim(x,y)_α,GenSim(y,y)_α)$
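A sketch of $GenSim$ and of this loss (note that $GenSim(y,y)_α = 1$ for every $α$, so the target is simply $1$ ; names are mine) :
def gen_sim(x, y, α, eps=1e-8):
    # GenSim(x, y) = ⟨x|y⟩ / (|x|^α |y|^(2-α)), row-wise for x, y of shape (N, n)
    dot = (x * y).sum(dim=1)
    return dot / (x.norm(dim=1).pow(α) * y.norm(dim=1).pow(2 - α) + eps)

def gen_sim_loss(x, y, α):
    # MSE(GenSim(x,y)_α, GenSim(y,y)_α), with GenSim(y,y)_α = 1
    return (gen_sim(x, y, α) - 1).pow(2).mean()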
[interactive slider : α]
mouse controls $y$, gradient (wrt $x$) for all $x$
mouse controls $x$, gradient (wrt $y$) for all $y$
mouse controls $y$, gradient (wrt $x$) for all $x$
mouse controls $x$, gradient (wrt $y$) for all $y$
$CosSim$ had as its minima all vectors that are aligned with $y$.
$DotSim$ had as its minima the correct solution plus the whole subspace orthogonal to $y$.
By summing both losses, we get a loss whose gradient flows into a single point.
$1 - CosSim(x,y) + MSE(x,y)$
$1 - CosSim(x,y) + MSE(1.,DotSim(x,y))$
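With the helpers sketched above, the second combined loss is just a sum (a sketch, not the exact code behind the plots) :
def combined_loss(x, y):
    # 1 - CosSim(x,y) + MSE(1, DotSim(x,y)) : the minimum is the single point x = y
    return (cos_sim_loss(x, y) + (1 - dot_sim(x, y)).pow(2)).mean()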
Here are some features that make an activation better or worse :
- Jacobian does not get too large (exploding gradient) (what about clamping ?)
- Jacobian is not 0 too often
- "bluring" the Jacobiancan can help (derivative computed from behaviour in a neighborhood)
- oscillating activations are good
- too complex activation function can reduce performance
The spirit of the problem is that we have a model made of a composition of many simple functions with lots of optimizable parameters :
$$ \blue{model(x,θ) = f_n(...f_2(f_1(x,θ_1),θ_2),...,θ_n)}$$
And we wish to optimize $θ$ to minimize a certain
$$loss(\blue{model(dataset\_in,θ)},dataset\_out)$$
To make the notation clearer, it is better to interpret this as simply minimizing one big function of $α=(x,θ)$
$$\al{
f(α) := f(x,θ) &:= loss(\blue{model(dataset\_in,θ)},dataset\_out) \\
&:= f_n(...f_2(f_1(x,θ_1),θ_2)..., θ_n)
}$$
Pretending $f$ is smooth enough and reasonably convex, a gradient-descent-like algorithm can be used to find $θ$
$$ \al{ θ^{(k+1)}_m = θ^{(k)}_m - ε \blue{\frac{d}{dθ_m} f(x;θ^{(k)})} && ∀m}$$
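A minimal sketch of this update step (model, loss_fn and the parameter list θs are placeholders ; the gradient here comes from autograd, which the next section explains) :
import torch

def gradient_step(θs, x, y, model, loss_fn, ε=1e-2):
    # one step of θ_m ← θ_m - ε ⋅ d/dθ_m loss(model(x,θ), y), for every parameter θ_m
    loss = loss_fn(model(x, θs), y)
    grads = torch.autograd.grad(loss, θs)    # all the d/dθ_m at once
    with torch.no_grad():
        for θ, g in zip(θs, grads):
            θ -= ε * g                       # plain gradient descent update
    return loss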
How do we compute the gradient in an efficient manner ? Backpropagation is a solution to that.
Let $f:ℝ^a → ℝ^b$ be a differentiable function with named argument $f(α)$
The Jacobian of $f$ at $α$ is the linear operator $J_f(α) : ℝ^a → ℝ^b$ defined as
$$ J_f(α) = \mat{\frac{∂f_i}{∂α_j}(α)}_{i,j} $$
It maps input variation to output variation
$$ J_f(α) ⋅ dα ≈ df $$
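A quick numerical check of this relation (the toy $f$ below is an arbitrary choice) :
import torch
from torch.autograd.functional import jacobian

f  = lambda α: torch.stack([α[0] * α[1], torch.sin(α[2]), α.sum()])  # toy f : ℝ³ → ℝ³
α  = torch.randn(3)
dα = 1e-4 * torch.randn(3)
J  = jacobian(f, α)                                          # matrix of partial derivatives ∂f_i/∂α_j
print(torch.allclose(J @ dα, f(α + dα) - f(α), atol=1e-6))   # J_f(α)⋅dα ≈ df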
Let's name all intermediate values in the computation of $f$
$$\al{
α_1 &= (x,θ_1) && \text{input}\\
\blue{α_{m+1}} &\blue{= (f_m(α_m),θ_{m+1})} && \text{intermediate} \\
\red{f(x,θ) = f(α) = α_{n+1} } &\red{= (f_n(α_n),∅)} && \text{output (loss) ∈ ℝ} \\
}$$
Our goal is to compute the derivative of the final output $\red{α_{n+1}}$ relative to $\blue{α_{m},∀m}$
$$\al{
\blue{\frac{d}{dα_{m}}}\red{α_{n+1}}
&= J_{α_n}^{f_n}(α_n) ⋅
J_{α_{n-1}}^{f_{n-1}}(α_{n-1}) \cdots
J_{α_{m+1}}^{f_{m+1}}(α_{m+1}) ⋅
J_{α_{m}}^{f_{m}}(α_{m}) \\
}$$
One can tell there is a lot of computation recycling to be done here !
$$\al{
\blue{\frac{d}{dα_{m}}}\red{α_{n+1}}
&= \blue{\frac{d}{dα_{m+1}}}\red{α_{n+1}} ⋅ J_{α_{m}}^{f_{m}}(α_{m})
}$$
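This recycling is exactly what reverse accumulation with vector-Jacobian products does. A toy sketch, with arbitrary layer functions standing in for the $f_m$ :
import torch
from torch.autograd.functional import vjp

fs = [torch.sin, torch.tanh, lambda a: (a ** 2).sum()]   # toy f_1, f_2, f_3 (the last outputs a scalar)

α = [torch.randn(4)]                                     # α_1
for f in fs:                                             # forward pass, storing every intermediate α_m
    α.append(f(α[-1]))

v = torch.ones_like(α[-1])                               # d α_{n+1} / d α_{n+1} = 1
for f, a in zip(reversed(fs), reversed(α[:-1])):         # backward pass : v ← v ⋅ J_{f_m}(α_m)
    _, v = vjp(f, a, v)
print(v)                                                 # equals d α_{n+1} / d α_1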
In our true use case we only need to optimize $θ$, but we still need the gradient with respect to $x$ because of backprop.
(Note that the accuracy of the gradient with respect to $x$ is more important than the one with respect to $θ$, as shown in the destroy-the-Jacobian experiment.)
$$\al{
\blue{\frac{d}{dθ_{m}} x_{n+1}}
&=
J_{\orange{x_{n }}}^{f_{n }}(x_{n },θ_{n }) ⋅
J_{\orange{x_{n-1}}}^{f_{n-1}}(x_{n-1},θ_{n-1}) \cdots
J_{\orange{x_{m+1}}}^{f_{m+1}}(x_{m+1},θ_{m+1}) ⋅
J_{\purple{θ_{m }}}^{f_{m }}(x_{m },θ_{m }) \\
}$$
where
$$J_{\orange{x_n}}^{f_n}(x,θ) = \mat{\frac{∂f_{n,i}}{∂\orange{x_{n}}_j}(x,θ)}_{i,j} $$
Indeed, the process could be rewritten as
$$\al{
x_1 &= x && \text{input}\\
\green{x_{l+1}} &\green{= f_l(x_l, θ_l)} && \text{intermediate} \\
\red{f(x;θ)} &\red{= x_{n+1}} && \text{output} \\
}$$
One can then compute
$$\al{
\blue{\frac{d}{dθ_{m}} x_{n+1}}
&= \frac{d}{dθ_{m}} \green{f_n(x_n;θ_n)} \\
\text{(chain rule)}
&= J_{x_n}^{f_n}(x_n;θ_n) \blue{\frac{d}{dθ_{m}} x_n}
+ J_{θ_n}^{f_n}(x_n;θ_n) \comment{\frac{d}{dθ_{m}} θ_n}{δ_{m,n}}\\
\text{(recurse)}
&= J_{x_n}^{f_n}(x_n;θ_n) ⋅
\blue{\left(J_{x_{n-1}}^{f_{n-1}}(x_{n-1};θ_{n-1}) {\color{cyan}\frac{d}{dθ_{m}} x_{n-1}} + J_{θ_{n-1}}^{f_{n-1}}(x_{n-1};θ_{n-1}) \comment{\frac{d}{dθ_{m}} θ_{n-1}}{δ_{m,{n-1}}} \right)} \\
\text{(recurse)}
&= J_{x_n}^{f_n}(x_n;θ_n) ⋅
J_{x_{n-1}}^{f_{n-1}}(x_{n-1};θ_{n-1}) \cdots
J_{x_{m+1}}^{f_{m+1}}(x_{m+1};θ_{m+1}) ⋅
\left(
J_{x_m}^{f_m}(x_m;θ_m) \comment{\frac{d}{dθ_m} x_m}{0} +
J_{θ_{m}}^{f_{m}}(x_{m};θ_{m})
\right) \\
}$$
import torch

class MyFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, θ):
        ctx.save_for_backward(x, θ)        # save inputs for the backward pass
        return my_function(x, θ)           # actual evaluation

    @staticmethod
    def backward(ctx, grad_out):
        if not any(ctx.needs_input_grad):  # if no grad is required, don't compute
            return None, None
        x, θ = ctx.saved_tensors           # restore inputs from the context
        _, grads = torch.autograd.functional.vjp(my_function, (x, θ), grad_out)  # grads = grad_out ⋅ J(x,θ)
        return grads                       # gradients of the loss with respect to (x, θ)
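Usage : y = MyFunc.apply(x, θ) ; autograd will then call backward with the incoming $∇_{f(x,θ)}$ as grad_out.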
In the target application, we expect the backward pass to not properly model the Jacobian of the blackbox.
To see how an imperfect Jacobian affects the learning process, here is the destroy-the-Jacobian experiment. In this experiment, the XOR classification problem is solved by a fully connected neural net made of linear layers followed by GELU. The Linear layer is modified such that we can distort the gradient computation during backpropagation.
The dataset is a Gaussian-distributed point cloud with some random misclassification. Note that as the model gets deeper, even without perturbation, we get instability.
$∇_x$, $∇_θ$ are obtained from the Jacobian combined with $x$ and $θ$. This experiment shows how the learning process is affected when
- one randomly flips the sign of $∇_x$, $∇_θ$ and/or $θ$.
- one keeps only the sign of $∇_x$, $∇_θ$ and/or $θ$.
Keeping only the sign of $∇_θ$ seems to work fine ; however, this is not true for $∇_x$.
This is probably because $∇_θ$ is not backpropagated further and acts only "locally" in the pipeline.
try intermediate values for sign keeping
The learning process looks quite robust to random sign flipping of random components of $∇_x$ and $∇_θ$. Even at a flipping probability of $p=0.45$ ($p=0.5$ being completely random and $p=1$ a constant flip), we are able to learn.
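As a rough sketch of what "distorting the gradient" can look like (the class name and the per-component flip probability p are my own placeholders, not the exact code of the experiment) :
import torch

class SignFlipLinear(torch.autograd.Function):
    # linear layer y = x @ θ whose backward pass randomly flips gradient signs
    @staticmethod
    def forward(ctx, x, θ, p):
        ctx.save_for_backward(x, θ)
        ctx.p = p                                   # probability of flipping each gradient component
        return x @ θ
    @staticmethod
    def backward(ctx, grad_out):
        x, θ = ctx.saved_tensors
        flip = lambda g: g * (1 - 2 * (torch.rand_like(g) < ctx.p).float())
        return flip(grad_out @ θ.T), flip(x.T @ grad_out), None   # distorted ∇_x, ∇_θ (None for p)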
Given a function $f(x,θ)$ we train a model $g$ such that $g(∇_{f(x,θ)},x,θ)=(∇_x,∇_θ)$.
For simplicity let's call $(x,θ)=α$
Then this becomes a function $f(α)$ with a model $g(∇_{f(α)},α)=∇_α$.
We can combine the chain rule
$$ ∇_α = ∇_{f(α)} ⋅ J_f(α) $$
with the definition of $J$
$$ J⋅dα = df $$
to obtain the following scalar relationship
$$ \comment{∇_{f(α)}⋅J}{∇_α}⋅dα = ∇_{f(α)}⋅\comment{J⋅dα}{df} $$
By sampling multiple $dα$, $α$ and $∇_{f(α)}$, we optimize our model toward $g(∇_{f(α)},α)≈∇_α$ by enforcing
$$g(∇_{f(α)},α)⋅dα ≈ ∇_{f(α)}⋅df$$
We could use the MSE loss, but cosine similarity also looks like a good candidate, though it would be unable to recover the proper gradient scale. A mix of both could be considered as well, which is discussed later.
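A minimal sketch of this training loop (the blackbox f, the surrogate g and every hyper-parameter below are placeholders, not the exact experimental setup) :
import torch

def train_g(f, g, α_dim, out_dim, dim_sample=8, steps=1000, eps=1e-3, lr=1e-3):
    # train g(∇_{f(α)}, α) ≈ ∇_α from forward evaluations of the blackbox f only
    opt = torch.optim.Adam(g.parameters(), lr=lr)
    for _ in range(steps):
        α  = torch.randn(α_dim)
        v  = torch.randn(out_dim)                         # a sampled ∇_{f(α)} (grad_out)
        dα = eps * torch.randn(dim_sample, α_dim)         # a mini-batch of perturbations
        with torch.no_grad():
            df = torch.stack([f(α + d) - f(α) for d in dα])  # finite-difference estimate of J⋅dα
        lhs = (g(v, α) * dα).sum(dim=1)                   # g(∇_{f(α)},α)⋅dα
        rhs = (v * df).sum(dim=1)                         # ∇_{f(α)}⋅df
        loss = torch.nn.functional.mse_loss(lhs, rhs)     # could also mix in 1 - cosine similarity
        opt.zero_grad(); loss.backward(); opt.step()
    return g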
The restore-the-Jacobian experiment shows how the algorithm above is able to restore the Jacobian of the simple function $f(x,θ)=θ⋅x$, where $θ$ is a matrix of shape (dim_i, dim_o). Instead of learning a fully generic $g$, we "help" the algorithm a bit by providing the structure of the expected Jacobian, which is also useful to visualize how well the algorithm is doing (in terms of representing the Jacobian, which in a way is not the real objective here).
All runs have been normalized such that
- dim_sample determines the number of dimensions sampled during a mini-batch.
- the epoch count is computed in proportion to dim_i*dim_o and dim_sample.
Given that cosine similarity is blind to scale, an "initial_messup" parameter has been introduced so that the random parameters of $J$ can be scaled at initialization.
The behavior at lower and higher dimensions looks a bit different.
It looks like, for the same dim_sample (in proportion), the random subspace sampling improvement works better in higher dimensions, which is good news for how we want to use this.
It is also worth mentioning a strange situation where the loss goes up while the sign of every component of the Jacobian improves, meaning there is a chance the estimated gradient could at least point in the right direction.