backprop
Let's denote $α = (x,θ)$, where $x$ is the input part and $θ$ are the trainable parameters, and define the model
$$ f(α) = f(x;θ) := f_n(...f_2(f_1(x,θ_1),θ_2)..., θ_n)$$
By taking the last layer to be the loss, $f_n := \text{loss}$, the goal becomes to minimize $f(x;θ)$ over $θ$:
$$ θ = \argmin_{Θ} f(x;Θ)$$
If $f$ is smooth enough and reasonably convex, a gradient-descent-like algorithm is a sensible approach:
$$ \al{ θ^{(k+1)}_m = θ^{(k)}_m - ε \blue{\frac{d}{dθ_m} f(x;θ^{(k)})} && ∀m}$$
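Assuming the blue gradient can be obtained (PyTorch's autograd will provide it), the update loop is only a few lines. A minimal sketch, where f, x and the sizes are placeholders of my own:

    import torch

    # minimal gradient-descent loop for the update rule above; f(x, thetas) is assumed to run
    # the forward pass f_1..f_n (with f_n the loss) and return a scalar
    thetas = [torch.randn(10, requires_grad=True) for _ in range(3)]   # θ_1..θ_n
    eps = 1e-2
    for _ in range(100):
        loss = f(x, thetas)            # forward pass
        loss.backward()                # fills theta.grad with d f / d θ_m for every m
        with torch.no_grad():
            for theta in thetas:
                theta -= eps * theta.grad
                theta.grad = None      # reset for the next step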
Expanding the blue gradient with the chain rule (derived below) turns it into a product of Jacobians:
$$\al{
\blue{\frac{d}{dθ_{m}} x_{n+1}}
&=
J_{\orange{x_{n }}}^{f_{n }}(x_{n },θ_{n }) ⋅
J_{\orange{x_{n-1}}}^{f_{n-1}}(x_{n-1},θ_{n-1}) \cdots
J_{\orange{x_{m+1}}}^{f_{m+1}}(x_{m+1},θ_{m+1}) ⋅
J_{\purple{θ_{m }}}^{f_{m }}(x_{m },θ_{m }) \\
}$$
where
$$J_{\orange{x_n}}^{f_n}(x,θ) = \mat{\frac{∂f_{n,i}}{∂\orange{x_{n}}_j}(x,θ)}_{i,j} $$
To see this, the forward pass can be written as
$$\al{
x_1 &= x && \text{input}\\
\green{x_{l+1}} &\green{= f_l(x_l, θ_l)} && \text{intermediate} \\
\red{f(x;θ)} &\red{= x_{n+1}} && \text{output} \\
}$$
From there, one can compute
$$\al{
\blue{\frac{d}{dθ_{m}} x_{n+1}}
&= \frac{d}{dθ_{m}} \green{f_n(x_n;θ_n)} \\
\text{(chain rule)}
&= J_{x_n}^{f_n}(x_n;θ_n) \blue{\frac{d}{dθ_{m}} x_n}
+ J_{θ_n}^{f_n}(x_n;θ_n) \comment{\frac{d}{dθ_{m}} θ_n}{δ_{m,n}}\\
\text{(recurse)}
&= J_{x_n}^{f_n}(x_n;θ_n) ⋅
\blue{\left(J_{x_{n-1}}^{f_{n-1}}(x_{n-1};θ_{n-1}) {\color{cyan}\frac{d}{dθ_{m}} x_{n-1}} + J_{θ_{n-1}}^{f_{n-1}}(x_{n-1};θ_{n-1}) \comment{\frac{d}{dθ_{m}} θ_{n-1}}{δ_{m,{n-1}}} \right)} \\
\text{(recurse)}
&= J_{x_n}^{f_n}(x_n;θ_n) ⋅
J_{x_{n-1}}^{f_{n-1}}(x_{n-1};θ_{n-1}) \cdots
J_{x_{m+1}}^{f_{m+1}}(x_{m+1};θ_{m+1}) ⋅
\left(
J_{x_m}^{f_m}(x_m;θ_m) \comment{\frac{d}{dθ_m} x_m}{0} +
J_{θ_{m}}^{f_{m}}(x_{m};θ_{m})
\right) \\
}$$
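As a sanity check, the Jacobian-product formula can be compared against autograd on a tiny two-layer example (the layers f1, f2 below are toy choices of mine, not from the text):

    import torch

    def f1(x, t): return torch.tanh(x * t)     # first layer, parameters θ_1 = t
    def f2(x, t): return (x * t).sum()         # last layer, plays the role of the loss, parameters θ_2 = t

    x  = torch.randn(3)
    t1 = torch.randn(3, requires_grad=True)
    t2 = torch.randn(3, requires_grad=True)

    x1 = x                                     # forward pass, keeping intermediates
    x2 = f1(x1, t1)
    x3 = f2(x2, t2)

    g_auto, = torch.autograd.grad(x3, t1)      # d x_3 / d θ_1 via autograd

    # same gradient as J_x^{f_2}(x_2;θ_2) ⋅ J_θ^{f_1}(x_1;θ_1)
    J_x_f2 = torch.autograd.functional.jacobian(lambda u: f2(u, t2), x2)   # shape (3,)
    J_t_f1 = torch.autograd.functional.jacobian(lambda t: f1(x1, t), t1)   # shape (3, 3)
    print(torch.allclose(g_auto, J_x_f2 @ J_t_f1))                         # expected: True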
Instead of targeting $θ$ specifically, let's write down the gradient with respect to $α$ to keep things lighter. This causes no problem: we simply ignore the components we don't want to optimize.
The process could be rewritten as
$$\al{
α_1 &= (x,θ_1) && \text{input}\\
\green{α_{l+1}} &\green{= (f_l(α_l),θ_{l+1})} && \text{intermediate} \\
\red{f(x;θ)} &\red{= f(α) = α_{n+1} = (f_n(α_n),∅)} && \text{output} \\
}$$
$$\al{
\blue{\frac{d}{dα_{m}} α_{n+1}}
&= J_{α_n}^{f_n}(α_n) ⋅
J_{α_{n-1}}^{f_{n-1}}(α_{n-1}) \cdots
J_{α_{m+1}}^{f_{m+1}}(α_{m+1}) ⋅
J_{α_{m}}^{f_{m}}(α_{m}) \\
}$$
One can tell there is a lot of computation to be recycled here!
$$\al{
\blue{\frac{dα_{n+1}}{dα_{m}}}
&= \blue{\frac{dα_{n+1}}{dα_{m+1}}} ⋅ J_{α_{m}}^{f_{m}}(α_{m})
}$$
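Spelled out by hand, this is one vector-Jacobian product per layer, each reusing the previous result. A sketch, assuming alphas is the list of saved forward values (with alphas[l+1] = layers[l](alphas[l]) and the last one being the scalar loss); both names are mine:

    import torch

    grad = torch.ones_like(alphas[-1])         # cotangent at the (scalar) output α_{n+1}
    grads = [None] * len(layers)
    for l in reversed(range(len(layers))):
        # recycling step: grad ← grad ⋅ J_α^f(α) for this layer
        _, grad = torch.autograd.functional.vjp(layers[l], alphas[l], grad)
        grads[l] = grad                        # = dα_{n+1}/dα at this layer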
This recycling is expressed in PyTorch as follows: each custom autograd Function receives the already-accumulated grad_out in its backward and only has to multiply it by its own Jacobian.
class MyFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)                # save input for backward pass
        return my_function(x)                   # actual evaluation
    @staticmethod
    def backward(ctx, grad_out):
        if not ctx.needs_input_grad[0]: return None            # if grad not required, don't compute
        x, = ctx.saved_tensors                                  # restore input from context
        _, grad_input = torch.autograd.functional.vjp(my_function, x, grad_out)   # grad_input = grad_out ⋅ J(x)
        return grad_input                                       # gradient of the loss w.r.t. x
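A possible usage sketch (my_function is assumed to be defined elsewhere and differentiable, which is what makes gradcheck applicable here):

    x = torch.randn(5, dtype=torch.double, requires_grad=True)
    y = MyFunc.apply(x)                            # forward goes through MyFunc.forward
    y.sum().backward()                             # backward goes through MyFunc.backward
    print(x.grad)

    torch.autograd.gradcheck(MyFunc.apply, (x,))   # numerically checks the custom backward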
idea
In the case of reservoir computing, part of $f$ is a physical device (physical_driver in the code below): it is given and cheap to evaluate, but we don't have its derivative.
Thus it is useful to introduce a surrogate model for it, so that backpropagation can still happen.
Option 1: we build a differentiable model, approximation, of how physical_driver maps inputs to measurements, and train it so that approximation(input) ≈ physical_driver(input).
class Option1(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return physical_driver(x)               # real measurement in the forward pass
    @staticmethod
    def backward(ctx, grad_out):
        if not ctx.needs_input_grad[0]: return None
        x, = ctx.saved_tensors
        _, grad_input = torch.autograd.functional.vjp(approximation, x, grad_out)   # backward goes through the surrogate
        return grad_input
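A possible training sketch for the surrogate; the architecture, sample_inputs, d_in/d_out and hyper-parameters are placeholders of mine:

    import torch

    approximation = torch.nn.Sequential(
        torch.nn.Linear(d_in, 128), torch.nn.Tanh(), torch.nn.Linear(128, d_out))
    opt = torch.optim.Adam(approximation.parameters(), lr=1e-3)
    for _ in range(n_steps):
        x = sample_inputs()                              # inputs the device will actually see
        with torch.no_grad():
            y = physical_driver(x)                       # measurement from the device
        loss = torch.nn.functional.mse_loss(approximation(x), y)
        opt.zero_grad(); loss.backward(); opt.step()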
Option 2: we build a model J that maps an input to the Jacobian (all the derivatives) of physical_driver at that input, and train it so that J(x) ⋅ ε ≈ (physical_driver(x+ε) - physical_driver(x-ε)) / 2.
class Option2(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return physical_driver(x)
    @staticmethod
    def backward(ctx, grad_out):
        if not ctx.needs_input_grad[0]: return None
        x, = ctx.saved_tensors
        return torch.einsum('bi,bij->bj', grad_out, J(x))   # grad_out ⋅ J(x)
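A possible training sketch for J, using the central-difference relation above (J is assumed to be an nn.Module returning a (batch, d_out, d_in) tensor; names and hyper-parameters are mine):

    import torch

    opt = torch.optim.Adam(J.parameters(), lr=1e-3)
    for _ in range(n_steps):
        x   = sample_inputs()                                          # (b, d_in)
        eps = 1e-2 * torch.randn_like(x)                               # small random probe
        with torch.no_grad():
            target = (physical_driver(x + eps) - physical_driver(x - eps)) / 2   # ≈ J(x) ⋅ ε
        pred = torch.einsum('bij,bj->bi', J(x), eps)                   # J(x) ⋅ ε
        loss = torch.nn.functional.mse_loss(pred, target)
        opt.zero_grad(); loss.backward(); opt.step()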
Option 3: we avoid representing the full Jacobian $J$ (it could be somewhat sparse) and directly learn the vector-Jacobian product J(x, grad_out) ≈ grad_out.T ⋅ J(x).
class Option3(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return physical_driver(x)
    @staticmethod
    def backward(ctx, grad_out):
        if not ctx.needs_input_grad[0]: return None
        x, = ctx.saved_tensors
        return J(x, grad_out)   # here is the difference with Option 2
However, how do you train this? Given that J(x, grad_out) ≈ grad_out.T ⋅ J(x), the following could make sense:
J(x, grad_out) ⋅ ε ≈ grad_out.T ⋅ (physical_driver(x+ε) - physical_driver(x-ε)) / 2
But now you need to choose grad_out. An obvious choice is to take all the vectors of the canonical basis. Another option is to just use whatever grad_out happens to show up during a training session on an actual problem; this way the model gets more accurate exactly where the device is being evaluated.
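A possible training sketch combining these pieces (sample_grad_outs could return canonical basis vectors or grad_outs recorded from a real run; J is assumed to be an nn.Module taking (x, g), the other names are placeholders of mine):

    import torch

    opt = torch.optim.Adam(J.parameters(), lr=1e-3)
    for _ in range(n_steps):
        x   = sample_inputs()                                          # (b, d_in)
        g   = sample_grad_outs()                                       # (b, d_out)
        eps = 1e-2 * torch.randn_like(x)
        with torch.no_grad():
            diff   = (physical_driver(x + eps) - physical_driver(x - eps)) / 2   # ≈ J(x) ⋅ ε
            target = torch.einsum('bi,bi->b', g, diff)                           # ≈ g.T ⋅ J(x) ⋅ ε
        pred = torch.einsum('bj,bj->b', J(x, g), eps)                            # (g.T ⋅ J(x)) ⋅ ε
        loss = torch.nn.functional.mse_loss(pred, target)
        opt.zero_grad(); loss.backward(); opt.step()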
We might take some symmetries into account:
$$\al{
J(x,λg) &= λJ(x,g) \\
J(x,g_1)+J(x,g_2) &= J(x,g_1+g_2)
}$$
Maybe knowing something about how $J(x,⋅)$ is sparse could suggest an expected structure for $J$.
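One way to exploit these constraints, purely as an assumption on my part, is to bake the linearity in grad_out into the architecture, e.g. a low-rank ansatz J(x, g) = g ⋅ U(x) ⋅ V(x), which is linear in g by construction and never materializes the full (d_out, d_in) Jacobian:

    import torch

    class LowRankVJP(torch.nn.Module):
        # hypothetical structure: J(x) ≈ U(x) V(x) of rank r, so J(x, g) = g ⋅ U(x) ⋅ V(x)
        def __init__(self, d_in, d_out, r=16):
            super().__init__()
            self.U = torch.nn.Sequential(torch.nn.Linear(d_in, 64), torch.nn.Tanh(), torch.nn.Linear(64, d_out * r))
            self.V = torch.nn.Sequential(torch.nn.Linear(d_in, 64), torch.nn.Tanh(), torch.nn.Linear(64, r * d_in))
            self.d_in, self.d_out, self.r = d_in, d_out, r
        def forward(self, x, g):
            U = self.U(x).view(-1, self.d_out, self.r)
            V = self.V(x).view(-1, self.r, self.d_in)
            return torch.einsum('bi,bir,brj->bj', g, U, V)   # linear in g by construction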