I was testing different activation functions to figure out what features make an activation perform well. I knew that we should
- prevent vanishing gradients (the derivative should not go to $0$ too easily, which would stop learning)
- prevent exploding gradients (the derivative should not grow wildly)
- expand the accessible function space (to leave the linear function space)
But there are many more things that could be considered. What about:
- periodic functions? with varying oscillation frequencies?
- hard-to-predict functions (such as the result of a cellular automaton, or a chaotic physics experiment)?
- a purposely falsified gradient?
- horror functions (the Cantor function and other fractals)?
- smooth noise functions (Perlin noise, or using a photo or a recording as a function)?
- noised functions (the activation is a random variable)?
- the Fourier domain, Taylor expansion terms, or other well-known transforms?
- parametric analytic functions (such as $∑_{i,j}x_i^{x_j}$ and variants that "select some terms", etc.)?
My intuition was that the function should be "hard to predict" so that it contains a lot of information to serve as a support for compression, but not too complex either (like pure noise), as I feel we should be able to parametrize it in a reasonably nice way, at least so that the gradient stays computable.
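To make this concrete, here is a minimal sketch (PyTorch assumed, module names are mine) of two candidates from the list above as drop-in activation modules: a periodic activation with a learnable oscillation frequency, and a "noised" activation whose output is a random variable around $\tanh(x)$.

```python
import torch
import torch.nn as nn

class PeriodicActivation(nn.Module):
    """sin activation with a learnable oscillation frequency."""
    def __init__(self):
        super().__init__()
        self.freq = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):
        return torch.sin(self.freq * x)

class NoisedTanh(nn.Module):
    """The activation is a random variable: tanh(x) plus Gaussian noise."""
    def __init__(self, std=0.1):
        super().__init__()
        self.std = std

    def forward(self, x):
        noise = self.std * torch.randn_like(x) if self.training else 0.0
        return torch.tanh(x) + noise

model = nn.Sequential(nn.Linear(4, 32), PeriodicActivation(),
                      nn.Linear(32, 32), NoisedTanh(), nn.Linear(32, 1))
print(model(torch.randn(8, 4)).shape)  # torch.Size([8, 1])
```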
Here are some ideas I should explore.
ℂ-valued neural network
I empirically found that using $\sin$ as an activation function often gives me pretty good results. And when we think about it, it looks a bit like "working in Fourier space", in a not very clear manner...
A very natural way to represent oscillations is complex numbers and the complex exponential; a lot of physics also becomes simpler with complex numbers (when they are not outright required, as in quantum mechanics), mainly because $e^{(a+ib)t}$ contains both the oscillation behavior and the growth behavior.
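As a rough sketch of what a ℂ-valued layer could look like (PyTorch assumed; the complex linear layer and the modReLU-style activation are just one possible choice, not a recommendation):

```python
import torch
import torch.nn as nn

class ComplexLinear(nn.Module):
    """Linear layer with complex weights, stored as two real parameter tensors."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_re = nn.Parameter(0.1 * torch.randn(out_features, in_features))
        self.w_im = nn.Parameter(0.1 * torch.randn(out_features, in_features))

    def forward(self, z):  # z: complex tensor of shape (batch, in_features)
        w = torch.complex(self.w_re, self.w_im)
        return z @ w.T

def mod_relu(z, bias=0.5):
    """Threshold the modulus, keep the phase (so the oscillation is preserved)."""
    mag = torch.abs(z)
    return torch.relu(mag - bias) * z / (mag + 1e-8)

x = torch.randn(8, 4)
z = torch.complex(x, torch.zeros_like(x))       # lift real inputs into ℂ
out = mod_relu(ComplexLinear(4, 3)(z))
print(out.shape, out.dtype)                     # torch.Size([8, 3]) torch.complex64
```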
$n$-dimensional activation
It is quite obvious that the activation function doesn't need to be $ℝ→ℝ$. Instead of picking an arbitrary $n$-dimensional activation (which may still be interesting), we could consider the following: given a set of well-known activation functions that we don't know how to choose between, we build a new activation that takes the usual input but also some extra input(s) used to select among them.
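A minimal sketch of that idea (PyTorch assumed; the module and its name are mine): an activation $ℝ^2 → ℝ$ where the first input is the usual pre-activation and the second one softly selects among a bank of known activations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectorActivation(nn.Module):
    def __init__(self):
        super().__init__()
        self.bank = [torch.relu, torch.sin, torch.tanh]   # "famous" candidates
        self.gate = nn.Linear(1, len(self.bank))          # maps selector -> logits

    def forward(self, x, s):
        # x: usual input, s: selector input, both of the same shape
        weights = F.softmax(self.gate(s.unsqueeze(-1)), dim=-1)     # (..., n_bank)
        candidates = torch.stack([f(x) for f in self.bank], dim=-1)
        return (weights * candidates).sum(dim=-1)

act = SelectorActivation()
x, s = torch.randn(8, 16), torch.randn(8, 16)
print(act(x, s).shape)  # torch.Size([8, 16])
```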
activation function optimization
We could make a neural net on top of a neural net that tries to find the best set of activation functions.
The idea is simple, yet horrible in terms of computation. We build a neural net with its usual parameters, plus parameters inside the activation function. The whole process of "initialize and train the neural net, then compute the final loss" is what we optimize with respect to the activation parameters (the second neural net on top is the one that, in turn, adjusts those parameters).
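A toy sketch of the two nested loops (PyTorch assumed; the parametric activation, the toy regression task and the naive random search standing in for the "second neural net" are all illustrative choices of mine):

```python
import torch
import torch.nn as nn

def inner_train(theta, steps=200):
    """Build and train a small net whose activation is parametrized by theta."""
    act = lambda h: theta[0] * torch.relu(h) + theta[1] * torch.sin(theta[2] * h)
    layers = nn.ModuleList([nn.Linear(1, 32), nn.Linear(32, 1)])
    opt = torch.optim.Adam(layers.parameters(), lr=1e-2)
    x = torch.linspace(-3, 3, 256).unsqueeze(1)
    y = torch.sin(3 * x)                      # toy regression target
    for _ in range(steps):
        loss = ((layers[1](act(layers[0](x))) - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# outer loop: optimize the activation parameters using the final loss as the score
best_theta, best_loss = None, float("inf")
for _ in range(10):
    theta = torch.randn(3)
    final_loss = inner_train(theta)
    if final_loss < best_loss:
        best_theta, best_loss = theta, final_loss
print(best_theta, best_loss)
```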
Fourier
An interesting aspect of the Fourier transform is that it can be computed at the speed of light with optics (a converging lens produces the Fourier transform of a coherently illuminated image in its focal plane).
I should develop why this holds and what its limits are.
"fourier"
This might be a broader question: when we look at a signal and its Fourier transform, it is usually quite easy to spot which is which. The case that would make this unclear is a Gaussian function (which is its own Fourier transform).
Sometimes it is easier to process a signal in Fourier space, and sometimes it is easier in the original space. Conceptually, if we were to write a process that makes a neural net work in Fourier space, then in the original space, several times in a row, we would want an even number of Fourier transforms overall. But if the model is deep enough, does the parity really affect the processing that much? Will it keep a strong separation between the original space and Fourier space?
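A small sketch of what alternating between the two spaces could look like (PyTorch assumed; the block design is mine): each block goes to Fourier space, applies a learnable pointwise filter, comes back, and then processes in the original space, so any stack of such blocks uses an even number of transforms.

```python
import torch
import torch.nn as nn

class FourierBlock(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.spectral = nn.Parameter(torch.ones(n // 2 + 1, dtype=torch.cfloat))
        self.spatial = nn.Linear(n, n)

    def forward(self, x):                              # x: (batch, n), real-valued
        z = torch.fft.rfft(x, dim=-1)                  # to Fourier space
        z = z * self.spectral                          # learnable pointwise filter
        x = torch.fft.irfft(z, n=x.shape[-1], dim=-1)  # back to the original space
        return torch.relu(self.spatial(x))             # original-space processing

model = nn.Sequential(FourierBlock(64), FourierBlock(64))
print(model(torch.randn(8, 64)).shape)                 # torch.Size([8, 64])
```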
blur the gradient
The gradient is a very localized variation indicator. When we do the descent, it is sometimes better to know "where we should go in general" to avoid local minima. Optimizers with momentum help with that; maybe we could do something similar by purposely lying about the value of the gradient. An example: evaluate ReLU in the forward pass, but use the derivative of the convolution of ReLU with a Gaussian in the backward pass. Though it is not clear whether that helps with the "smoothness" of the final loss.
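Here is a minimal sketch of that lying backward pass (PyTorch assumed). The forward pass is plain ReLU; the backward pass uses the derivative of ReLU convolved with a Gaussian of width $\sigma$, which works out to the Gaussian CDF $\Phi(x/\sigma)$.

```python
import math
import torch

class BlurredGradReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, sigma):
        ctx.save_for_backward(x)
        ctx.sigma = sigma
        return torch.relu(x)                     # exact ReLU in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # d/dx (ReLU * gaussian)(x) = Φ(x/σ): smooth, nonzero for negative inputs
        smooth_grad = 0.5 * (1 + torch.erf(x / (ctx.sigma * math.sqrt(2))))
        return grad_out * smooth_grad, None      # None: no gradient for sigma

x = torch.randn(5, requires_grad=True)
BlurredGradReLU.apply(x, 1.0).sum().backward()
print(x.grad)                                    # values in (0, 1), never exactly 0
```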
fake the gradient
What happens if we use sigmoid, but add $1$ to the derivative? Half of the time it would help correct back towards $0$, but it would also push some parameters to grow for no actual gain. Maybe combined with some regularization it could be useful...
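The same mechanism gives the faked sigmoid gradient (a sketch, PyTorch assumed): the true derivative $y(1-y)$ plus $1$ in the backward pass.

```python
import torch

class FakeGradSigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = torch.sigmoid(x)
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (y,) = ctx.saved_tensors
        return grad_out * (y * (1 - y) + 1.0)   # true derivative, plus 1

x = torch.randn(5, requires_grad=True)
FakeGradSigmoid.apply(x).sum().backward()
print(x.grad)  # roughly in (1.0, 1.25] instead of (0.0, 0.25]
```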
Lens
Roger Tootell did an experiment where he trained a monkey to watch a blinking target while keeping its head in a fixed position.
When the monkey was good enough at this task, he made it do the same thing again while injecting a radioactive compound that would stick to active neurons (something the active neurons would try to "eat") and then... kill the monkey. He then sliced its brain and found the exact image printed in that radioactive compound.
A very interesting observation is how the image is deformed: the center part, which is quite small, is blown up to be about the same size as the surrounding part of the image. It makes a lot of sense that where our attention is, we use more neurons.
That gives an obvious idea: we could make a parametric layer that does this "zoom in" (or rather, shrinks the crap around it), which could help with 2 things:
1. allocate resources in a smarter way
2. make it possible to understand where the neural net puts its focus
The implementation of such a thing should be quite straightforward using inverse rendering techniques such as these implementations.
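As a rough sketch (PyTorch assumed; this is a spatial-transformer-style resampling rather than a real inverse renderer, and the warp rule is mine): a learnable focus point and zoom strength build a sampling grid that magnifies the region around the focus and shrinks the periphery, and the whole thing stays differentiable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FoveaLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.center = nn.Parameter(torch.zeros(2))       # focus point in [-1, 1]^2
        self.strength = nn.Parameter(torch.tensor(1.0))  # 0 ~ identity, >0 ~ zoom

    def forward(self, img):                              # img: (B, C, H, W)
        B, _, H, W = img.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
        out_coords = torch.stack([xs, ys], dim=-1)       # (H, W, 2)
        d = out_coords - self.center
        # sample densely near the focus (zoom in), sparsely in the periphery
        radius = d.norm(dim=-1, keepdim=True) + 1e-6
        sample_coords = self.center + d * radius.pow(self.strength)
        grid = sample_coords.unsqueeze(0).expand(B, H, W, 2)
        return F.grid_sample(img, grid, align_corners=True, padding_mode="border")

img = torch.rand(1, 3, 64, 64)
print(FoveaLayer()(img).shape)                           # torch.Size([1, 3, 64, 64])
```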
architecture iteration
Iteratively refine a model by adding new operators that initially behave like the identity, then progressively amplify the non-linearity.
The neural net is seen as a computation graph that grows, following compatibility rules. Each element of the graph can be one of the following (a sketch of the growth step is given right after the list):
- operator $ℝ^m → ℝ^n$
  like linear, convolution, Fourier, maxpool, activations...
  or regularizers such as layernorm, batchnorm, dropout, noise generators...
- internal state storage (RNN)
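A minimal sketch of the growth step (PyTorch assumed; the rule is mine): a new operator is inserted as a residual block scaled by a gate initialized at $0$, so the graph computes exactly the same function right after the insertion, and training then lets the gate and the non-linearity grow.

```python
import torch
import torch.nn as nn

class GrowableBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.op = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.tensor(0.0))   # block starts as the identity
        self.slope = nn.Parameter(torch.tensor(1.0))  # PReLU-like slope, 1 = linear

    def forward(self, x):
        h = torch.where(x > 0, x, self.slope * x)     # linear while slope == 1
        return x + self.gate * self.op(h)

model = nn.Sequential(nn.Linear(8, 8))
x = torch.randn(4, 8)
before = model(x)
model = nn.Sequential(*model, GrowableBlock(8))       # grow the computation graph
assert torch.allclose(before, model(x))               # still exactly the same function
```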
Darwinian architecture iteration
Add natural selection to the previous idea: each generation chooses random additions and makes random mixtures, privileging the best-performing models.
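A toy selection loop (plain Python; the "architecture" encoding, the mutation/crossover rules and the scoring function are all stand-ins for a real train-and-evaluate run):

```python
import random

def score(arch):                       # placeholder for "train the model, return loss"
    return abs(sum(arch) - 100) + 3 * len(arch)

def mutate(arch):                      # random addition + small random tweak
    arch = arch + [random.choice([8, 16, 32])]
    i = random.randrange(len(arch))
    return arch[:i] + [max(4, arch[i] + random.choice([-8, 8]))] + arch[i + 1:]

def crossover(a, b):                   # random mixture of two parents
    cut = random.randrange(1, min(len(a), len(b)) + 1)
    return a[:cut] + b[cut:]

population = [[16], [32], [16, 16], [64]]       # each "model" is a list of layer widths
for _ in range(20):                             # each generation privileges the best
    best = sorted(population, key=score)[:3]
    population = best + [mutate(random.choice(best)) for _ in range(3)] \
                      + [crossover(*random.sample(best, 2)) for _ in range(3)]
print(sorted(population, key=score)[0])
```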
Continuous neural network
I once found a quite cool paper that suggested looking at neural networks as a discretization of a continuous phenomenon. I followed a numerical analysis course whose main idea was iterative refinement, so maybe that could allow starting with a coarse neural net and refining it as it performs better. This is still very blurry and sci-fi, but maybe there is a way to make the refinement process choose a good architecture instead of building something by hand.
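One way to make the refinement idea slightly less blurry (PyTorch assumed; the refinement rule is mine): read a residual net as an Euler discretization $x_{t+h} = x_t + h\,f(x_t)$ of a continuous process, and refine a coarse net by halving the step size and duplicating each block, which leaves the computed function roughly unchanged while doubling the resolution.

```python
import copy
import torch
import torch.nn as nn

class EulerBlock(nn.Module):
    def __init__(self, dim, h=1.0):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
        self.h = h

    def forward(self, x):
        return x + self.h * self.f(x)          # one Euler step of the continuous net

def refine(blocks):
    """Halve the step size and duplicate every block: a finer discretization."""
    new = []
    for b in blocks:
        b1, b2 = copy.deepcopy(b), copy.deepcopy(b)
        b1.h = b2.h = b.h / 2
        new += [b1, b2]
    return nn.Sequential(*new)

coarse = nn.Sequential(EulerBlock(16), EulerBlock(16))
fine = refine(coarse)                          # 4 blocks, half the step size each
x = torch.randn(8, 16)
print((coarse(x) - fine(x)).abs().max())       # small: roughly the same function
```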