Motivation
The availability of GPUs and the rise of neural networks have thawed the AI winter for quite a while now.
Most state-of-the-art solutions to signal processing, natural language processing, computer graphics, physics simulation, animation, and modeling in general are neural-net based.
The principle of a neural network is quite simple: it is inspired by the wiring of neurons in a brain.
This computing element can be tuned so that a given global input produces a desired global output.
This description is quite loose, and indeed there are many possibilities.
A realistic approach
Let $χ(t)∈ℝ^k$ be the time-dependent global input. Let $n∈ℕ$ be the (discrete and finite) number of neurons, each defined by an operator $σ(χ(t),x,p(t),t)$, with $x(t)∈ℝ^n$ the input coming from the graph connections, $p(t)∈ℝ^m$ the internal state, and $t$ an explicit dependency on time. We can probe any subset of $x$ to get a global output. Finally, we define the system evolution with the following differential equation $$\ca{ σ(χ(t),x,p_1(t),t) = x_1(t) \\ σ(χ(t),x,p_2(t),t) = x_2(t) \\ ... \\ σ(χ(t),x,p_n(t),t) = x_n(t) \\ learning(χ(t),x(t),p,t) = 0 \\ }$$ where $learning$ and $σ$ are operators on some appropriate function space. At some point it may be reasonable to set $learning(χ(t),x(t),p,t)=\dot{p}(t)$, so that the last equation becomes $\dot{p}(t)=0$: the learning process stops and we keep a somewhat reliably reproducible neural net.
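Below is a tiny numerical sketch of this system. Everything specific in it is an assumption made just for illustration: $σ$ is a tanh of a weighted sum, the operator equations are relaxed into $\dot{x}_i = σ(χ(t),x,p_i,t) - x_i$, the parameters are frozen ($\dot{p}=0$), and the integration is a plain explicit Euler scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 8, 3                      # number of neurons, dimension of the global input chi
W = rng.normal(size=(n, n)) / n  # internal parameters p: one weight row per neuron
V = rng.normal(size=(n, k))      # how the global input chi(t) is injected into each neuron

def sigma(chi_t, x, w, v):
    # illustrative neuron operator: tanh of (graph input + global input)
    return np.tanh(w @ x + v @ chi_t)

def chi(t):
    # some time-dependent global input
    return np.array([np.sin(t), np.cos(2 * t), 1.0])

# assumption: relax sigma(chi, x, p_i, t) = x_i into dx_i/dt = sigma(...) - x_i
x = np.zeros(n)
dt = 1e-2
for step in range(2000):
    t = step * dt
    dx = np.array([sigma(chi(t), x, W[i], V[i]) for i in range(n)]) - x
    x = x + dt * dx              # explicit Euler step; parameters frozen (dot p = 0)

print("probed output (first 3 neurons):", x[:3])
```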
Let's talk about the elephant in the room: this time dependency, even if it looks realistic, seems too complicated to compute, and it would require us to reset $x$ to some fixed values to make the neural net behave like a pure (stateless) function, assuming that $p$ is fixed. Moreover, one would need to specify "when" to measure the output.
mention the continuous version where there is a "continuous amount" of neurons
A more pragmatic choice would be to introduce time independence, and simplify neurons to $$σ:\{-1,+1\}^n×ℝ^n→\{-1,+1\}$$ $$σ(x,p) = \ca{ +1 & \text{if }∑_ip_ix_i\geq0 \\ -1 & \text{if }∑_ip_ix_i<0 \\ }$$ where $x∈\{-1,+1\}^n$ is the input and $p∈ℝ^n$ the neuron's internal parameters. This neuron was used in Hopfield networks, an old kind of neural network that could restore degraded information.
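As an illustration, here is a minimal sketch of this threshold neuron and the classic Hopfield-style use of it: a pattern is stored with the textbook Hebbian outer-product rule, and the network restores it from a corrupted copy. The storage rule and update schedule are the standard ones, picked only for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigma(x, p):
    # the threshold neuron above: +1 if the weighted sum is >= 0, else -1
    return 1 if p @ x >= 0 else -1

# store one pattern with the Hebbian outer-product rule (zero diagonal)
pattern = rng.choice([-1, 1], size=32)
P = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(P, 0.0)

# corrupt a few entries, then let the network restore them
x = pattern.copy()
x[:6] *= -1
for _ in range(5):                       # a few synchronous update sweeps
    x = np.array([sigma(x, P[i]) for i in range(len(x))])

print("recovered:", np.array_equal(x, pattern))
```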
This is way simpler to compute, but by discretizing this much it becomes difficult to formulate an optimization process.
The most common approach is to design the graph of neuron connections, with neurons defined by some somewhat smooth function $σ$, and then optimize the neurons' internal parameters using gradient descent on a $loss$ function: $$ \argmin_{p}loss(y,m(x,p))$$ where $m(x,p)$ is the function that represents the output of our model given an input $x$ and internal parameters $p$.
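Here is a minimal sketch of this recipe, assuming a toy model $m(x,p)$ (one hidden tanh layer) and a squared-error loss; the gradients are written out by hand instead of using an autodiff library, just to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(2)

# toy data: y = sin(x), to be fitted by m(x, p)
x = np.linspace(-3, 3, 200)[:, None]
y = np.sin(x)

# m(x, p): one hidden tanh layer, p = (W1, b1, W2, b2)
W1, b1 = rng.normal(size=(1, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

lr = 1e-2
for step in range(5000):
    h = np.tanh(x @ W1 + b1)           # hidden activations
    pred = h @ W2 + b2                 # m(x, p)
    err = pred - y                     # d loss / d pred, up to a constant factor
    # backpropagate the squared-error loss by hand
    gW2 = h.T @ err / len(x)
    gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h**2)
    gW1 = x.T @ dh / len(x)
    gb1 = dh.mean(axis=0)
    # gradient descent step on argmin_p loss(y, m(x, p))
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

print("final MSE:", float(np.mean((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2)))
```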
It was observed that a very large variety of graphs and a large variety of $σ$ functions work:
- more neurons = better performance
- shaping the neuron graph to match the input's structure helps (convolutional nets for images/sound because of locality in space/time, notion of topology)
- non-linearity of $σ$ boosts performance (ReLU is cheap and popular, GELU is very effective)
- even noisy random functions do work
my criteria for activation functions
- repeatable (evaluating it twice gives the same value; cf. RReLU is crap)
- complex enough (don't be linear or polynomial, so be messy)
- differentiable (fractal example, floor(x) example): not actually needed, a loose approximation can be better (ReLU+GELU experiment)
- range (bounded? not bounded?)
- monotonicity is not required
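To make these criteria concrete, here is a small sketch of a few candidate activations: ReLU (cheap, not differentiable at 0), the usual tanh approximation of GELU (smooth and effective), and a deliberately messy but deterministic function that is only meant to illustrate the criteria above, not taken from any reference.

```python
import numpy as np

def relu(x):
    # cheap and popular; not differentiable at 0, which is fine in practice
    return np.maximum(x, 0.0)

def gelu(x):
    # common tanh approximation of GELU: smooth and very effective
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def messy_but_repeatable(x):
    # non-monotonic, non-polynomial, only piecewise differentiable... but deterministic:
    # evaluating it twice on the same input gives the same value
    return np.sin(3.0 * x) + 0.3 * np.floor(x) + relu(x - 1.0)

x = np.linspace(-3, 3, 7)
print(relu(x))
print(gelu(x))
print(np.array_equal(messy_but_repeatable(x), messy_but_repeatable(x)))  # repeatable: True
```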
This brings us to the main motivation of this project. Most computation choices in neural networks are arbitrary. The strategy is quite robust to many changes. What if we replaced the support of computation? In fact, as long as we can formulate an optimization problem, there is no real need for precise arithmetic emulated on silicon. By exploring the option of using physical phenomena directly as a support of tunable computation, there is a chance that we can beat CPUs, GPUs, XPUs etc., at the cost of going back to analog systems and unrepeatable computation imprecision.
Here is a list of why this is interesting :
- potential computation speed gain
- potential energy consumption reduction (no need for a layer of arithmetic emulation)
- is a completely passive computing device possible ?
Here is a somewhat fanciful idea: a person draws a digit and projects it with a beamer into a piece of glass that looks like a complex lens. On the other side, we get an array of dots, labeled from 0 to 9, and the dot corresponding to the drawn digit lights up. The computation would happen at the speed of light! We would have optimized parameters in Maxwell's equations so that the universe solves the problem for us.
However this idea is a bit too whimsical: the superposition principle of linear optics tells us that we can treat light sources separately, as there is no light-light interaction. This is a problem. Indeed, as shown previously, a fully linear neural network won't perform very well, since most problems aren't linear. Thankfully, high-energy light can interact with matter and thus produce non-linear interactions. Non-linear optics is a starting point to allow non-linear computation to happen quickly.
Photonic/quantum computing might be somewhat relevant, but in our context we only consider a physical phenomenon as a support of computation. The goal is not to make a sophisticated physical model, but only to exploit naturally occurring computation to our benefit.
https://www.nature.com/articles/s41467-022-31134-5 photonic people may be working on low power NLO
https://books.google.ch/books?hl=en&lr=&id=yfLIg6E69D8C&oi=fnd&pg=PR9&ots=G0NrGSr2lG&sig=L5MRkzudmU9M5t0CC1-CGWBdJ5s&redir_esc=y#v=onepage&q&f=false
a general approach
Let's consider $E≈ℝ^n$ or $ℂ^n$ the set of all possible states,
$P$ the set of all possible internal parameters,
$φ:E×P→E$ a process that transforms any given state into another one.
$φ$ is imposed by physics.
Why not $φ:E×P→F$? In fact we could work with that; however, having the output space be the same as the input space allows us to chain $φ$ with itself, and thus build arbitrarily complex functions. We only need to be careful that $φ$ behaves in a nice way: indeed, we don't want composition to be trivial, otherwise the process would be pointless.
For example, let $φ$ be defined by $φ(x,θ) = \max(θ,x)$; then composition would give
$φ(...φ(φ(φ(x,θ_0),θ_1),θ_2)...,θ_n) = \max(θ_0,θ_1,...,θ_n,x) = \max(θ_ξ,x)$ where $θ_ξ := \max(θ_0,θ_1,...,θ_n)$. So we only need that specific layer $ξ$, and the rest is useless computation.
From this, we split $E$ in two parts: a portion is used to store the internal parameters $p$, and the rest is used for the input $x$, so that $(x,p)∈E$. Now that we have an optimizable module, we can build many of them and connect them to form a large neural network. If $φ$ is quick to compute and encoding information in $E$ is quick, then we have a setup that can compute something non-trivial quickly.
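A minimal sketch of this splitting-and-chaining idea, with a made-up $φ$ (a fixed random mixing followed by a pointwise non-linearity) standing in for the physics; the point is the split of $E$ into an input part and a parameter part, and the chaining of several such modules.

```python
import numpy as np

rng = np.random.default_rng(3)

dim_x, dim_p = 8, 8                     # how we split E ~ R^16: inputs | parameters
n_layers = 4

# phi is imposed by physics; here a stand-in: fixed orthogonal mixing + non-linearity
mix = np.linalg.qr(rng.normal(size=(dim_x + dim_p, dim_x + dim_p)))[0]

def phi(x, p):
    state = np.concatenate([x, p])      # (x, p) in E
    out = np.tanh(mix @ state)          # the "physical" process, not under our control
    return out[:dim_x]                  # read back the input-sized part of the new state

def model(x, params):
    # chain phi with itself, one tunable parameter block per module
    for p in params:
        x = phi(x, p)
    return x

params = [rng.normal(size=dim_p) for _ in range(n_layers)]  # what we would optimize
x = rng.normal(size=dim_x)
print("output:", model(x, params)[:3])
```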
However, being able to evaluate is not enough: we need an algorithm to optimize.
As suggested in
We go a step further by directly modeling the Jacobian of $φ$, as this is the actual reason we need the rough model in the first place.
demo with and without jacobian
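A sketch of what such a demo could look like, under stand-in assumptions: phi_true is a made-up "physical" process (a smooth map plus a small ripple), and jacobian_model is a rough, directly modeled Jacobian of the smooth part only. The forward pass always goes through the real process; only the gradient comes from the rough Jacobian.

```python
import numpy as np

rng = np.random.default_rng(4)

dim = 6
A = rng.normal(size=(dim, dim)) / np.sqrt(dim)
B = rng.normal(size=(dim, dim)) / np.sqrt(dim)

def phi_true(x, p):
    # stand-in for the physical process: a smooth part plus a messy ripple
    z = A @ x + B @ p
    return np.tanh(z) + 0.05 * np.sin(7.0 * z)

def jacobian_model(x, p):
    # rough model of d(phi)/dp only: the Jacobian of the smooth part, ripple ignored
    z = A @ x + B @ p
    return (1.0 - np.tanh(z) ** 2)[:, None] * B

x = rng.normal(size=dim)
target = rng.uniform(-0.5, 0.5, size=dim)
p = np.zeros(dim)

lr = 0.1
for step in range(1000):
    out = phi_true(x, p)                                  # forward pass: the real process
    grad = 2.0 * jacobian_model(x, p).T @ (out - target)  # backward pass: the rough Jacobian
    p -= lr * grad

print("final loss:", float(np.sum((phi_true(x, p) - target) ** 2)))
# "without Jacobian" would mean estimating this gradient by finite differences on
# phi_true, i.e. dim extra forward evaluations per step instead of none.
```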
physical implementation
The idea of using an arbitrary process as a source of computation isn't new: in fact, the first computers were analog.
The first example we consider is a high-energy short pulse that is shaped and sent into a non-linear medium, from which we get an output pulse to be measured. Let's detail a bit how this works.
Here is the setup:
1. a device generates a light pulse
2. the pulse goes through a pulse shaper
3. the shaped pulse goes through a bit of non-linear medium such as a crystal, a fiber or a waveguide
4. the output is measured
Here is the setup of the shaper:
1. the pulse is diffracted (spectrally spread) with a grating
2. a lens or properly shaped mirror makes the beam parallel
3. the spread pulse goes through a grid of LCD-like cells, each with a different, tunable refractive index $n$
4. a lens refocuses the beam onto a second grating
5. this grating "cancels out" what the first grating did, so we get a thin beam again.
In this case, $E$ is the spectrum of our light pulse together with the parameters we set in the pulse shaper, and $φ$ is the operation that takes a pulse with a certain spectrum and returns a new spectrum.
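Here is a toy numerical version of this $φ$, with made-up stand-ins for each stage: the shaper is a tunable spectral phase mask applied in the Fourier domain, and the non-linear medium is reduced to a single instantaneous Kerr-like phase proportional to intensity. Real propagation would need something like a (nonlinear) Schrödinger equation solver; this is only meant to show the spectrum-in, spectrum-out structure of $φ$.

```python
import numpy as np

rng = np.random.default_rng(5)

N = 256
t = np.linspace(-10.0, 10.0, N)          # time grid (arbitrary units)
pulse = np.exp(-t**2) + 0j               # input pulse envelope

def phi(spectrum, mask_phase, kerr_strength=5.0):
    # 1-2. grating + lens: here the FFT plays the role of spreading the spectrum
    shaped = spectrum * np.exp(1j * mask_phase)       # 3. tunable LCD-like phase mask
    field = np.fft.ifft(shaped)                       # 4-5. recombine into a thin beam
    # non-linear medium, reduced to an instantaneous Kerr-like phase ~ intensity
    field = field * np.exp(1j * kerr_strength * np.abs(field) ** 2)
    return np.fft.fft(field)                          # measured output spectrum

spectrum_in = np.fft.fft(pulse)
mask = rng.uniform(-np.pi, np.pi, size=N)             # the tunable parameters p
spectrum_out = phi(spectrum_in, mask)
print("output spectrum, first bins:", np.abs(spectrum_out[:4]))
```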
This device has the following issues:
- it's not very energy efficient
- even if execution time is very short, we need time between pulses, and time to encode information in the shaper.
- it's unclear whether we can chain the device with another copy of itself without a measurement in between
- measurement will destroy the signal
- we can't measure everything
Composition is an issue, for two reasons:
1. the pulse quickly stops interacting with its environment
2. we can't measure everything, meaning we can only partially know which element of $E$ we are getting from $φ$.
This means that many different $x∈E$ are indistinguishable to us, and running them through $φ$ might become unpredictable.