Here is a quick analysis of how the depth of the auto-encoder influences its performance.
Only the plain reconstruction loss is used here, meaning none of the adversarial, bijectivity, and smoothness losses are applied.
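For reference, with the extra losses disabled the training objective reduces to the mean-squared reconstruction error mentioned below. A minimal sketch (hypothetical names, assuming numpy batches; not the exact code used in these experiments):

```python
import numpy as np

def reconstruction_mse(x, x_hat):
    """Plain auto-encoder objective: mean squared error between the
    input batch x and its reconstruction x_hat = decoder(encoder(x))."""
    return np.mean((x - x_hat) ** 2)

# A perfect reconstruction gives zero loss; a constant offset of 0.5
# gives a squared error of 0.25 on every pixel.
x = np.array([[0.0, 1.0], [1.0, 0.0]])
print(reconstruction_mse(x, x))        # 0.0
print(reconstruction_mse(x, x + 0.5))  # 0.25
```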
Because the smaller the minibatch, the more gradient-descent steps there are per epoch,
the training has been "renormalized" such that the total gradient per epoch is proportional to the minibatch size
(alternatively, I could have increased/decreased max-epochs to keep the iteration counts comparable).
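One simple way to implement this kind of renormalization (a sketch under my reading of it, not the exact code used here) is to scale the learning rate linearly with the minibatch size, so fewer-but-larger steps per epoch produce roughly the same total update as more-but-smaller ones:

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear learning-rate scaling: with k times larger minibatches
    there are k times fewer steps per epoch, so each step is made
    k times larger to keep the total update per epoch roughly constant.
    (base_lr / base_batch_size are hypothetical tuning reference points.)"""
    return base_lr * batch_size / base_batch_size

# Example: base_lr tuned for minibatches of 32.
print(scaled_lr(1e-3, 32, 64))  # twice the batch -> twice the step size
print(scaled_lr(1e-3, 32, 16))  # half the batch  -> half the step size
```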
These images are only auto-encoded images from the training set; nothing is generated here.
A counter-intuitive observation is that with a deeper model and a smaller batch size, we get mode collapse in the auto-encoder.
This is probably a local minimum where fitting one sample perfectly gives a low MSE, and nearby images are close enough
that fitting them better would not improve their MSE by much, while it would degrade the overfitted sample too much.
Comparing deeper and shallower fully connected networks, we see that in the deeper model, in addition to the mode
collapse, the images are blurrier.
Because of these observations, I would keep the fully connected part shallower.
On the convolutional part, it seems like the deeper model gives better results.
However, using the full dataset as a single minibatch seems to slow down convergence,
and making the minibatch too small slows it down as well.
For the small-batch case, my intuition is that on too few samples, each step overfits a small population, and these
corrections are cancelled out on the next minibatch; for the large-batch case, I don't have a good intuition.
I should investigate why this happens.