Last summer, when I took the machine learning MOOC by Andrew Ng on Coursera, I was left with some unanswered questions, and I wanted to visualize a neural net in action. This program is my attempt to become a bit more knowledgeable about it.
The neural net is simply trying to recognize a hand-written digit it is fed, and trains on a batch of 5000 images. The images are taken from the classic MNIST database, resized and grayscaled to 40 x 40 for consistency and efficiency. To accomplish this, the neural net performs gradient descent to converge, and uses a sigmoid function as the activation for each neuron.
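As a rough illustration of the mechanics described above, here is a minimal sketch of the sigmoid activation and one batch gradient-descent step for a single-hidden-layer net. The hidden-layer width, learning rate, and squared-error loss are placeholders for the example, not necessarily the exact choices made in this program.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation used for every neuron.
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    # Input (40x40 pixels, flattened) -> hidden layer -> 10 output classes.
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    return a1, a2

def gradient_step(X, Y, W1, b1, W2, b2, lr=0.5):
    # One batch gradient-descent step (squared-error loss for brevity);
    # updates the weights and biases in place.
    m = X.shape[0]
    a1, a2 = forward(X, W1, b1, W2, b2)
    delta2 = (a2 - Y) * a2 * (1 - a2)          # output-layer error
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)   # back-propagated hidden error
    W2 -= lr * (a1.T @ delta2) / m
    b2 -= lr * delta2.mean(axis=0)
    W1 -= lr * (X.T @ delta1) / m
    b1 -= lr * delta1.mean(axis=0)

# Placeholder shapes: 40*40 inputs, a hidden layer of 25 neurons, 10 digit classes.
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.01, (40 * 40, 25)); b1 = np.zeros(25)
W2 = rng.normal(0, 0.01, (25, 10));      b2 = np.zeros(10)
```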
Starting from a random distribution, the weights converge toward a local minimum.
Different local minima visualized after 10,000 iterations. We can clearly see that they form vastly different shapes.
Was the choice of 25 hidden layers pertinent?
To actually compare it, I chose to start with a single hidden layer and record its accuracy over time; I did the same for up to 10 layers, then up to 55 in increments of 5.
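To make that sweep concrete, here is a sketch of how it could look, generalizing the single-hidden-layer step above to an arbitrary number of hidden layers. The hidden-layer width, learning rate, iteration count, and loss are again assumptions for the example, not the exact settings used in this project.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_net(layer_sizes, rng):
    # One (W, b) pair per connection between consecutive layers.
    return [(rng.normal(0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

def train(X, Y, n_hidden_layers, n_iter=200, lr=0.5, hidden_size=25, seed=0):
    # Fully-connected sigmoid net with `n_hidden_layers` hidden layers, trained
    # by batch gradient descent; records the training accuracy at each iteration.
    rng = np.random.default_rng(seed)
    sizes = [X.shape[1]] + [hidden_size] * n_hidden_layers + [Y.shape[1]]
    net = init_net(sizes, rng)
    m = X.shape[0]
    curve = []
    for _ in range(n_iter):
        # Forward pass, keeping every activation for backpropagation.
        activations = [X]
        for W, b in net:
            activations.append(sigmoid(activations[-1] @ W + b))
        curve.append((activations[-1].argmax(axis=1) == Y.argmax(axis=1)).mean())
        # Backward pass (squared-error loss keeps the example short).
        delta = (activations[-1] - Y) * activations[-1] * (1 - activations[-1])
        for i in reversed(range(len(net))):
            W, b = net[i]
            grad_W = activations[i].T @ delta / m
            grad_b = delta.mean(axis=0)
            if i > 0:
                delta = (delta @ W.T) * activations[i] * (1 - activations[i])
            net[i] = (W - lr * grad_W, b - lr * grad_b)
    return curve

# The sweep: 1 to 10 hidden layers, then 15 to 55 in steps of 5.
layer_counts = list(range(1, 11)) + list(range(15, 56, 5))
# accuracy_curves = {n: train(X_train, Y_train, n) for n in layer_counts}
```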
What I first assumed was that going beyond 8 layers did not improve the algorithm much; moreover, stopping there could prevent overfitting at a high number of iterations. And, as a first thought, taking a more complex model seemed detrimental to reaching a high enough accuracy.
To look at this effect, I searched for the point where the accuracy gain begins to stabilize, which I estimated at about 70 iterations; strangely, it is consistent across all the models. So I tracked the accuracy level of each neural net after X iterations, before that breaking point, which gave me these results:
A counter-intuitive discovery was that a better accuracy was reached more quickly with more hidden layers, up to a certain point, and that 25 hidden layers seems in fact to be the best choice, as it performs best at low iteration counts and transitions well when iterated heavily.
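For reference, this is roughly how the checkpoint comparison described above could be read off the recorded accuracy curves. The checkpoint values and the `accuracy_curves` structure are assumptions for the example, not the exact ones used here.

```python
# Hypothetical post-processing: read the accuracy of each net at fixed
# iteration checkpoints below the ~70-iteration plateau.
# `accuracy_curves` maps a hidden-layer count to its per-iteration accuracy
# list (e.g. the output of the sweep sketched earlier).
checkpoints = [10, 20, 40, 70]

def accuracy_at_checkpoints(accuracy_curves, checkpoints):
    table = {}
    for n_layers, curve in accuracy_curves.items():
        table[n_layers] = [curve[i - 1] for i in checkpoints if i <= len(curve)]
    return table
```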