Implement adaptive learning rate #30

Closed
kloudkl opened this issue Jan 13, 2014 · 1 comment

kloudkl commented Jan 13, 2014

Commit Yangqing/caffe@4c2c197 says “regarding https://plus.google.com/113952791760990667476/posts/Q5fwQY6JeEq - stopped the adagrad attempt”.

The post and its comments discussed three interacting factors, namely adaptive learning rates, momentum, and synchronicity, that greatly impact the stability of the learning process. The discussion did not conclude that all adaptive learning rate scheduling schemes are harmful. It is still worthwhile to consider implementing an improved variant of AdaGrad, for three reasons.

First, AdaGrad's creator has made new progress. In the comments, responding to Daniel Povey's concern about the convergence of AdaGrad, Fernando Pereira mentioned that John Duchi had proposed a variant that supports asynchronous updates. The work was published at NIPS 2013[1]. Although reviewer 6 suspected that the method had limited applicability, the authors responded that "in additional experiments with image and speech data" they saw "similar benefits to those reported in the paper".

Second, one of AdaGrad's critics improved it. Andrew Senior of Google said that AdaGrad performed worse than synchronous and asynchronous SGD in some recent speech experiments. While two[2][3] of his four ICASSP 2013 papers provided supportive evidence of AdaGrad's performance relative to SGD, he also identified its limitations and demonstrated that AdaDec, which "decouples long-term learning-rate scheduling from per-parameter learning rate variation", achieved better frame accuracies[4].
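
To make the contrast concrete, here is a rough sketch in Python/NumPy. The function names and constants are illustrative and the exact recursion in [4] may differ; the point is only that AdaGrad keeps an ever-growing sum of squared gradients, so its per-parameter rate can only shrink, while an AdaDec-style rule decays that accumulator and multiplies by a separate global schedule.

```python
import numpy as np

def adagrad_rate(sq_grad_sum, grad, eta0=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients forever; the rate only shrinks."""
    sq_grad_sum = sq_grad_sum + grad ** 2
    return eta0 / (np.sqrt(sq_grad_sum) + eps), sq_grad_sum

def adadec_style_rate(sq_grad_acc, grad, t, eta0=0.01, gamma=0.999,
                      half_life=10000, eps=1e-8):
    """AdaDec-style: decayed accumulator plus a separate global decay schedule."""
    sq_grad_acc = gamma * sq_grad_acc + grad ** 2   # forgets old gradients
    global_rate = eta0 * 0.5 ** (t / half_life)     # long-term schedule, decoupled
    return global_rate / (np.sqrt(sq_grad_acc) + eps), sq_grad_acc
```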

Third, experiments using both an adaptive learning rate and momentum showed stable convergence. The post's author, Daniel Povey, who had found momentum unstable, eventually made it work by limiting the parameter update per minibatch. This could probably be implemented with the "clipped gradient" and "sparser gradients via sparse output regularization and rectified outputs" techniques, two of the ideas that enable effective training of recurrent neural networks[5]. Therefore, an adaptive learning rate does not necessarily interfere with momentum and lead to divergent training.
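
For reference, limiting the update per minibatch via a clipped gradient could look roughly like the following Python/NumPy sketch (the threshold and names are made up for illustration, not taken from [5] verbatim): rescale the minibatch gradient whenever its norm exceeds a cap, so a single noisy batch cannot be amplified by momentum into a divergent step.

```python
import numpy as np

def clip_gradient(grad, max_norm=10.0):
    """Rescale grad so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# Sketch of use inside a momentum update (velocity v, weights w, rate lr):
#     g = clip_gradient(g, max_norm=10.0)
#     v = momentum * v - lr * g
#     w = w + v
```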

There are already multiple AdaGrad variants with different convergence guarantees. The first step toward resolving this issue is to choose the one that is most suitable for stable synchronous and asynchronous training on image datasets.

[1] John C. Duchi, Michael I. Jordan, and Brendan McMahan. Estimation, Optimization, and Parallelism when Data is Sparse. Advances in Neural Information Processing Systems (NIPS), 2013.
[2] M.D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G.E. Hinton. On Rectified Linear Units for Speech Processing. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, 2013.
[3] Georg Heigold, Vincent Vanhoucke, Andrew Senior, Patrick Nguyen, Marc'Aurelio Ranzato, Matthieu Devin, and Jeff Dean. Multilingual Acoustic Models Using Distributed Deep Neural Networks. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, 2013.
[4] Andrew Senior, Georg Heigold, Marc'Aurelio Ranzato, and Ke Yang. An Empirical Study of Learning Rates in Deep Neural Networks for Speech Recognition. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, 2013.
[5] Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. Advances in Optimizing Recurrent Networks. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, 2013.

kloudkl commented Jul 21, 2014

The original AdaGrad algorithm[6] has been implemented in #741 by @qipeng.

[6] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research (JMLR), 2011.
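
For anyone following along, the diagonal AdaGrad update from [6] is essentially the following (a minimal Python/NumPy sketch with illustrative names, not the actual code in #741): keep a per-parameter history of squared gradients and divide each step by its square root.

```python
import numpy as np

def adagrad_step(w, grad, hist, lr=0.01, eps=1e-8):
    """One diagonal AdaGrad step: effective rate is lr / sqrt(sum of g_t^2)."""
    hist = hist + grad ** 2                       # running sum of squared gradients
    w = w - lr * grad / (np.sqrt(hist) + eps)     # element-wise scaled step
    return w, hist

# hist starts at np.zeros_like(w) and is carried across minibatches:
#     w, hist = adagrad_step(w, grad, hist)
```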
