Implement adaptive learning rate #30

Closed
kloudkl opened this issue Jan 13, 2014 · 1 comment

kloudkl commented Jan 13, 2014

Commit Yangqing/caffe@4c2c197 says “regarding https://plus.google.com/113952791760990667476/posts/Q5fwQY6JeEq - stopped the adagrad attempt”.

The post and its comments discussed three interacting factors, namely adaptive learning rates, momentum, and synchronicity, that greatly impact the stability of the learning process. The discussion did not conclude that all adaptive learning rate scheduling schemes are harmful. It is still worthwhile to consider implementing an improved variant of AdaGrad, for three reasons.

First, AdaGrad's creator has made new progress. In the comments, responding to Daniel Povey's concern about the convergence of AdaGrad, Fernando Pereira mentioned that John Duchi had proposed a variant that supports asynchronous updates. The work was published at NIPS 2013[1]. Although reviewer 6 suspected that the method had limited applicability, the authors responded that "in additional experiments with image and speech data" they saw "similar benefits to those reported in the paper".

Second, one of AdaGrad's critics improved it. Andrew Senior of Google said that AdaGrad performed worse than synchronous and asynchronous SGD in some recent speech experiments. While two[2][3] of his four ICASSP 2013 papers provided supportive evidence of AdaGrad's performance relative to SGD, he also identified its limitations and demonstrated that AdaDec, which "decouples long-term learning-rate scheduling from per-parameter learning rate variation", achieved better frame accuracies[4].
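
To make the contrast concrete, here is a rough sketch in Python/NumPy. The function names and constants are illustrative and the exact recursion in [4] may differ; the point is only that AdaGrad keeps an ever-growing sum of squared gradients, so its per-parameter rate can only shrink, while an AdaDec-style rule decays that accumulator and multiplies by a separate global schedule.

```python
import numpy as np

def adagrad_rate(sq_grad_sum, grad, eta0=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients forever; the rate only shrinks."""
    sq_grad_sum = sq_grad_sum + grad ** 2
    return eta0 / (np.sqrt(sq_grad_sum) + eps), sq_grad_sum

def adadec_style_rate(sq_grad_acc, grad, t, eta0=0.01, gamma=0.999,
                      half_life=10000, eps=1e-8):
    """AdaDec-style: decayed accumulator plus a separate global decay schedule."""
    sq_grad_acc = gamma * sq_grad_acc + grad ** 2   # forgets old gradients
    global_rate = eta0 * 0.5 ** (t / half_life)     # long-term schedule, decoupled
    return global_rate / (np.sqrt(sq_grad_acc) + eps), sq_grad_acc
```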

Third, experiments using both an adaptive learning rate and momentum showed stable convergence. The post's author, Daniel Povey, who had found momentum unstable, eventually made it work by limiting the parameter update per minibatch. This could probably be implemented with the "clipped gradient" and "sparser gradients via sparse output regularization and rectified outputs" techniques, two of the ideas that enable effective training of recurrent neural networks[5]. Therefore, an adaptive learning rate does not necessarily interfere with momentum and lead to divergent training.
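
For reference, limiting the update per minibatch via a clipped gradient could look roughly like the following Python/NumPy sketch (the threshold and names are made up for illustration, not taken from [5] verbatim): rescale the minibatch gradient whenever its norm exceeds a cap, so a single noisy batch cannot be amplified by momentum into a divergent step.

```python
import numpy as np

def clip_gradient(grad, max_norm=10.0):
    """Rescale grad so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# Sketch of use inside a momentum update (velocity v, weights w, rate lr):
#     g = clip_gradient(g, max_norm=10.0)
#     v = momentum * v - lr * g
#     w = w + v
```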

There are already multiple AdaGrad variants with different convergence guarantees. The first step toward resolving this issue is to choose the one that is most suitable for stable synchronous and asynchronous training on image datasets.

[1] John C. Duchi, Michael I. Jordan, and Brendan McMahan. Estimation, Optimization, and Parallelism when Data is Sparse. Advances in Neural Information Processing Systems (NIPS), 2013.
[2] M.D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G.E. Hinton. On Rectified Linear Units for Speech Processing. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, 2013.
[3] Georg Heigold, Vincent Vanhoucke, Andrew Senior, Patrick Nguyen, Marc'Aurelio Ranzato, Matthieu Devin, and Jeff Dean. Multilingual Acoustic Models Using Distributed Deep Neural Networks. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, 2013.
[4] Andrew Senior, Georg Heigold, Marc'Aurelio Ranzato, and Ke Yang. An Empirical Study of Learning Rates in Deep Neural Networks for Speech Recognition. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, 2013.
[5] Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. Advances in Optimizing Recurrent Networks. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, 2013.

kloudkl commented Jul 21, 2014

The original AdaGrad algorithm[6] has been implemented in #741 by @qipeng.

[6] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research (JMLR), 2011.
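
For anyone following along, the diagonal AdaGrad update from [6] is essentially the following (a minimal Python/NumPy sketch with illustrative names, not the actual code in #741): keep a per-parameter history of squared gradients and divide each step by its square root.

```python
import numpy as np

def adagrad_step(w, grad, hist, lr=0.01, eps=1e-8):
    """One diagonal AdaGrad step: effective rate is lr / sqrt(sum of g_t^2)."""
    hist = hist + grad ** 2                       # running sum of squared gradients
    w = w - lr * grad / (np.sqrt(hist) + eps)     # element-wise scaled step
    return w, hist

# hist starts at np.zeros_like(w) and is carried across minibatches:
#     w, hist = adagrad_step(w, grad, hist)
```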
