SGD Learning Rate 'Burn In' #15

Closed
bobo0810 opened this issue Sep 19, 2018 · 5 comments

@bobo0810

Hi, shouldn't the learning rate be updated during the training phase?

@glenn-jocher
Member

@bobo0810 yes, I think you are talking about the SGD learning rate 'burn in', which is supposed to be much smaller for the first 1000 batches of training. This was brought up by @xyutao in issue #2.

I'm going to switch the training from Adam to SGD with burn in in a new commit soon.

glenn-jocher changed the title from "learning_rate issue?" to "SGD Learning Rate 'Burn In'" on Sep 19, 2018
@glenn-jocher
Member

@bobo0810 do you have an exact definition of the learning rate over the course of training? I tried switching to SGD and implementing a burn-in phase but was unsuccessful: the losses diverged before the burn-in completed.

From darknet I think the correct burn-in formula is the following, which slowly ramps the LR up to 1e-3 over the first 1000 iterations and leaves it there:

# SGD burn-in: ramp the LR up over the first 1000 batches of epoch 0
if (epoch == 0) and (i <= 1000):
    power = ??  # unknown exponent -- this is the open question
    lr = 1e-3 * (i / 1000) ** power
    for g in optimizer.param_groups:  # apply the new LR to every param group
        g['lr'] = lr

I can't find the correct value of power though. I tried power=2 and training diverged around 200 iterations; increasing to power=5, training diverged after 400 iterations, and power=10 also diverged.
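
A quick way to see how the exponent shapes the ramp is to evaluate the formula at a few checkpoints. This is purely illustrative: 2, 5 and 10 are the values tried above, and 4 is included only for contrast, not as a claim about darknet's default.

# Illustrative only: how different burn-in exponents shape the LR ramp
base_lr, burn_in = 1e-3, 1000
checkpoints = (100, 250, 500, 1000)
for power in (2, 4, 5, 10):
    lrs = {i: base_lr * (i / burn_in) ** power for i in checkpoints}
    print(f"power={power}: " + ", ".join(f"iter {i} -> {lr:.1e}" for i, lr in lrs.items()))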

I see that the divergence is in the width and height losses; the other terms appear fine. I think one problem may be that the width and height terms are bounded at zero below but unbounded above, so it's possible the network is predicting impossibly large widths and heights, causing those losses to diverge. I may need to bound these or redefine the width and height terms and try again. I used a variant of the width and height terms in a different project that had no divergence problems with SGD.
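
As a rough sketch of what bounding could look like (not the change that actually went into the repo; the function name and max_scale below are made up for illustration), the raw width/height outputs could be squashed through a sigmoid instead of the unbounded exp():

import torch

def bounded_wh(tw, th, anchor_w, anchor_h, max_scale=4.0):
    # Instead of w = anchor_w * exp(tw), which can grow without bound,
    # cap the box at max_scale times the anchor size via a sigmoid.
    w = anchor_w * max_scale * torch.sigmoid(tw)
    h = anchor_h * max_scale * torch.sigmoid(th)
    return w, h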

@glenn-jocher
Member

@bobo0810 I've switched from Adam to SGD with burn-in (which exponentially ramps up the learning rate from 0 to 0.001 over the first 1000 iterations) in commit a722601.
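
For reference, a burn-in of this shape can also be expressed with PyTorch's LambdaLR scheduler. The sketch below is illustrative only: the exponent 4 and the placeholder model are assumptions, not necessarily what commit a722601 contains.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model for the sketch
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

burn_in, power = 1000, 4  # power chosen for illustration only

# Multiplier on the base LR: ramp during burn-in, then hold at 1.0
ramp = lambda it: (it / burn_in) ** power if it < burn_in else 1.0
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=ramp)

for i in range(2000):    # stand-in for the per-batch training loop
    optimizer.step()     # would follow loss.backward() in real training
    scheduler.step()     # advances the iteration count used by `ramp`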

@bobo0810
Author

Thank you very much.

@glenn-jocher
Member

@bobo0810 you're welcome, but the change opened up different issues: mainly, the height and width terms diverged during training, so I had to bound them using new height and width calculations. See issue #2 for a full explanation.
