
Horovod: adjust base LR used by schedulers to scale with the number of workers #2626

Merged
merged 6 commits into Lightning-AI:master on Jul 23, 2020

Conversation

tgaddair
Contributor

In #2574 it was observed that the learning rate used to initialize learning rate schedulers conflicts with the way we scale the learning rate by the number of Horovod workers: because the schedulers are initialized before the scaling, the scaled value gets overridden. This PR fixes that so optimizers and LR schedulers scale up correctly.
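For readers skimming the diff, here is a minimal standalone sketch of the idea using raw Horovod and PyTorch (not the actual Lightning change): the optimizer's learning rate is scaled by the number of workers before the scheduler is built, so the scheduler records the scaled value in its `base_lrs`.

```python
# Minimal sketch of the idea, assuming Horovod and PyTorch are installed;
# this is an illustration of the ordering issue, not the actual Lightning diff.
import horovod.torch as hvd
import torch

hvd.init()

model = torch.nn.Linear(10, 1)
base_lr = 0.01

# Scale the learning rate by the number of workers *before* building the
# scheduler, so the scheduler stores the scaled value as its base LR.
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * hvd.size())
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

# With the buggy ordering (scheduler created before the scaling), base_lrs
# would still hold the unscaled value and each scheduler step would override
# the scaling; the patch-up in that case would be:
#   scheduler.base_lrs = [lr * hvd.size() for lr in scheduler.base_lrs]
```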

Follow-ups to consider would be:

  • Make scaling work for other backends like DDP.
  • Add an option to configure the LR scaling in the Trainer.

@mergify mergify bot requested a review from a team July 16, 2020 22:08
@codecov

codecov bot commented Jul 16, 2020

Codecov Report

Merging #2626 into master will increase coverage by 0%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #2626    +/-   ##
=======================================
  Coverage      91%     92%            
=======================================
  Files          72      74     +2     
  Lines        6131    6321   +190     
=======================================
+ Hits         5599    5791   +192     
+ Misses        532     530     -2     

@Borda Borda added the feature Is an improvement or enhancement label Jul 16, 2020
@mergify
Contributor

mergify bot commented Jul 21, 2020

This pull request is now in conflict... :(

@tgaddair
Contributor Author

@williamFalcon @Borda failing tests seem unrelated. Could you take a look at the PR and see if everything makes sense?

As a follow-up, I want to add a param to Trainer so we can override the LR scale when using Horovod/DDP.

@Borda
Member

Borda commented Jul 22, 2020

@williamFalcon @Borda failing tests seem unrelated. Could you take a look at the PR and see if everything makes sense?

yes, no connection to this PR

@Borda Borda added the ready PRs ready to be merged label Jul 22, 2020
@mergify mergify bot requested a review from a team July 22, 2020 17:41
@williamFalcon williamFalcon merged commit 1369012 into Lightning-AI:master Jul 23, 2020
@amorehead
Contributor

amorehead commented Sep 16, 2022

Hello. I was curious whether there have been any other PRs that add the ability for users to specify which learning rate scaling strategy should be used with, e.g., DDP/Horovod. As a frequent user of DDP with Lightning, I would love to have my optimizer's learning rate automatically scaled according to the effective batch size across nodes and their GPUs (i.e., the total world size).
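In the meantime, one workaround is to apply the linear scaling rule manually in `configure_optimizers`. This is a hedged sketch, not a built-in Lightning option; it assumes `self.trainer.world_size` reports the total number of processes (nodes × GPUs) in your Lightning version.

```python
# Hedged sketch: manual linear LR scaling by world size inside a LightningModule.
# Assumes self.trainer.world_size equals the total number of processes; adjust
# for your Lightning version if it does not.
import pytorch_lightning as pl
import torch


class ScaledLRModule(pl.LightningModule):
    def __init__(self, base_lr: float = 0.01):
        super().__init__()
        self.base_lr = base_lr
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        # Linear scaling rule: grow the LR with the effective (global) batch size.
        world_size = max(1, int(self.trainer.world_size))
        return torch.optim.SGD(self.parameters(), lr=self.base_lr * world_size)
```

Whether linear scaling (as opposed to, say, square-root scaling) is appropriate depends on the optimizer and batch size, which is one reason a configurable strategy like the one asked about above would be useful.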
