
fix incorrect learning rate warm-up after restarting from ckpt #182

Merged 3 commits on Aug 2, 2022

Conversation

bozhang-hpc
Contributor

A simple fix to #181

@gahdritz
Collaborator

Could you explain how this works? Where is resume_last_lr_step defined? When is self.last_lr_step ever assigned to anything except -1?

@bozhang-hpc
Contributor Author

Oh, I forgot to include the definition of resume_last_lr_step(). It was added in commit 69a8a28.
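For readers following the thread, here is a minimal sketch of the idea under discussion: a warm-up schedule whose step counter can be restored after a restart, so that warm-up does not begin again from zero. Only the names `last_lr_step` and `resume_last_lr_step` come from this conversation. The class `WarmupSchedule`, its parameters `base_lr` and `warmup_steps`, and the linear warm-up rule are assumptions made for illustration and are not the code from this PR.

```python
class WarmupSchedule:
    """Hypothetical linear warm-up schedule (illustration only, not the PR's code)."""

    def __init__(self, base_lr: float, warmup_steps: int, last_lr_step: int = -1):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        # -1 means no optimizer step has been taken yet.
        self.last_lr_step = last_lr_step

    def resume_last_lr_step(self, step: int) -> None:
        # Restore the counter from a checkpoint so warm-up continues
        # from where training stopped instead of restarting.
        self.last_lr_step = step

    def step(self) -> float:
        # Advance one step and return the learning rate for that step.
        self.last_lr_step += 1
        if self.last_lr_step < self.warmup_steps:
            return self.base_lr * (self.last_lr_step + 1) / self.warmup_steps
        return self.base_lr
```

Without the call to `resume_last_lr_step`, a schedule rebuilt after a restart would keep the default `last_lr_step = -1` and ramp the learning rate up again, which appears to be the behavior this PR addresses.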

@gahdritz
Collaborator

gahdritz commented Aug 2, 2022

One more thing: this assumes that resume_from_ckpt is a ZeRO checkpoint. You should add a quick check that the resume_from_ckpt argument is a directory and not a file.

@bozhang-hpc
Contributor Author

Non-DeepSpeed checkpoint files are supported now.
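A rough sketch of the kind of check requested above, assuming that a DeepSpeed ZeRO checkpoint is a directory and any other checkpoint is a single file. The function name `checkpoint_kind` and the returned labels are hypothetical. How the last step is then read from each kind of checkpoint is not shown in this thread and is left out here.

```python
import os


def checkpoint_kind(resume_from_ckpt: str) -> str:
    """Classify a checkpoint path (hypothetical helper, illustration only)."""
    if os.path.isdir(resume_from_ckpt):
        # Assumption: a directory means a DeepSpeed ZeRO checkpoint.
        return "deepspeed_dir"
    if os.path.isfile(resume_from_ckpt):
        # Assumption: a regular file means a non-DeepSpeed checkpoint.
        return "single_file"
    raise FileNotFoundError(f"checkpoint not found: {resume_from_ckpt}")
```

Branching on the result before restoring the learning rate step keeps the ZeRO-specific loading path from being applied to a plain checkpoint file.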

@gahdritz gahdritz merged commit 87f3cd4 into aqlaboratory:main Aug 2, 2022