Raise errors on all ranks for checkpoint download failures #3345
LGTM modulo the error message -- can we be more descriptive about why we need to kill the processes and what action the user should take (view logs on the failed ranks)?
thanks for the nice fix!
Co-authored-by: bigning <ning.wang@databricks.com>
What does this PR do?
Raises an error on all ranks if any rank experiences a checkpoint download failure. This addresses a known issue where checkpoint errors are not raised correctly when only some of the ranks fail: pytorch/pytorch#122529
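For reviewers, the idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `raise_on_any_rank_failure` and the `gather_flags` callable are hypothetical names, and `gather_flags` stands in for whatever collective the trainer uses to share one boolean per rank (for example, something built on `torch.distributed.all_gather_object`). The wording of the error message is also only an assumption about what a more descriptive message could look like.

```python
from typing import Callable, List, Optional


def raise_on_any_rank_failure(
    local_error: Optional[Exception],
    rank: int,
    gather_flags: Callable[[bool], List[bool]],
) -> None:
    """Raise on every rank if at least one rank failed to download.

    `gather_flags` is a hypothetical stand-in for a collective that takes this
    rank's flag and returns the flags of all ranks, indexed by rank.
    """
    # Every rank must reach the collective, including the ranks that failed,
    # so the caller should catch its download error instead of raising early.
    flags = gather_flags(local_error is not None)
    failed_ranks = [r for r, failed in enumerate(flags) if failed]
    if not failed_ranks:
        return

    message = (
        f"Checkpoint download failed on rank(s) {failed_ranks}. "
        f"Stopping rank {rank} as well so that no rank continues without the checkpoint; "
        "check the logs of the failed ranks for the underlying error."
    )
    # On the rank that actually failed, chain the original exception.
    raise RuntimeError(message) from local_error


if __name__ == "__main__":
    # Simulate three ranks in one process; rank 1 fails its download.
    simulated_flags = [False, True, False]
    for rank in range(3):
        err = OSError("download failed") if simulated_flags[rank] else None
        try:
            raise_on_any_rank_failure(err, rank, lambda _flag: simulated_flags)
        except RuntimeError as exc:
            print(f"rank {rank}: {exc}")
```

In a real job each rank would wrap its own download in `try`/`except`, pass the caught exception (or `None`) to this helper, and let the collective decide whether all ranks should stop.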
Manual Test Runs
Loading valid checkpoint still works:
valid-checkpoint-sUUyo9
Loading corrupt checkpoint errors:
corrupt-checkpoint-5RYyIa
What issue(s) does this change relate to?
Before submitting
Did you run pre-commit on your change? (see the pre-commit section of prerequisites)