Raise errors on all ranks for checkpoint download failures #3345

Merged: 11 commits, May 31, 2024

Conversation

@irenedea irenedea commented May 30, 2024

What does this PR do?

Raises errors on all ranks if any rank experiences a download failure. This addresses a known issue where checkpoint errors are not raised correctly when only some of the ranks fail: pytorch/pytorch#122529
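The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `raise_if_any_rank_failed`, its `all_gather` parameter, and `simulate` are hypothetical names, and the error message is made up. In a real distributed job the gather step would be a collective (for example, something like `torch.distributed.all_gather_object`); here it is injected as a plain callable so the sketch runs in a single process.

```python
from typing import Callable, List, Optional


def raise_if_any_rank_failed(
    local_error: Optional[BaseException],
    all_gather: Callable[[bool], List[bool]],
) -> None:
    """Raise on the calling rank if any rank reported a download failure.

    `all_gather` is a hypothetical stand-in for a collective: it takes this
    rank's flag and returns the flags of all ranks, indexed by rank.
    """
    failed_flags = all_gather(local_error is not None)
    failed_ranks = [rank for rank, failed in enumerate(failed_flags) if failed]
    if not failed_ranks:
        return  # Every rank downloaded successfully.

    message = (
        f'Checkpoint download failed on ranks {failed_ranks}. '
        'Raising on all ranks so that none proceeds with a partial download; '
        'see the logs of the failed ranks for the underlying error.'
    )
    if local_error is not None:
        # Keep the original exception attached on the ranks that failed.
        raise RuntimeError(message) from local_error
    raise RuntimeError(message)


def simulate(errors: List[Optional[BaseException]]) -> List[str]:
    """Single-process simulation: one entry in `errors` per pretend rank."""
    flags = [err is not None for err in errors]
    outcomes = []
    for err in errors:
        try:
            raise_if_any_rank_failed(err, lambda _flag: flags)
            outcomes.append('ok')
        except RuntimeError:
            outcomes.append('raised')
    return outcomes
```

With this shape, a failure on a single rank surfaces as an exception everywhere instead of leaving the healthy ranks to continue or hang, which is the goal stated in the description.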

Manual Test Runs

Loading valid checkpoint still works: valid-checkpoint-sUUyo9
Loading a corrupt checkpoint raises an error: corrupt-checkpoint-5RYyIa

What issue(s) does this change relate to?

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@irenedea irenedea marked this pull request as ready for review May 30, 2024 23:27
@irenedea irenedea requested a review from mvpatel2000 May 30, 2024 23:28
@mvpatel2000 mvpatel2000 left a comment

LGTM modulo the error message. Can we be more descriptive about why we need to kill processes and what action the user should take (view logs on failed ranks)?

composer/utils/checkpoint.py (outdated, resolved)
@bigning bigning left a comment

thanks for the nice fix!

composer/utils/checkpoint.py (outdated, resolved)
composer/utils/checkpoint.py (outdated, resolved)
composer/utils/checkpoint.py (outdated, resolved)
@irenedea irenedea changed the title Terminate all processes for errors in checkpoint download Raise errors on all ranks for checkpoint download failures May 31, 2024
@irenedea irenedea enabled auto-merge (squash) May 31, 2024 04:23
@irenedea irenedea merged commit 3c0a817 into dev May 31, 2024
15 checks passed
@irenedea irenedea deleted the kill-all branch May 31, 2024 05:07
3 participants