Race condition fix in checkpoint loading util #3001

Merged
merged 12 commits into from
Feb 13, 2024

Conversation

jessechancy
Contributor

@jessechancy jessechancy commented Feb 13, 2024

What does this PR do?

With HSDP, there are multiple replicas of a model and multiple shards per replica. When a checkpoint is loaded, several shards can try to download the same file at the same time. This race condition causes higher network latency, so checkpoint resumption takes longer.

We fix the issue by deduplicating downloads across the shards in the first replica (only the first replica downloads the files). All required files are gathered first, and each shard then checks whether a lower-ranked shard is already downloading a given file; if so, it skips that download.
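The deduplication rule described above can be sketched roughly as follows. This is a minimal illustration and not the code in `composer/utils/checkpoint.py`: the function name `files_to_download`, the `files_per_rank` structure, and the file names are all hypothetical. The gathered list is simulated in plain Python here; in a real run it would presumably come from a collective such as `torch.distributed.all_gather_object` over the first replica's shards.

```python
from typing import List, Set


def files_to_download(rank: int, files_per_rank: List[List[str]]) -> List[str]:
    """Return the files that shard `rank` should download itself.

    `files_per_rank[i]` holds the files needed by shard i of the first
    replica (hypothetical structure, assumed to be gathered beforehand).
    """
    # Files needed by any lower-ranked shard are left to that shard.
    claimed: Set[str] = set()
    for lower_rank in range(rank):
        claimed.update(files_per_rank[lower_rank])

    # Keep only unclaimed files, preserving order and dropping repeats.
    seen: Set[str] = set()
    result: List[str] = []
    for name in files_per_rank[rank]:
        if name not in claimed and name not in seen:
            seen.add(name)
            result.append(name)
    return result


# Hypothetical file requirements for three shards of the first replica.
gathered = [
    ["__0_0.distcp", "__1_0.distcp"],
    ["__1_0.distcp", "__2_0.distcp"],
    ["__0_0.distcp", "__2_0.distcp", "__3_0.distcp"],
]
plan = {rank: files_to_download(rank, gathered) for rank in range(len(gathered))}
```

With this rule, each file is fetched by the lowest-ranked shard that needs it, so no two shards download the same file.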

mvpatel2000
mvpatel2000 previously approved these changes Feb 13, 2024
Contributor

@mvpatel2000 mvpatel2000 left a comment


Nice!! This is super clean :D

jessechancy and others added 3 commits February 13, 2024 11:28
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
Contributor

@mvpatel2000 mvpatel2000 left a comment


A few minor nits

composer/utils/checkpoint.py (3 review threads, outdated and resolved)
jessechancy and others added 4 commits February 13, 2024 13:15
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
@mvpatel2000 mvpatel2000 merged commit 30e6525 into mosaicml:dev Feb 13, 2024
14 checks passed
3 participants