new dist_cp save planner to fix issue that each rank needs to download all checkpoint files #3271

Merged: 8 commits merged into dev on May 13, 2024

Conversation

@bigning (Contributor) commented on May 9, 2024

What's the issue

PyTorch issue: pytorch/pytorch#125740
When PyTorch 2.3 saves checkpoint files, it dedups duplicated tensors. The dedup tries to balance the storage needed on each rank, so duplicated tensors end up saved on different ranks. The consequence is that, when loading a checkpoint, each rank needs to read files saved by other ranks even without resharding. Ideally, each rank_k only needs to download the rank_k and rank_0 files.

What's the fix

We save each duplicated tensor on the smallest rank only, which is how torch 2.1 and torch 2.2 do the dedup. A sketch of this approach is shown below.
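
For illustration only, here is a minimal sketch of how lowest-rank dedup could be expressed as a custom planner built on `DefaultSavePlanner`. The class name `LowestRankDedupSavePlanner` and its internals are assumptions for the sketch, not the exact code added in this PR:

```python
# Minimal sketch (not the exact planner from this PR): keep each duplicated
# write item only on the lowest rank that produced it, mirroring the
# torch 2.1/2.2 dedup behavior instead of torch 2.3's storage balancing.
import dataclasses
from typing import List, Tuple

from torch.distributed.checkpoint.default_planner import DefaultSavePlanner
from torch.distributed.checkpoint.metadata import Metadata
from torch.distributed.checkpoint.planner import SavePlan


class LowestRankDedupSavePlanner(DefaultSavePlanner):

    def create_global_plan(self, all_plans: List[SavePlan]) -> Tuple[List[SavePlan], Metadata]:
        seen = set()
        deduped_plans = []
        # all_plans is ordered by rank, so iterating in order keeps the first
        # (lowest-rank) copy of each duplicated item and drops the rest.
        for plan in all_plans:
            kept_items = []
            for item in plan.items:
                if item.index in seen:
                    continue
                seen.add(item.index)
                kept_items.append(item)
            deduped_plans.append(dataclasses.replace(plan, items=kept_items))
        # With duplicates already removed, the parent's dedup has nothing left
        # to rebalance, so rank_k's shards stay in rank_k's file.
        return super().create_global_plan(deduped_plans)
```

Such a planner would be passed through the `planner` argument of `torch.distributed.checkpoint.save` (or `save_state_dict` on older torch versions).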

Test

Tested on 512 GPUs; both save and load work.

@bigning changed the title from "Fix lg cp new planner" to "new planner" on May 9, 2024
@bigning bigning marked this pull request as ready for review May 10, 2024 17:11
@bigning bigning requested review from eracah and milocress May 10, 2024 17:11
@bigning changed the title from "new planner" to "new dist_cp save planner to fix issue that each rank needs to download all checkpoint files" on May 10, 2024
@bigning bigning merged commit beb5a35 into dev May 13, 2024
15 checks passed
@bigning bigning deleted the fix_lg_cp_new_planner branch May 13, 2024 17:00
j316chuck pushed a commit that referenced this pull request on May 16, 2024: new dist_cp save planner to fix issue that each rank needs to download all checkpoint files (#3271)

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>