Fix the FSDP.optim_state_dict_to_load OOM #3184

Merged: 12 commits, Apr 10, 2024

@bigning bigning commented Apr 10, 2024

What does this PR do?

Fixes the OOM in FSDP.optim_state_dict_to_load. The fix is already included in pytorch>=2.3.0 (pytorch/pytorch#117261).

Test

  1. Before the change, the run OOMs in the first forward pass: dbrx-dense-20b-debug-autoresume-AdiVZi

     Memory usage before the forward pass:
     [image]

  2. After the change, the run trains successfully: dbrx-dense-20b-debug-autoresume-3QVblX

     [image]

@bigning bigning marked this pull request as ready for review April 10, 2024 17:10
@bigning bigning changed the title from "up" to "Fix the FSDP.optim_state_dict_to_load OOM" Apr 10, 2024

@snarayan21 snarayan21 left a comment

That's a crazy amount of extra memory usage, damn.

Can you make the PR title more descriptive than "up"? Other than that and one minor comment, LGTM! Thanks for finding and fixing this so quickly.

composer/trainer/mosaic_fsdp_utils.py (review comment, resolved)
@mvpatel2000 mvpatel2000 left a comment

Discussed offline: please apply this only for PyTorch 2.2.2!
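
For readers following along, here is a minimal sketch of what gating a patch on one exact release could look like. It is not the code from this PR: `parse_release`, `needs_patch`, `maybe_patch`, and the `fake_fsdp` stand-in are hypothetical names used only for illustration, and the object actually patched in composer is not shown in this thread.

```python
import re
from types import SimpleNamespace


def parse_release(version: str) -> tuple:
    """Return the leading numeric release, e.g. '2.2.2+cu121' -> (2, 2, 2)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def needs_patch(torch_version: str) -> bool:
    """True only for the 2.2.2 release, the single version the gate targets."""
    return parse_release(torch_version) == (2, 2, 2)


def maybe_patch(target, attr_name: str, replacement, torch_version: str) -> bool:
    """Replace target.attr_name with replacement if the version gate passes."""
    if not needs_patch(torch_version):
        return False
    setattr(target, attr_name, replacement)
    return True


# Hypothetical stand-in object so the sketch runs without torch installed.
fake_fsdp = SimpleNamespace(optim_state_dict_to_load=lambda: "original")
patched = maybe_patch(
    fake_fsdp,
    "optim_state_dict_to_load",
    lambda: "patched",
    "2.2.2+cu121",
)
```

In a real setup one would presumably pass `torch.__version__` as the version string. Matching the exact release, instead of using a `<` comparison, keeps the patch from touching versions it was never tested against.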

@bigning bigning enabled auto-merge (squash) April 10, 2024 19:03
@bigning bigning merged commit 52776a7 into dev Apr 10, 2024
14 checks passed
@bigning bigning deleted the fix-autoresume-oom branch April 10, 2024 20:17
staghado pushed a commit to lightonai/composer that referenced this pull request Apr 13, 2024
* up

* up

* up

* a

* a

* up

* up

* comments

* up

* lint

* line
j316chuck pushed a commit that referenced this pull request May 16, 2024
4 participants