Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if automicrobatching is on before using FSDP modules in Eval #3521

Closed
wants to merge 4 commits into from

Conversation

JackZ-db
Copy link
Contributor

@JackZ-db JackZ-db commented Aug 5, 2024

Runs resuming from pre-sync-dropping Composer broke due to FSDP modules not existing in the Eval Loop. This checks for automicrobatching before calling self.fsdp_modules.

Comment on lines 3756 to 3760
if self.state.auto_microbatching:
sync_hook = _create_sync_hook(self.state)
if self.state.fsdp_enabled and len(self.automicrobatch_fsdp_hook_handles) == 0:
self.automicrobatch_fsdp_hook_handles = _readd_fsdp_sync_hooks(self.fsdp_modules, sync_hook)
self.num_consecutive_non_OOM_batches = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this solves the issue -- if you resume with automicrobatching on, then it will still fail. Instead, the top level if statement should also check if attr fsdp_modules exists.

Can you verify also this works with Jacob's run?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't understand -- shouldn't self.fsdp_modules be set during FSDP wrap on init?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea it doesnt the issue is in checkpointing, where prepare_fsdp_module is called and doesnt return anything

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants