Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tensor Parallelism v2 #3335

Merged
merged 20 commits into from
May 30, 2024
Merged

Tensor Parallelism v2 #3335

merged 20 commits into from
May 30, 2024

Conversation

mvpatel2000
Copy link
Contributor

@mvpatel2000 mvpatel2000 commented May 28, 2024

What does this PR do?

Restores #3269, which was originally reverted. Fixes the following issues:

  • DeviceMesh parsed shard and replicate arguments in the wrong order. Fixed run 0-16gpu-save-l4Gcli
  • Raise error for backwards compatibility and give instructions to fix (daily tests passing)

Follow-on Tasks:

  • Switch to Dataclass
  • Load monolith rank0 only support
  • Train TP only
  • Add TP vs. FSDP determinism test (blocked by no monolith checkpointing)
  • When Composer releases, migrate foundry to new API
  • Migrate types in Composer

tests/trainer/test_fsdp.py Show resolved Hide resolved
composer/trainer/trainer.py Outdated Show resolved Hide resolved
Copy link
Contributor

@bigning bigning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add the e2e that Daniel run and failed in your v1 PR ?

composer/core/state.py Show resolved Hide resolved
@mvpatel2000
Copy link
Contributor Author

Could you add the e2e that Daniel run and failed in your v1 PR ?

Added to description!

@mvpatel2000 mvpatel2000 requested a review from bigning May 30, 2024 14:28
Copy link
Contributor

@b-chu b-chu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we gate the old tests to world size 2? I don't see a reason to test world size 2 and 4

composer/callbacks/checkpoint_saver.py Show resolved Hide resolved
composer/core/state.py Show resolved Hide resolved
composer/core/state.py Show resolved Hide resolved
composer/core/state.py Show resolved Hide resolved
composer/distributed/__init__.py Show resolved Hide resolved
composer/distributed/mosaic_fsdp.py Show resolved Hide resolved
composer/distributed/mosaic_fsdp.py Show resolved Hide resolved
composer/distributed/mosaic_fsdp.py Outdated Show resolved Hide resolved
tests/trainer/test_tp.py Show resolved Hide resolved
@mvpatel2000 mvpatel2000 requested a review from b-chu May 30, 2024 15:24
@mvpatel2000 mvpatel2000 merged commit c9a51d4 into dev May 30, 2024
13 checks passed
@mvpatel2000 mvpatel2000 deleted the mvpatel2000/tp-v3 branch May 30, 2024 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants