
Add Sagemaker DDP Plugin #6271

Closed · wants to merge 47 commits

Conversation

kaushikb11
Contributor

What does this PR do?

Add Sagemaker DDP Plugin. Reference

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@kaushikb11 kaushikb11 requested a review from tchaton March 1, 2021 19:15
@codecov

codecov bot commented Mar 1, 2021

Codecov Report

Merging #6271 (4a0f78d) into master (8193bae) will decrease coverage by 7%.
The diff coverage is 39%.

❗ Current head 4a0f78d differs from pull request most recent head 1ac88a8. Consider uploading reports for the commit 1ac88a8 to get more accurate results.

@@           Coverage Diff            @@
##           master   #6271     +/-   ##
========================================
- Coverage      93%     86%     -7%     
========================================
  Files         212     216      +4     
  Lines       13720   14070    +350     
========================================
- Hits        12751   12035    -716     
- Misses        969    2035   +1066     

@Borda Borda requested review from Borda and SeanNaren March 2, 2021 17:00
Contributor

@tchaton tchaton left a comment


Overall looks good!
Questions:

  • I think you can subclass DDPPlugin directly.
  • What about DDPSpawn too?
  • And should we add an example with the Sagemaker API?


def __init__(self):
    if not _SMDIST_AVAILABLE:
        raise MisconfigurationException("`smdistributed` module is not available.")
Contributor


Add a small description on how to make this work.

Contributor Author


Done!

Member


Suggested change
raise MisconfigurationException("`smdistributed` module is not available.")
raise MisconfigurationException("`smdistributed` package is not available.")

Also add how to install it.

pytorch_lightning/plugins/training_type/smddp.py: 4 outdated review threads, resolved
Member

@justusschock justusschock left a comment


Like it, just a few minor comments.

pytorch_lightning/plugins/training_type/smddp.py: 2 outdated review threads, resolved

self.sync_batchnorm = sync_batchnorm
self.dist = SMLightningDistributed()
self.num_nodes = len(os.environ['SM_HOSTS'])
Member


Is this something we maybe should extend the Environment class by?

cc @awaelchli
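On the snippet above: SageMaker's `SM_HOSTS` environment variable holds a JSON-encoded list of hostnames, so calling `len()` on the raw string would count characters rather than nodes. A minimal sketch of parsing it first; the value set below is a hypothetical example, on SageMaker the platform sets it:

```python
import json
import os

# Hypothetical SM_HOSTS value for illustration; on SageMaker the platform
# sets this to a JSON list of the training cluster's hostnames.
os.environ["SM_HOSTS"] = '["algo-1", "algo-2"]'

hosts = json.loads(os.environ["SM_HOSTS"])  # a real Python list of hostnames
num_nodes = len(hosts)  # 2 nodes, not the length of the raw string
```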

pytorch_lightning/utilities/enums.py: outdated review thread, resolved
@kaushikb11 kaushikb11 marked this pull request as ready for review March 10, 2021 11:34
@pep8speaks

pep8speaks commented Jul 3, 2021

Hello @kaushikb11! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-14 12:54:55 UTC

@mergify mergify bot removed the has conflicts label Jul 3, 2021
Contributor

@tchaton tchaton left a comment


LGTM !

Comment on lines +47 to +49
"`smdistributed` module is not available."
" You would need to enable distributed=smdistributed"
" in the Sagemaker Estimator Object."
Member


Suggested change
"`smdistributed` module is not available."
" You would need to enable distributed=smdistributed"
" in the Sagemaker Estimator Object."
"`smdistributed` package is not available."
" You would need to enable `distributed=smdistributed` in the Sagemaker Estimator Object."

if not _SMDIST_AVAILABLE:
    raise MisconfigurationException(
        "`smdistributed` module is not available."
        " You would need to enable distributed=smdistributed"
Member


Do you mean adding it to DDPSMPlugin, or where?


def __init__(self):
    if not _SMDIST_AVAILABLE:
        raise MisconfigurationException("`smdistributed` module is not available.")
Member


Suggested change
raise MisconfigurationException("`smdistributed` module is not available.")
raise MisconfigurationException("`smdistributed` package is not available.")

Also add how to install it.

Comment on lines 112 to 115
log.info("-" * 100)
log.info(f"distributed_backend={self.distributed_backend}")
log.info(f"All DDP processes registered. Starting ddp with {self.world_size} processes")
log.info("-" * 100)
Member


Suggested change
log.info("-" * 100)
log.info(f"distributed_backend={self.distributed_backend}")
log.info(f"All DDP processes registered. Starting ddp with {self.world_size} processes")
log.info("-" * 100)
log.info(
    "-" * 100 + '\n',
    f"distributed_backend={self.distributed_backend}" + '\n',
    f"All DDP processes registered. Starting ddp with {self.world_size} processes" + '\n',
    "-" * 100,
)
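A side note on the suggestion above: `log.info` treats extra positional arguments as %-style formatting parameters, so passing several strings does not simply concatenate them. Joining them into a single message avoids that; a minimal sketch with hypothetical placeholder values standing in for the plugin's attributes:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

# Placeholder values standing in for the plugin's attributes.
distributed_backend = "smddp"
world_size = 8

# Build one message up front; extra positional args to log.info() would be
# interpreted as %-format parameters, not concatenated.
message = "\n".join([
    "-" * 100,
    f"distributed_backend={distributed_backend}",
    f"All DDP processes registered. Starting ddp with {world_size} processes",
    "-" * 100,
])
log.info(message)
```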

@kaushikb11 kaushikb11 marked this pull request as draft July 8, 2021 09:52
@kaushikb11
Contributor Author

Converted the PR to draft, because it seems certain functionalities have changed and are breaking on Sagemaker's side.

@Borda Borda modified the milestones: v1.4, v1.5 Jul 19, 2021

class SMLightningDistributed(LightningDistributed):

    def broadcast(self, obj: Any, group=sm_dist.group.WORLD):


sm_dist is conditionally imported, so this line fails when the package is not present, i.e. in cases where you train without it.

@devashish-khatwani

Is there an ETA for this feature to be available?

@awaelchli awaelchli modified the milestones: v1.5, v1.6 Nov 1, 2021
@carmocca carmocca removed this from the 1.6 milestone Mar 28, 2022
@lballes

lballes commented Aug 9, 2022

Are there any plans to pick this up again? It seems that Sagemaker DDP now functions as a backend for torch.distributed, so it might be simpler to integrate now?

@carmocca
Contributor

carmocca commented Aug 9, 2022

@lballes Thank you for the heads up! Then we support it automatically just with:

from pytorch_lightning.strategies import DDPStrategy

# sagemaker backend: https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html
import smdistributed.dataparallel.torch.torch_smddp
ddp = DDPStrategy(process_group_backend="smddp")

# Configure the strategy on the Trainer
trainer = Trainer(strategy=ddp, accelerator="gpu", devices=8)
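On the Estimator side, the `distributed=smdistributed` toggle mentioned earlier in the thread corresponds, per the AWS documentation, to the `distribution` argument. A hedged sketch of the dictionary shape, shown standalone rather than inside an `Estimator()` call, since constructing a real Estimator requires AWS credentials:

```python
# Sketch of the `distribution` argument passed to a SageMaker PyTorch
# Estimator to enable the smdistributed data-parallel backend. The dict
# shape follows the AWS docs; it is shown standalone here because a real
# Estimator() call needs AWS credentials and the sagemaker SDK.
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}
```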

edit: also, a blogpost! https://aws.amazon.com/blogs/machine-learning/run-pytorch-lightning-and-native-pytorch-ddp-on-amazon-sagemaker-training-featuring-amazon-search/

The sagemaker release has deprecated all the constructs in this PR, so we can close it.

Labels: distributed (Generic distributed-related topic), feature (Is an improvement or enhancement)