
Allow user to select individual TPU core to train on #1729

Merged

Conversation

@lezwon (Contributor) commented May 4, 2020

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #1539

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@williamFalcon (Contributor) commented:

@lezwon there shouldn't be a new flag, it should behave like gpus...

And I guess we need to remove the num_tpu_cores arg and replace with the tpu_cores?

Trainer(tpu_cores=1)
Trainer(tpu_cores=8)
Trainer(tpu_cores=[2])
Trainer(tpu_cores=[6])

@PyTorchLightning/core-contributors

@Borda (Member) commented May 5, 2020

> @lezwon there shouldn't be a new flag, it should behave like gpus...
> And I guess we need to remove the num_tpu_cores arg and replace with the tpu_cores?

Do we really need a new flag? Could it be solved by generalizing gpus, i.e. renaming gpus to something meaningful for both GPU and TPU (and eventually CPU, since we also have a distributed CPU backend)? What about just cores, since you cannot run on GPU and TPU at the same time?

@lezwon (Contributor, Author) commented May 5, 2020

> @lezwon there shouldn't be a new flag, it should behave like gpus...
>
> And I guess we need to remove the num_tpu_cores arg and replace with the tpu_cores?
>
> Trainer(tpu_cores=1)
> Trainer(tpu_cores=8)
> Trainer(tpu_cores=[2])
> Trainer(tpu_cores=[6])
>
> @PyTorchLightning/core-contributors

@williamFalcon I suppose something like Trainer(tpu_cores=[2]) would do, but Trainer(tpu_cores=[2, 3]) would not work, as that is not supported yet. Can we finalize the implementation taking into account the suggestions from @Borda? I like the idea of having just cores.

@williamFalcon (Contributor) commented May 5, 2020

I think we need to be explicit about GPUs and TPUs... let's keep:

Trainer(gpus=...)
Trainer(tpu_cores=...)

The reason is that a TPU has many cores, whereas a GPU is a single unit... TPU cores can also be 128, 1024, etc. if run on a pod.

@Borda Borda changed the title from "[WIP] Feature/1539 Allow user to select individual TPU core to train on" to "[WIP] Allow user to select individual TPU core to train on" May 5, 2020
@Borda Borda added the feature (Is an improvement or enhancement) label May 5, 2020
@@ -90,6 +90,7 @@ def __init__(
     gpus: Optional[Union[List[int], str, int]] = None,
     auto_select_gpus: bool = False,
     num_tpu_cores: Optional[int] = None,
+    tpu_id: Optional[int] = None,
Member:

So I can use only one TPU core? Not several selected by index, like GPUs?

Contributor Author:

Not as far as I know. xla_multiprocessing only supports 1 or 8 cores; we can't selectively choose the cores.
Ref: https://pytorch.org/xla/release/1.5/index.html#torch_xla.distributed.xla_multiprocessing.spawn
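
For context, a minimal sketch of that constraint (illustrative only; the training function below is an assumption, not this PR's code):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _train_fn(index):
    # Each spawned process receives its ordinal and picks up its own XLA device.
    device = xm.xla_device()
    print(f"process {index} running on {device}")

# nprocs may only be 1 or the maximum number of devices (8 on a single TPU);
# values in between are rejected by xmp.spawn.
xmp.spawn(_train_fn, args=(), nprocs=8)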

Contributor:

tpu_id is not needed...

@lezwon (Contributor, Author) commented May 7, 2020:

@williamFalcon I have replaced it with tpu_cores as you suggested. Valid values are 1, 8, or a single-element list [<core index between 1 and max_cores>].
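
A minimal sketch of how such a tpu_cores value could be validated (illustrative only; this helper and its names are assumptions, not the PR's actual implementation):

from typing import List, Optional, Tuple, Union

MAX_TPU_CORES = 8  # assumption: a single (non-pod) TPU device

def parse_tpu_cores(tpu_cores: Optional[Union[int, List[int]]]) -> Tuple[Optional[int], Optional[int]]:
    """Return (num_cores, core_index); core_index is set only when a list is given."""
    if tpu_cores is None:
        return None, None
    if isinstance(tpu_cores, int):
        if tpu_cores not in (1, MAX_TPU_CORES):
            raise ValueError("tpu_cores given as an int must be 1 or 8")
        return tpu_cores, None
    if isinstance(tpu_cores, list):
        if len(tpu_cores) != 1:
            raise ValueError("selecting multiple specific TPU cores is not supported")
        return 1, tpu_cores[0]
    raise TypeError("tpu_cores must be an int, a one-element list of int, or None")

# e.g. parse_tpu_cores(8) -> (8, None), parse_tpu_cores([5]) -> (1, 5)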

@lezwon lezwon force-pushed the feature/1539_tpu_train_parallel branch from 743bc4a to c0a4f9d Compare May 6, 2020 16:03
@lezwon lezwon marked this pull request as ready for review May 7, 2020 01:55
mergify bot commented May 7, 2020: This pull request is now in conflict... :(

@lezwon lezwon changed the title from "[WIP] Allow user to select individual TPU core to train on" to "Allow user to select individual TPU core to train on" May 7, 2020
codecov bot commented May 7, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@692f302).
The diff coverage is 75%.

@@           Coverage Diff            @@
##             master   #1729   +/-   ##
========================================
  Coverage          ?     88%           
========================================
  Files             ?      69           
  Lines             ?    4163           
  Branches          ?       0           
========================================
  Hits              ?    3674           
  Misses            ?     489           
  Partials          ?       0           

mergify bot commented May 9, 2020: This pull request is now in conflict... :(

@lezwon (Contributor, Author) commented May 9, 2020

@williamFalcon @Borda I need a review on this PR.

@@ -498,7 +499,7 @@ def single_gpu_train(self, model):

     def tpu_train(self, tpu_core_idx, model):
         # put model on tpu
-        model.to(xm.xla_device())
+        model.to(xm.xla_device(self.tpu_id))
Contributor:

I think this now makes it ONLY possible to train on one core, no? Not multiple cores.

Member:

I think so... @lezwon ^^

Contributor Author:

I have noticed that when self.tpu_id is None and I use xmp.spawn, the model trains at the same speed as when all cores are being used, so I assumed all cores are in use. I could add some logging to confirm, or just add a conditional around xm.xla_device() maybe?

Contributor:

ONLY when the user requests a specific TPU index should we use model.to(xm.xla_device(self.tpu_id)); otherwise, leave it as it was.

@Borda we need TPU tests to make sure this PR doesn't break functionality
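
A minimal sketch of that conditional (illustrative only; it mirrors the tpu_train method shown in the diff above and assumes the self.tpu_id attribute introduced by this PR):

import torch_xla.core.xla_model as xm

def tpu_train(self, tpu_core_idx, model):
    # Pin the model to the requested core only when one was given;
    # otherwise let xm.xla_device() pick the process's own device.
    if self.tpu_id is not None:
        device = xm.xla_device(self.tpu_id)
    else:
        device = xm.xla_device()
    model.to(device)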

@lezwon lezwon marked this pull request as ready for review May 14, 2020 06:40
@lezwon lezwon changed the title from "[WIP] Allow user to select individual TPU core to train on" to "Allow user to select individual TPU core to train on" May 14, 2020
@lezwon lezwon changed the title from "Allow user to select individual TPU core to train on" to "[WIP] Allow user to select individual TPU core to train on" May 14, 2020
@lezwon lezwon marked this pull request as draft May 14, 2020 06:54
@lezwon lezwon marked this pull request as ready for review May 15, 2020 14:41
@lezwon lezwon changed the title from "[WIP] Allow user to select individual TPU core to train on" to "Allow user to select individual TPU core to train on" May 15, 2020
mergify bot commented May 15, 2020: This pull request is now in conflict... :(

@lezwon (Contributor, Author) commented May 16, 2020

@Borda I have made the requested changes. Need your review on it :]

@williamFalcon (Contributor) commented:

@Borda @lezwon this is great! let's merge?

@williamFalcon williamFalcon added the ready (PRs ready to be merged) label May 17, 2020
@lezwon (Contributor, Author) commented May 17, 2020

@williamFalcon sure.. Let's do it 👍

@Borda (Member) commented May 17, 2020

give me a sec to check it...

tests/test_deprecated.py (review thread resolved)
@@ -189,7 +189,10 @@ def __init__(
         GPUs are configured to be in "exclusive mode", such
         that only one process at a time can access them.

-        num_tpu_cores: How many TPU cores to train on (1 or 8).
+        tpu_cores: How many TPU cores to train on (1 or 8) / Single TPU to train on [1]
Member:

just 1 or 8, nothing in between?

Contributor Author:

Yes. The nprocs argument for xmp.spawn supports either 1 or the maximum number of devices.
Source: https://pytorch.org/xla/release/1.5/index.html#torch_xla.distributed.xla_multiprocessing.spawn
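
For illustration, the accepted forms and their intended behaviour (a sketch based on this discussion, not on the final documentation):

from pytorch_lightning import Trainer

# Train on a single core (chosen by the XLA runtime).
trainer = Trainer(tpu_cores=1)

# Train on all eight cores of the TPU.
trainer = Trainer(tpu_cores=8)

# Train on one specific core, here core 5, by passing a one-element list.
trainer = Trainer(tpu_cores=[5])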


# Backward compatibility: fall back to the deprecated num_tpu_cores argument.
if tpu_cores is None:
    tpu_cores = num_tpu_cores
self.on_tpu = tpu_cores is not None
Member:

This is not directly related to this PR, but it is not clear what will happen if a user sets gpus and tpu_cores at the same time.

Contributor Author:

I will check this out and provide an update :]

Contributor Author:

@Borda I ran this on Kaggle:
trainer = pl.Trainer(tpu_cores=[1], gpus=[2], precision=16, max_epochs=20)
It threw an exception:

MisconfigurationException: 
                You requested GPUs: [2]
                But your machine only has: []

Member:

I know, this was not about this PR specifically, but more of a conceptual question for us on how to handle these configurations...
What would be your expected behaviour as a user if you set both gpus and tpu_cores?

  • take whichever of them is available
  • which one has higher priority (i.e. gets used) if both are available

pytorch_lightning/trainer/trainer.py (review thread resolved)
@williamFalcon williamFalcon merged commit 7c7e50c into Lightning-AI:master May 17, 2020
Borda added a commit that referenced this pull request May 17, 2020
Borda added a commit that referenced this pull request May 25, 2020
Borda added a commit that referenced this pull request May 25, 2020
Borda added a commit that referenced this pull request May 28, 2020
williamFalcon pushed a commit that referenced this pull request May 31, 2020
* fix chlog

* test for #1729

* hist

* update

* Document use case of passing test dataloaders to Trainer.test() (#1992)

* Issue 1990 Doc patch.

* Codeblock directive.

* Update to reflect current state of pytorch-lightning

* Final grammar cleaning. I hope these commits are squashed.

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: authman <uapatira@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
justusschock pushed a commit that referenced this pull request Jun 29, 2020 (same commit message as above)
Labels
feature (Is an improvement or enhancement), ready (PRs ready to be merged), waiting on author (Waiting on user action, correction, or update)
Development

Successfully merging this pull request may close these issues.

Support training multiple models in parallel on each TPU core.
4 participants