
Allow user to select individual TPU core to train on #1729

Merged

Conversation

@lezwon (Contributor) commented May 4, 2020

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #1539

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@williamFalcon (Contributor) commented:

@lezwon there shouldn't be a new flag, it should behave like gpus...

And I guess we need to remove the num_tpu_cores arg and replace with the tpu_cores?

Trainer(tpu_cores=1)
Trainer(tpu_cores=8)
Trainer(tpu_cores=[2])
Trainer(tpu_cores=[6])

@PyTorchLightning/core-contributors

@Borda (Member) commented May 5, 2020

> @lezwon there shouldn't be a new flag, it should behave like gpus...
> And I guess we need to remove the num_tpu_cores arg and replace with the tpu_cores?

Do we really need a new flag? Could it be solved by generalizing gpus, i.e. renaming gpus to something meaningful for both GPU and TPU (and eventually CPU, since we also have a distributed CPU backend)? What about just cores, since you cannot run on GPU and TPU at the same time?

@lezwon (Contributor, Author) commented May 5, 2020

> @lezwon there shouldn't be a new flag, it should behave like gpus...
>
> And I guess we need to remove the num_tpu_cores arg and replace with the tpu_cores?
>
> Trainer(tpu_cores=1)
> Trainer(tpu_cores=8)
> Trainer(tpu_cores=[2])
> Trainer(tpu_cores=[6])
>
> @PyTorchLightning/core-contributors

@williamFalcon I suppose something like Trainer(tpu_cores=[2]) would do, but Trainer(tpu_cores=[2, 3]) would not work, as that is not supported yet. Can we finalize the implementation taking into account the suggestions from @Borda? I like the idea of having just cores.

@williamFalcon (Contributor) commented May 5, 2020

I think we need to be explicit about GPUs and TPUs... let's keep:

Trainer(gpus=...)
Trainer(tpu_cores=...)

The reason is that a TPU has many cores, whereas a GPU is a single unit... TPU cores can also be 128, 1024, etc. if run on a pod.

@Borda Borda changed the title from "[WIP] Feature/1539 Allow user to select individual TPU core to train on" to "[WIP] Allow user to select individual TPU core to train on" May 5, 2020
@Borda Borda added the feature (Is an improvement or enhancement) label May 5, 2020
@@ -90,6 +90,7 @@ def __init__(
     gpus: Optional[Union[List[int], str, int]] = None,
     auto_select_gpus: bool = False,
     num_tpu_cores: Optional[int] = None,
+    tpu_id: Optional[int] = None,
Member:

So I can use only one TPU core? Not several selected by index, like GPUs?

Contributor Author:

Not as far as I know. xla_multiprocessing only supports 1 or 8 cores; we can't selectively choose the cores.
Ref: https://pytorch.org/xla/release/1.5/index.html#torch_xla.distributed.xla_multiprocessing.spawn
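
For context, a minimal sketch of that constraint (illustrative only; the training function below is an assumption, not this PR's code):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _train_fn(index):
    # Each spawned process receives its ordinal and picks up its own XLA device.
    device = xm.xla_device()
    print(f"process {index} running on {device}")

# nprocs may only be 1 or the maximum number of devices (8 on a single TPU);
# values in between are rejected by xmp.spawn.
xmp.spawn(_train_fn, args=(), nprocs=8)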

Contributor:

tpu_id is not needed...

@lezwon (Contributor, Author) commented May 7, 2020:

@williamFalcon I have replaced it with tpu_cores as you suggested. Valid values are 1, 8, or a single-element list [<core index between 1 and max_cores>].
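
A minimal sketch of how such a tpu_cores value could be validated (illustrative only; this helper and its names are assumptions, not the PR's actual implementation):

from typing import List, Optional, Tuple, Union

MAX_TPU_CORES = 8  # assumption: a single (non-pod) TPU device

def parse_tpu_cores(tpu_cores: Optional[Union[int, List[int]]]) -> Tuple[Optional[int], Optional[int]]:
    """Return (num_cores, core_index); core_index is set only when a list is given."""
    if tpu_cores is None:
        return None, None
    if isinstance(tpu_cores, int):
        if tpu_cores not in (1, MAX_TPU_CORES):
            raise ValueError("tpu_cores given as an int must be 1 or 8")
        return tpu_cores, None
    if isinstance(tpu_cores, list):
        if len(tpu_cores) != 1:
            raise ValueError("selecting multiple specific TPU cores is not supported")
        return 1, tpu_cores[0]
    raise TypeError("tpu_cores must be an int, a one-element list of int, or None")

# e.g. parse_tpu_cores(8) -> (8, None), parse_tpu_cores([5]) -> (1, 5)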

@lezwon lezwon force-pushed the feature/1539_tpu_train_parallel branch from 743bc4a to c0a4f9d Compare May 6, 2020 16:03
@lezwon lezwon marked this pull request as ready for review May 7, 2020 01:55
mergify bot commented May 7, 2020: This pull request is now in conflict... :(

@lezwon lezwon changed the title from "[WIP] Allow user to select individual TPU core to train on" to "Allow user to select individual TPU core to train on" May 7, 2020
codecov bot commented May 7, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@692f302).
The diff coverage is 75%.

@@           Coverage Diff            @@
##             master   #1729   +/-   ##
========================================
  Coverage          ?     88%           
========================================
  Files             ?      69           
  Lines             ?    4163           
  Branches          ?       0           
========================================
  Hits              ?    3674           
  Misses            ?     489           
  Partials          ?       0           

mergify bot commented May 9, 2020: This pull request is now in conflict... :(

@lezwon (Contributor, Author) commented May 9, 2020

@williamFalcon @Borda I need a review on this PR.

@@ -498,7 +499,7 @@ def single_gpu_train(self, model):

     def tpu_train(self, tpu_core_idx, model):
         # put model on tpu
-        model.to(xm.xla_device())
+        model.to(xm.xla_device(self.tpu_id))
Contributor:

I think this now makes it ONLY possible to train on one core, no? Not multiple cores.

Member:

I think so... @lezwon ^^

Contributor Author:

I have noticed that when self.tpu_id is None and I use xmp.spawn, the model trains at the same speed as when all cores are being used, so I assumed all cores are in use. I could add some logging to confirm, or just add a conditional around xm.xla_device() maybe?

Contributor:

ONLY when the user requests a specific TPU index should we use model.to(xm.xla_device(self.tpu_id)); otherwise, leave it as it was.

@Borda we need TPU tests to make sure this PR doesn't break functionality
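
A minimal sketch of that conditional (illustrative only; it mirrors the tpu_train method shown in the diff above and assumes the self.tpu_id attribute introduced by this PR):

import torch_xla.core.xla_model as xm

def tpu_train(self, tpu_core_idx, model):
    # Pin the model to the requested core only when one was given;
    # otherwise let xm.xla_device() pick the process's own device.
    if self.tpu_id is not None:
        device = xm.xla_device(self.tpu_id)
    else:
        device = xm.xla_device()
    model.to(device)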

@lezwon lezwon marked this pull request as ready for review May 14, 2020 06:40
@lezwon lezwon changed the title from "[WIP] Allow user to select individual TPU core to train on" to "Allow user to select individual TPU core to train on" May 14, 2020
@lezwon lezwon changed the title from "Allow user to select individual TPU core to train on" to "[WIP] Allow user to select individual TPU core to train on" May 14, 2020
@lezwon lezwon marked this pull request as draft May 14, 2020 06:54
@lezwon lezwon marked this pull request as ready for review May 15, 2020 14:41
@lezwon lezwon changed the title from "[WIP] Allow user to select individual TPU core to train on" to "Allow user to select individual TPU core to train on" May 15, 2020
mergify bot commented May 15, 2020: This pull request is now in conflict... :(

@lezwon (Contributor, Author) commented May 16, 2020

@Borda I have made the requested changes. Need your review on it :]

@williamFalcon (Contributor) commented:

@Borda @lezwon this is great! let's merge?

@williamFalcon williamFalcon added the ready (PRs ready to be merged) label May 17, 2020
@lezwon (Contributor, Author) commented May 17, 2020

@williamFalcon sure.. Let's do it 👍

@Borda (Member) commented May 17, 2020

give me a sec to check it...

tests/test_deprecated.py (review thread resolved)
@@ -189,7 +189,10 @@ def __init__(
         GPUs are configured to be in "exclusive mode", such
         that only one process at a time can access them.

-        num_tpu_cores: How many TPU cores to train on (1 or 8).
+        tpu_cores: How many TPU cores to train on (1 or 8) / Single TPU to train on [1]
Member:

just 1 or 8, nothing in between?

Contributor Author:

Yes. The nprocs argument for xmp.spawn supports either 1 or the maximum number of devices.
Source: https://pytorch.org/xla/release/1.5/index.html#torch_xla.distributed.xla_multiprocessing.spawn
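
For illustration, the accepted forms and their intended behaviour (a sketch based on this discussion, not on the final documentation):

from pytorch_lightning import Trainer

# Train on a single core (chosen by the XLA runtime).
trainer = Trainer(tpu_cores=1)

# Train on all eight cores of the TPU.
trainer = Trainer(tpu_cores=8)

# Train on one specific core, here core 5, by passing a one-element list.
trainer = Trainer(tpu_cores=[5])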


# Backward compatibility: fall back to the deprecated num_tpu_cores argument.
if tpu_cores is None:
    tpu_cores = num_tpu_cores
self.on_tpu = tpu_cores is not None
Member:

This is not directly related to this PR, but it is not clear what will happen if a user sets gpus and tpu_cores at the same time.

Contributor Author:

I will check this out and provide an update :]

Contributor Author:

@Borda I ran this on Kaggle:
trainer = pl.Trainer(tpu_cores=[1], gpus=[2], precision=16, max_epochs=20)
It threw an exception:

MisconfigurationException: 
                You requested GPUs: [2]
                But your machine only has: []

Member:

I know, this was not about this PR specifically, but more of a conceptual question for us on how to handle these configurations...
What would be your expected behaviour as a user if you set both gpus and tpu_cores?

  • take whichever of them is available
  • which one has higher priority (i.e. gets used) if both are available

pytorch_lightning/trainer/trainer.py (review thread resolved)
@williamFalcon williamFalcon merged commit 7c7e50c into Lightning-AI:master May 17, 2020
Borda added a commit that referenced this pull request May 17, 2020
Borda added a commit that referenced this pull request May 25, 2020
Borda added a commit that referenced this pull request May 25, 2020
Borda added a commit that referenced this pull request May 28, 2020
williamFalcon pushed a commit that referenced this pull request May 31, 2020
* fix chlog

* test for #1729

* hist

* update

* Document use case of passing test dataloaders to Trainer.test() (#1992)

* Issue 1990 Doc patch.

* Codeblock directive.

* Update to reflect current state of pytorch-lightning

* Final grammar cleaning. I hope these commits are squashed.

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: authman <uapatira@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
justusschock pushed a commit that referenced this pull request Jun 29, 2020 (same commit message as above)
Labels
feature (Is an improvement or enhancement), ready (PRs ready to be merged), waiting on author (Waiting on user action, correction, or update)
Development

Successfully merging this pull request may close these issues.

Support training multiple models in parallel on each TPU core.
4 participants