
Support training multiple models in parallel on each TPU core. #1539

Closed
lezwon opened this issue Apr 20, 2020 · 2 comments · Fixed by #1729
Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on), priority: 0 (High priority task)

Comments

lezwon (Contributor) commented Apr 20, 2020

🚀 Feature

While training models with the K-Fold method, it would be beneficial to train each fold's model in parallel on a separate TPU core. There should be a feature that lets us assign a model's training process to a particular TPU core, similar to gpus=[0,2].

Motivation

I came across this kernel by Abhishek Thakur, wherein he trains multiple RoBERTa models in parallel, one on each TPU core. Training was fast because all cores were utilized. I tried doing the same with Lightning but realized I can't select a TPU core with it.

Pitch

I'm not entirely clear on the design. Maybe built-in support for the K-Fold method, wherein the dataset is split accordingly and a model is trained on each TPU core in a K-Fold manner. Or just the ability to select a core and train on it, e.g. tpus=[1].
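
A rough sketch of what the proposed API could look like, mirroring the existing gpus argument; the tpus list argument here is the proposal, not an existing Trainer option:

```python
import pytorch_lightning as pl

# Existing behaviour: select specific GPU devices by index.
trainer = pl.Trainer(gpus=[0, 2])

# Proposed behaviour (hypothetical): pin a Trainer to a single TPU core,
# so several Trainer instances can run in parallel, one model per core.
trainer_fold_0 = pl.Trainer(tpus=[1])
```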

Additional context

Kernel by Abhishek Thakur:
https://www.kaggle.com/abhishek/super-duper-fast-pytorch-tpu-kernel

lezwon added the feature (Is an improvement or enhancement) and help wanted (Open to be worked on) labels on Apr 20, 2020
williamFalcon (Contributor) commented

Great suggestion. Let's expand the functionality to take in that list as you mentioned.
Mind submitting a PR?

lezwon (Contributor, Author) commented May 4, 2020

@williamFalcon I have added a tpu_id argument to Trainer which lets the user pick a TPU core while training. I have created a POC notebook here: https://www.kaggle.com/lezwon/pytorch-lightning-parallel-tpu-tranining/

The K-Fold logic is implemented outside the Trainer object as of now. The models are trained in parallel, one on each TPU core. Do let me know if this approach is fine. :)
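
For reference, a minimal sketch of the approach described above: the K-Fold split is handled outside the Trainer and one fold is trained per core. The tpu_id argument mirrors the POC description, and MyLightningModel is a placeholder for the user's model; both are assumptions here, not a released API.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pytorch_lightning as pl
from sklearn.model_selection import KFold


def train_fold(fold_idx, train_idx, val_idx):
    # Placeholder: MyLightningModel stands in for a user-defined LightningModule
    # built from the fold-specific train/validation indices.
    model = MyLightningModel(train_idx=train_idx, val_idx=val_idx)
    # tpu_id is the argument described in this comment; its exact name,
    # signature, and core-indexing scheme are assumptions based on the POC.
    trainer = pl.Trainer(tpu_id=fold_idx + 1, max_epochs=5)
    trainer.fit(model)


# K-Fold logic lives outside the Trainer: one fold per TPU core (8 cores).
data = np.arange(1000)  # stand-in for the real dataset indices
kf = KFold(n_splits=8)
with ThreadPoolExecutor(max_workers=8) as pool:
    for fold_idx, (train_idx, val_idx) in enumerate(kf.split(data)):
        pool.submit(train_fold, fold_idx, train_idx, val_idx)
```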
