
Support training multiple models in parallel on each TPU core. #1539

Closed
lezwon opened this issue Apr 20, 2020 · 2 comments · Fixed by #1729
Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on), priority: 0 (High priority task)

Comments

lezwon (Contributor) commented Apr 20, 2020

🚀 Feature

While training models with the K-Fold method, it would be beneficial to train each fold's model in parallel on a separate TPU core. There should be a feature that lets us assign a model's training process to a particular TPU core, similar to gpus=[0,2].

Motivation

I came across this kernel by Abhishek Thakur, wherein he trains multiple RoBERTa models in parallel, one on each TPU core. Training was fast because all cores were utilized. I tried doing the same with Lightning but realized I can't select a TPU core with it.

Pitch

I'm not entirely clear on the design. Maybe built-in support for the K-Fold method, wherein the dataset is split accordingly and a model is trained on each TPU core in a K-Fold manner. Or just the ability to select a core and train on it, e.g. tpus=[1].
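
A rough sketch of what the proposed API could look like, mirroring the existing gpus argument; the tpus list argument here is the proposal, not an existing Trainer option:

```python
import pytorch_lightning as pl

# Existing behaviour: select specific GPU devices by index.
trainer = pl.Trainer(gpus=[0, 2])

# Proposed behaviour (hypothetical): pin a Trainer to a single TPU core,
# so several Trainer instances can run in parallel, one model per core.
trainer_fold_0 = pl.Trainer(tpus=[1])
```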

Additional context

Kernel by Abhishek Thakur:
https://www.kaggle.com/abhishek/super-duper-fast-pytorch-tpu-kernel

lezwon added the feature (Is an improvement or enhancement) and help wanted (Open to be worked on) labels on Apr 20, 2020
williamFalcon (Contributor) commented

Great suggestion. Let's expand the functionality to take in that list as you mentioned.
Mind submitting a PR?

lezwon (Contributor, Author) commented May 4, 2020

@williamFalcon I have added a tpu_id argument to Trainer which lets the user pick a TPU core while training. I have created a POC notebook here: https://www.kaggle.com/lezwon/pytorch-lightning-parallel-tpu-tranining/

The K-Fold logic is implemented outside the Trainer object as of now. The models are trained in parallel, one on each TPU core. Do let me know if this approach is fine. :)
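
For reference, a minimal sketch of the approach described above: the K-Fold split is handled outside the Trainer and one fold is trained per core. The tpu_id argument mirrors the POC description, and MyLightningModel is a placeholder for the user's model; both are assumptions here, not a released API.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pytorch_lightning as pl
from sklearn.model_selection import KFold


def train_fold(fold_idx, train_idx, val_idx):
    # Placeholder: MyLightningModel stands in for a user-defined LightningModule
    # built from the fold-specific train/validation indices.
    model = MyLightningModel(train_idx=train_idx, val_idx=val_idx)
    # tpu_id is the argument described in this comment; its exact name,
    # signature, and core-indexing scheme are assumptions based on the POC.
    trainer = pl.Trainer(tpu_id=fold_idx + 1, max_epochs=5)
    trainer.fit(model)


# K-Fold logic lives outside the Trainer: one fold per TPU core (8 cores).
data = np.arange(1000)  # stand-in for the real dataset indices
kf = KFold(n_splits=8)
with ThreadPoolExecutor(max_workers=8) as pool:
    for fold_idx, (train_idx, val_idx) in enumerate(kf.split(data)):
        pool.submit(train_fold, fold_idx, train_idx, val_idx)
```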
