[WIP] Replace grad norm. try to fix TPU dataloader issue #949

srush · 2020-02-26T03:35:53Z

Still playing around with TPU. I can now train one epoch fast, but when the second epoch starts, training just ends with no error. Trying to figure out why that could be:

It seems to be something about this:

pytorch/xla#1191

Not sure if TPU dataloaders need to be specifically restarted?

pep8speaks · 2020-02-26T03:35:59Z

Hello @srush! Thanks for opening this PR.

In the file pytorch_lightning/trainer/data_loading.py:

Line 179:13: W503 line break before binary operator
Line 193:13: W503 line break before binary operator
Line 194:13: E128 continuation line under-indented for visual indent
Line 194:13: W503 line break before binary operator

Do see the Hitchhiker's guide to code style

williamFalcon · 2020-02-26T11:54:34Z

@dlibenzi @mruberry any ideas?

dlibenzi · 2020-02-26T16:45:17Z

> @dlibenzi @mruberry any ideas?

I've got no error, how can I have an idea? 😄
You are re-creating the ParallelLoader at every EPOCH, right?

williamFalcon · 2020-02-26T17:47:34Z

@dlibenzi nope! is that the issue?
Currently, we create it right before all the epochs.

https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/training_loop.py#L451-L455

dlibenzi · 2020-02-26T17:48:27Z

> @dlibenzi nope! is that the issue?
> Currently, we create it right before all the epochs.
>
> https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/training_loop.py#L451-L455

Yep, that's the issue 😉
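
For readers following along, here is a minimal pure-Python sketch of why wrapping the dataloader once, before all the epochs, could make the second epoch end silently. It assumes the wrapper behaves like a single-pass iterator, which matches the symptom reported above. `OneShotLoader` and both `run_epochs_*` functions are hypothetical stand-ins written for this illustration. They are not the torch_xla `ParallelLoader` API and not the Lightning training loop.

```python
from typing import Iterable, Iterator, List


class OneShotLoader:
    """Hypothetical stand-in for a wrapper that can be consumed only once."""

    def __init__(self, data: Iterable[int]) -> None:
        # The underlying iterator is created once, at construction time.
        self._it: Iterator[int] = iter(list(data))

    def __iter__(self) -> Iterator[int]:
        # Every `for` loop receives the same, possibly exhausted, iterator.
        return self._it


def run_epochs_shared(data: List[int], epochs: int) -> List[int]:
    """Wrap once before the epoch loop; return batches seen per epoch."""
    loader = OneShotLoader(data)
    return [sum(1 for _ in loader) for _ in range(epochs)]


def run_epochs_recreated(data: List[int], epochs: int) -> List[int]:
    """Re-create the wrapper at the start of every epoch."""
    counts = []
    for _ in range(epochs):
        loader = OneShotLoader(data)  # fresh wrapper for this epoch
        counts.append(sum(1 for _ in loader))
    return counts


if __name__ == "__main__":
    print(run_epochs_shared([1, 2, 3], epochs=2))     # second epoch sees no batches
    print(run_epochs_recreated([1, 2, 3], epochs=2))  # every epoch sees all batches
```

In the shared variant the second epoch iterates over an exhausted iterator, so the loop body never runs and nothing raises. That would look like training ending with no error.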

williamFalcon · 2020-02-26T21:55:55Z

I think I found it #957

williamFalcon · 2020-02-26T22:52:47Z

Yes... fixed!

@srush make this PR only about the gradnorm?

williamFalcon · 2020-02-26T23:48:10Z

Check out the video of it working :)

https://twitter.com/PyTorchLightnin/status/1232813118507692033?s=20

srush · 2020-02-27T03:44:12Z

Awesome! Closing this guy. Switching to #959

Sasha added 4 commits February 25, 2020 21:42

.

a07f3c3

.

272a574

.

ef9c86c

.

cf20be9

srush closed this Feb 27, 2020
