
[WIP] Replace grad norm. try to fix TPU dataloader issue #949

Closed
wants to merge 4 commits into from

Contributor

@srush srush commented Feb 26, 2020

Still playing around with TPU. I can now train one epoch fast, but when I start the second epoch it just ends with no error. Trying to figure out why that could be:

It seems to be something about this:

pytorch/xla#1191

Not sure if TPU dataloaders need to be specifically restarted?

@pep8speaks

Hello @srush! Thanks for opening this PR.

Line 179:13: W503 line break before binary operator
Line 193:13: W503 line break before binary operator
Line 194:13: E128 continuation line under-indented for visual indent
Line 194:13: W503 line break before binary operator

Do see the Hitchhiker's guide to code style

@williamFalcon
Contributor

@dlibenzi @mruberry any ideas?

@dlibenzi

> @dlibenzi @mruberry any ideas?

I've got no error, how can I have an idea? 😄
You are re-creating the ParallelLoader at every EPOCH, right?

@williamFalcon
Contributor

williamFalcon commented Feb 26, 2020

@dlibenzi nope! is that the issue?
Currently, we create it right before all the epochs.

https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/training_loop.py#L451-L455

@dlibenzi

Yep, that's the issue 😉
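
For readers following along, here is a minimal sketch of the failure mode diagnosed above. It assumes the loader wrapper drains its underlying iterator only once, which would explain a second epoch that ends with no error. `OneShotLoader` and `train` are hypothetical stand-ins written for this illustration; they are not torch_xla's `ParallelLoader` or Lightning's training loop.

```python
from typing import Iterable, Iterator, List


class OneShotLoader:
    """Hypothetical stand-in for a loader wrapper that drains its source once.

    This is NOT torch_xla's ParallelLoader. It only mimics the symptom
    discussed in this thread: after one full pass, iterating again yields
    nothing and raises no error.
    """

    def __init__(self, source: Iterable[int]) -> None:
        # The underlying iterator is created once, at construction time.
        self._iterator: Iterator[int] = iter(source)

    def __iter__(self) -> Iterator[int]:
        # Every call returns the same (possibly exhausted) iterator.
        return self._iterator


def train(dataset: List[int], epochs: int, recreate_each_epoch: bool) -> List[int]:
    """Return the number of batches seen in each epoch."""
    batches_per_epoch = []
    loader = OneShotLoader(dataset)  # created once, before all epochs
    for _ in range(epochs):
        if recreate_each_epoch:
            loader = OneShotLoader(dataset)  # fresh wrapper for this epoch
        batches_per_epoch.append(sum(1 for _ in loader))
    return batches_per_epoch


created_once = train([1, 2, 3], epochs=2, recreate_each_epoch=False)
recreated = train([1, 2, 3], epochs=2, recreate_each_epoch=True)
print(created_once)  # [3, 0]: the second epoch silently sees no batches
print(recreated)     # [3, 3]
```

In a real TPU loop the equivalent change would be to construct the parallel loader inside the epoch loop rather than before it; the exact call site in Lightning is the one linked above.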

@williamFalcon
Contributor

williamFalcon commented Feb 26, 2020

I think I found it #957

@williamFalcon
Contributor

williamFalcon commented Feb 26, 2020

Yes... fixed!

@srush make this PR only about the gradnorm?


@williamFalcon
Contributor

Check out the video of it working :)

https://twitter.com/PyTorchLightnin/status/1232813118507692033?s=20

@srush
Contributor Author

srush commented Feb 27, 2020

Awesome! Closing this guy. Switching to #959

@srush srush closed this Feb 27, 2020