
Fix amp tests #661

Merged Jan 5, 2020 (3 commits)
Conversation

@neggert (Contributor) commented Jan 3, 2020

A few changes to fix AMP-related problems in our tests:

  • With opt_level="O1" (the default), AMP patches many torch functions, which breaks any tests that run afterwards. This patch introduces a pytest extension that lets tests be marked with @pytest.mark.spawn, so they run in their own process via torch.multiprocessing.spawn and the main Python interpreter stays unpatched.

  • Since AMP defaults to O1 now, DP tests no longer throw exceptions. Remove the pytest.raises that expects the exception.

  • Since AMP patches torch functions, CPU inference no longer works.
    Skip prediction step for AMP tests.

Note that there are still a couple of unrelated test failures that will be addressed in another PR.
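The isolation idea in the first bullet can be illustrated with the standard library's multiprocessing module (a minimal stand-in for torch.multiprocessing; this is an illustration of the technique, not the PR's actual plugin code): a function that monkey-patches a module runs in a child process, so the patch never reaches the parent interpreter.

```python
import math
import multiprocessing as mp

def patch_and_eval(q):
    # Simulate AMP-style patching: replace a function in the child process only.
    math.sqrt = lambda x: -1.0
    q.put(math.sqrt(4))

# "fork" keeps this demo simple on POSIX; the PR itself uses
# torch.multiprocessing.spawn, which starts a fresh interpreter.
ctx = mp.get_context("fork")
q = ctx.Queue()
p = ctx.Process(target=patch_and_eval, args=(q,))
p.start()
child_result = q.get()
p.join()
print(child_result)    # -1.0: the child saw the patched function
print(math.sqrt(4))    # 2.0: the parent interpreter stays unpatched
```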

Commit messages:

1. With opt_level="O1" (the default), AMP patches many torch functions, which breaks any tests that run afterwards. This patch introduces a pytest extension that lets tests be marked with @pytest.mark.spawn, so they run in their own process via torch.multiprocessing.spawn and the main Python interpreter stays unpatched. Note that tests using DDP already run AMP in its own process, so they don't need this annotation.

2. Since AMP defaults to O1 now, DP tests no longer throw exceptions. Remove the pytest.raises that expects the exception.

3. Since AMP patches torch functions, CPU inference no longer works. Skip the prediction step for AMP tests.
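For reference, a conftest.py hook implementing a marker like this could look roughly as follows. This is a hypothetical sketch, not the PR's actual code: it uses pytest's pytest_pyfunc_call hook and plain multiprocessing as a stand-in for torch.multiprocessing.spawn, and the fixture handling (pyfuncitem._fixtureinfo is a private pytest attribute) is an assumption. Fixture values must also be picklable for a spawned process to receive them.

```python
# conftest.py (hypothetical sketch): run tests marked @pytest.mark.spawn
# in their own process so patches cannot leak into the main interpreter.
import multiprocessing
import pytest

def pytest_configure(config):
    # Register the marker so pytest does not warn about it.
    config.addinivalue_line(
        "markers", "spawn: run this test in its own process")

@pytest.hookimpl(tryfirst=True)
def pytest_pyfunc_call(pyfuncitem):
    if pyfuncitem.get_closest_marker("spawn") is None:
        return None  # not marked: fall through to pytest's normal call
    ctx = multiprocessing.get_context("spawn")  # fresh, unpatched interpreter
    # Forward the resolved fixture values as keyword arguments (assumption:
    # they are picklable; _fixtureinfo is a private pytest attribute).
    testargs = {name: pyfuncitem.funcargs[name]
                for name in pyfuncitem._fixtureinfo.argnames}
    proc = ctx.Process(target=pyfuncitem.obj, kwargs=testargs)
    proc.start()
    proc.join()
    if proc.exitcode != 0:
        pytest.fail(f"spawned test failed with exit code {proc.exitcode}")
    return True  # tell pytest the test call has been handled
```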
@neggert (Contributor, Author) commented Jan 3, 2020

Travis failure appears to be due to this unrelated issue: pytorch/vision#1718

@jeffling (Contributor) left a comment

Awesome stuff :)

@Borda (Member) left a comment

Nice job 🚀

@@ -124,26 +128,6 @@ def test_amp_gpu_ddp_slurm_managed(tmpdir):
assert trainer.resolve_root_node_address('abc[23-24]') == 'abc23'
assert trainer.resolve_root_node_address('abc[23-24, 45-40, 40]') == 'abc23'

# test model loading with a map_location
Member commented on this diff:

maybe just comment it?

@Borda (Member) commented Jan 4, 2020

@neggert (Contributor, Author) commented Jan 4, 2020 via email

@Borda Borda mentioned this pull request Jan 5, 2020
@williamFalcon williamFalcon merged commit 019f612 into Lightning-AI:master Jan 5, 2020
@hugovk commented Jan 16, 2020

torchvision v0.5.0 has been released with the fix:

  1. Require torchvision>=0.5.0
  2. If Pillow was temporarily pinned, remove the pin
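In dependency-file terms, those two steps amount to something like the following (a sketch; the actual file name and any Pillow pin in the repository may differ):

```
# requirements.txt (sketch)
torchvision>=0.5.0
# ...and delete any temporary pin such as: Pillow<7.0
```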

@Borda (Member) commented Jan 16, 2020

but torchvision>=0.5.0 has some JIT issues... #694
