PoC: Accelerator refactor #5743

Merged: 314 commits (Feb 12, 2021)
Changes from 179 commits
Commits (314)
259c7f7
restoring the result from subprocess
awaelchli Dec 22, 2020
dfab52a
fix queue.get() order for results
awaelchli Dec 22, 2020
6742488
add missing "block_backward_sync" context manager
awaelchli Dec 22, 2020
8c89932
add missing "block_backward_sync" context manager
awaelchli Dec 22, 2020
0186a0f
fix sync_batchnorm
awaelchli Dec 22, 2020
b2ac1f4
fix supported gpu-ids for tuple
awaelchli Dec 22, 2020
07a41ce
fix clip gradients and inf recursion
awaelchli Dec 22, 2020
63b7eaf
accelerator selection: added cluster_environment plugin
awaelchli Dec 23, 2020
f8344c5
fix torchelastic test
awaelchli Dec 23, 2020
34e3c15
fix reduce early stopping decision for DDP
awaelchli Dec 24, 2020
27a4cff
fix tests: callbacks, conversion to lightning optimizer
awaelchli Dec 24, 2020
df5ac30
fix lightning optimizer does not pickle
awaelchli Dec 24, 2020
dcf917a
fix setting benchmark and deterministic option
awaelchli Dec 24, 2020
272f088
fix slurm amp test
awaelchli Dec 24, 2020
4529476
fix prepare_data test and determine node_rank
awaelchli Dec 27, 2020
5319b0f
fix retrieving last path when testing
awaelchli Dec 27, 2020
3b54cfb
remove obsolete plugin argument
awaelchli Dec 27, 2020
6540b87
fix test: test_trainer_config
awaelchli Dec 27, 2020
6b450e1
fix torchscript tests
awaelchli Dec 27, 2020
4ef539f
fix trainer.model access
awaelchli Dec 27, 2020
1001ccf
move properties
awaelchli Dec 27, 2020
38a1d0f
fix test_transfer_batch_hook
awaelchli Dec 27, 2020
46cf7ef
fix auto_select_gpus
awaelchli Dec 27, 2020
258f50e
fix omegaconf test
awaelchli Dec 27, 2020
a5d69b9
fix test that needs to simulate slurm ddp
awaelchli Dec 27, 2020
88a7ed5
add horovod plugin
awaelchli Dec 29, 2020
40daa41
fix test with named arguments
awaelchli Dec 29, 2020
96fc074
clean up whitespace
awaelchli Dec 29, 2020
210831a
fix datamodules test
awaelchli Dec 29, 2020
98b6dd4
remove old accelerators
justusschock Jan 6, 2021
dfcbba6
fix naming
justusschock Jan 6, 2021
348a1b0
move old plugins
justusschock Jan 6, 2021
14f2f6e
move to plugins
justusschock Jan 6, 2021
2f779c6
create precision subpackage
justusschock Jan 6, 2021
58536f6
create training_type subpackage
justusschock Jan 6, 2021
ee53c90
fix all new import errors
awaelchli Jan 7, 2021
894e604
fix wrong arguments order passed to test
awaelchli Jan 7, 2021
2bdc836
fix LR finder
awaelchli Jan 10, 2021
48b9882
Added sharded training type and amp plugin
Jan 11, 2021
38452b6
Move clip grad to precision plugin
Jan 11, 2021
173b22c
Added sharded spawn, select accelerators based on distributed_backend…
Jan 12, 2021
79803f6
Fix import issue, attempting to fix tests
Jan 12, 2021
a7c0d8f
Fix initial test
Jan 12, 2021
02df0ad
Reflect hook logic from master, should wrap model after move to device
Jan 14, 2021
d0ebcba
Optional state consolidation, since master has optimizers not wrapped
justusschock Jan 22, 2021
319c3e8
change attribute for instance test
justusschock Jan 22, 2021
a34cd15
reset optimizers
justusschock Jan 22, 2021
c95b06a
legacy
Borda Jan 22, 2021
9ff0c64
imports in accel
Borda Jan 22, 2021
67d4e47
legacy2
Borda Jan 22, 2021
577b00d
trainer imports
Borda Jan 22, 2021
aa4858b
fix import errors after rebase
awaelchli Jan 25, 2021
f81a44f
move hook to new setup location
awaelchli Jan 25, 2021
a285665
provide unwrapping logic
awaelchli Jan 25, 2021
bf78d70
fix trainer callback system
awaelchli Jan 25, 2021
34947cf
added ddp2 implementation
awaelchli Jan 25, 2021
49bec53
fix imports .legacy
Borda Jan 25, 2021
ba1c986
move plugins
Borda Jan 25, 2021
45dfbb7
restore legacy
Borda Jan 25, 2021
9b7326a
drop test.py from root
Borda Jan 25, 2021
96bc05d
add tpu accelerator and plugins
justusschock Jan 26, 2021
c5994e5
Merge branch 'release/1.2-dev' into accelerator-refactor-sharted-4
awaelchli Jan 30, 2021
9e46624
fixes
awaelchli Jan 30, 2021
22d2ae8
Merge branch 'release/1.2-dev' into accelerator-refactor-sharted-4
awaelchli Jan 30, 2021
901d392
Merge branch 'release/1.2-dev' into accelerator-refactor-sharted-4
awaelchli Jan 31, 2021
e174b8d
fix lightning optimizer merge
awaelchli Jan 31, 2021
98660de
reset bugreportmodel
awaelchli Jan 31, 2021
4d95b6c
unwrapping
awaelchli Jan 31, 2021
b69d013
step routing forward
awaelchli Jan 31, 2021
cb6676d
model access
awaelchli Jan 31, 2021
a33d27f
unwrap
awaelchli Jan 31, 2021
f7486e2
opt
awaelchli Jan 31, 2021
117f16d
Merge branch 'release/1.2-dev' into accelerator-refactor-sharted-4
awaelchli Jan 31, 2021
3792b72
integrate distrib_type
awaelchli Jan 31, 2021
ef85b81
sync changes
awaelchli Jan 31, 2021
9d9a940
sync
awaelchli Feb 1, 2021
f017a39
Merge branch 'release/1.2-dev' into accelerator-refactor-sharted-4
awaelchli Feb 1, 2021
a190a56
fixes
awaelchli Feb 1, 2021
73bb607
add forgotten generators
awaelchli Feb 1, 2021
c8c74f3
Merge branch 'release/1.2-dev' into accelerator-refactor-sharted-4
awaelchli Feb 1, 2021
ae71997
add missing logic
awaelchli Feb 1, 2021
d89847b
Merge branch 'release/1.2-dev' into accelerator-refactor-sharted-4
awaelchli Feb 1, 2021
0e686c3
update
awaelchli Feb 1, 2021
d6a43ea
import
awaelchli Feb 1, 2021
ceb8f75
missed imports
awaelchli Feb 1, 2021
fbb7c20
import fixes
awaelchli Feb 1, 2021
b610999
isort
awaelchli Feb 1, 2021
9b79924
mv f
awaelchli Feb 1, 2021
9afe54d
changelog
awaelchli Feb 1, 2021
3b63e82
Merge branch 'release/1.2-dev' into ref/update-plugins
awaelchli Feb 1, 2021
ca8cb68
format
awaelchli Feb 1, 2021
0633745
move helper to parallel plugin
awaelchli Feb 1, 2021
a622e0b
d
awaelchli Feb 1, 2021
18c682f
Merge branch 'ref/update-plugins' into accelerator-refactor-sharted-4
awaelchli Feb 1, 2021
f275803
add world size
awaelchli Feb 1, 2021
4ae008b
clean up
awaelchli Feb 1, 2021
3b3918b
Merge branch 'release/1.2-dev' into accelerator-refactor-sharted-4
awaelchli Feb 1, 2021
d4c6308
duplicate
awaelchli Feb 1, 2021
7eef4a0
Merge branch 'release/1.2-dev' into accelerator-refactor-sharted-4
awaelchli Feb 2, 2021
9949164
activate ddp_sharded and tpu
awaelchli Feb 2, 2021
6d47357
set nvidia flags
awaelchli Feb 2, 2021
a6864ec
remove unused colab var
awaelchli Feb 2, 2021
b4b9724
use_tpu <-> on_tpu attrs
awaelchli Feb 2, 2021
81001e3
make some ddp_cpu and clusterplugin tests pass
awaelchli Feb 2, 2021
cea000d
Ref/accelerator connector (#5742)
justusschock Feb 2, 2021
933e2a1
plugins
awaelchli Feb 2, 2021
ad451d8
manual optimization
justusschock Feb 2, 2021
a30a3cf
update optimizer routing
justusschock Feb 2, 2021
a05b291
add rank to torchelastic
justusschock Feb 2, 2021
4388e73
fix memory mixed precision
awaelchli Feb 2, 2021
be9d029
setstate on trainer for pickling in ddp spawn
awaelchli Feb 2, 2021
a90a160
add predict method
awaelchli Feb 2, 2021
767bee0
add back commented accelerator code
awaelchli Feb 2, 2021
f771a7f
adapt test for sync_batch_norm to new plugin
awaelchli Feb 3, 2021
1a3b04e
fix deprecated tests
awaelchli Feb 3, 2021
a1f4938
fix ddp cpu choice when no num_processes are given
awaelchli Feb 3, 2021
38bc8b7
Merge branch 'release/1.2-dev' into accelerator-refactor-sharded
awaelchli Feb 3, 2021
ce6b6de
yapf format
awaelchli Feb 3, 2021
3b7c20b
skip a memory test that cannot pass anymore
awaelchli Feb 3, 2021
f538c75
fix pickle error in spawn plugin
awaelchli Feb 3, 2021
b44d82e
x
awaelchli Feb 3, 2021
3820e77
avoid
awaelchli Feb 3, 2021
08ae327
x
awaelchli Feb 3, 2021
7d0e094
avoid tons of warnings from importing deprecated modules
awaelchli Feb 3, 2021
1028011
fix cyclic import in docs build
awaelchli Feb 3, 2021
11bd0d6
add support for sharded
justusschock Feb 4, 2021
6bf0b60
update typing
justusschock Feb 4, 2021
f94082b
add sharded and sharded_spawn to distributed types
justusschock Feb 4, 2021
7939b99
make unwrap model default
justusschock Feb 4, 2021
9131ffb
refactor LightningShardedDataParallel similar to LightningDistributed…
justusschock Feb 4, 2021
ed7425c
update sharded spawn to reflect changes
justusschock Feb 4, 2021
209a164
update sharded to reflect changes
justusschock Feb 4, 2021
837a070
Merge 1.1.5 changes
awaelchli Feb 4, 2021
136b321
fix merge
awaelchli Feb 4, 2021
ffcb535
fix merge
awaelchli Feb 4, 2021
1edfa73
yapf isort
awaelchli Feb 4, 2021
a689b81
merge 1.1.6
awaelchli Feb 4, 2021
330b14c
fix merge
awaelchli Feb 4, 2021
ef258d5
yapf isort
awaelchli Feb 4, 2021
c85000d
fix indentation in test
awaelchli Feb 4, 2021
5f3a35e
copy over reinit scheduler implementation from dev1.2
awaelchli Feb 4, 2021
fa1c9b7
fix apex tracking calls with dev_debugger
awaelchli Feb 5, 2021
e330a11
reduce diff to dev1.2, clean up
awaelchli Feb 5, 2021
994ac82
fix trainer config test when gpus>0 and num_processes >0 and ddp_cpu
awaelchli Feb 5, 2021
1a78601
sort plugin tests legacy/new
awaelchli Feb 6, 2021
4b76448
fix error handling for amp on cpu
awaelchli Feb 6, 2021
bfd54ab
Merge branch 'release/1.2-dev' into patch117
awaelchli Feb 6, 2021
0574d22
fix merge
awaelchli Feb 6, 2021
6ef6637
Merge branch 'patch117' into accelerator-refactor-sharded
awaelchli Feb 6, 2021
9feda39
[Feat] Resolve manual_backward (#5837)
tchaton Feb 6, 2021
7bb9d9f
fix tests/accelerator tests on cpu
awaelchli Feb 6, 2021
13ae1ff
[BugFix] Resolve manual optimization (#5852)
tchaton Feb 6, 2021
fc3b4db
Merge formatting changes from 1.2 branch
awaelchli Feb 6, 2021
b437642
Remove copy trainer parameters to happen earlier within the loop and …
SeanNaren Feb 7, 2021
8c6aa83
Merge branch 'release/1.2-dev' into accelerator-refactor-sharded
Feb 7, 2021
beb980a
resovle a bug
Feb 7, 2021
7a0fd27
Accelerator refactor sharded rpc (#5854)
justusschock Feb 7, 2021
0d0ced5
resolve bug
Feb 7, 2021
1f3ab76
fix assert in rpc test
awaelchli Feb 7, 2021
f1b1121
resolve a test
Feb 7, 2021
cd31fa1
fix docs compilation
awaelchli Feb 8, 2021
f48793e
accelerator refactor - fix for sharded parity test (#5866)
awaelchli Feb 8, 2021
81ff6ea
Remove DDP2 as this does not apply
Feb 8, 2021
20deb46
Add missing pre optimizer hook to ensure lambda closure is called
Feb 8, 2021
be4d1a2
Merge branch 'release/1.2-dev' into accelerator-refactor-sharded
Feb 8, 2021
0ac5fc4
fix apex docstring
awaelchli Feb 8, 2021
07fdd95
[accelerator][BugFix] Resolve some test for 1 gpu (#5863)
tchaton Feb 8, 2021
384b791
yapf isort
awaelchli Feb 8, 2021
b1a84b8
resolve flake8
tchaton Feb 8, 2021
a157a29
fix apex doctests
awaelchli Feb 8, 2021
08cfc65
fix apex doctests 2
awaelchli Feb 8, 2021
7888bfd
resolve docs
tchaton Feb 8, 2021
b5b4243
update drone
tchaton Feb 8, 2021
93ceb4c
Merge branch 'accelerator-refactor-sharded' of https://github.com/PyT…
tchaton Feb 8, 2021
d001bcf
clean env
Feb 8, 2021
ad47f47
Merge branch 'release/1.2-dev' into accelerator-refactor-sharded
tchaton Feb 8, 2021
60bfb1a
Merge branch 'release/1.2-dev' into accelerator-refactor-sharded
tchaton Feb 8, 2021
0608a41
update
Feb 8, 2021
f0120b5
update
Feb 8, 2021
bf8874e
Merge branch 'accelerator-refactor-sharded' of https://github.com/PyT…
Feb 8, 2021
baf7d7f
update
tchaton Feb 8, 2021
9360aad
update
tchaton Feb 8, 2021
b814cdc
merge
justusschock Feb 9, 2021
0d3ea37
Merge branch 'accelerator-refactor-sharded' of github.com:PytorchLigh…
justusschock Feb 9, 2021
f1f90c2
Fix RPC related tests, clean out old API, update for new accelerator …
SeanNaren Feb 9, 2021
6d05881
Merge branch 'release/1.2-dev' into accelerator-refactor-sharded
justusschock Feb 10, 2021
d86fdff
Update test_remove_1-4.py
justusschock Feb 10, 2021
5fbc1cf
Expose properties for tpu cores/gpus/num_gpus
Feb 10, 2021
aa9aea0
Add root GPU property
Feb 10, 2021
c35baf1
Move properties to properties.py
Feb 10, 2021
a9c6e21
Merge branch 'release/1.2-dev' into accelerator-refactor-sharded
awaelchli Feb 10, 2021
8f3947b
move tests that were previously in drone
awaelchli Feb 10, 2021
50ecc4a
Fix root GPU property (#5908)
SeanNaren Feb 10, 2021
c7d0075
fix best model path transfer when no checkpoint callback available
awaelchli Feb 10, 2021
3f61d15
Merge remote-tracking branch 'original/accelerator-refactor-sharded' …
awaelchli Feb 10, 2021
061ea46
Fix setup hook order [wip] (#5858)
SeanNaren Feb 10, 2021
1fe1f91
rename ddp sequential -> rpc sequential for special test
awaelchli Feb 10, 2021
3683f5a
Merge branch 'release/1.2-dev' into accelerator-refactor-sharded
awaelchli Feb 10, 2021
1f01b81
revert
awaelchli Feb 10, 2021
135c236
fix stupid merge problem
awaelchli Feb 10, 2021
222653d
Use property in connector for sampler (#5913)
SeanNaren Feb 10, 2021
f4311cd
Merge branch 'release/1.2-dev' into accelerator-refactor-sharded
awaelchli Feb 11, 2021
b210dee
merge the import conflicts
awaelchli Feb 11, 2021
236009e
fix spawning of processes in slurm
awaelchli Feb 11, 2021
aace276
[wip] Fix some bugs for TPU [skip ci] (#5878)
tchaton Feb 11, 2021
68273f5
resolve some tests
Feb 11, 2021
ca77fa4
update
Feb 11, 2021
c35edfd
Merge branch 'release/1.2-dev' into accelerator-refactor-sharded
justusschock Feb 11, 2021
8cacef7
fix imports
justusschock Feb 11, 2021
f7bbe48
update
Feb 11, 2021
30d9800
Merge branch 'accelerator-refactor-sharded' of https://github.com/PyT…
Feb 11, 2021
25f7f13
resolve flake8
tchaton Feb 11, 2021
fa28c41
update azure pipeline
tchaton Feb 11, 2021
51c27e6
Merge branch 'release/1.2-dev' into accelerator-refactor-sharded
tchaton Feb 11, 2021
b888d68
skip a sharded test on cpu that requires a gpu
awaelchli Feb 11, 2021
01ca4cd
resolve tpus
Feb 11, 2021
181d143
Merge branch 'master' into accelerator-refactor-sharded
justusschock Feb 11, 2021
946a1e9
resolve bug
Feb 11, 2021
2ad1a6e
Merge branch 'accelerator-refactor-sharded' of https://github.com/PyT…
Feb 11, 2021
6e0aff0
resolve flake8
tchaton Feb 11, 2021
a931791
update
Feb 11, 2021
319d034
Merge branch 'accelerator-refactor-sharded' of https://github.com/PyT…
Feb 11, 2021
4117bec
updat utils
Feb 11, 2021
8d000f7
Merge branch 'master' into accelerator-refactor-sharded
tchaton Feb 11, 2021
0b1ba67
revert permission change on files
awaelchli Feb 11, 2021
cc385b4
suggestions from carlos
awaelchli Feb 11, 2021
e9eb318
remove unrelated formatting changes
awaelchli Feb 11, 2021
7c08400
remove incomplete comment
awaelchli Feb 11, 2021
7c3d184
Update pytorch_lightning/accelerators/__init__.py
awaelchli Feb 11, 2021
503426e
remove unrelated formatting change
awaelchli Feb 11, 2021
c0fbf7a
add types
awaelchli Feb 11, 2021
23a9a10
warn 1.7 ddp manual backward only if ddp kwarg unset
awaelchli Feb 11, 2021
a70ee4a
yapf + isort
awaelchli Feb 11, 2021
b0621c4
pep8 unused imports
awaelchli Feb 11, 2021
18bfe70
Merge branch 'master' into accelerator-refactor-sharded
awaelchli Feb 11, 2021
7b0515d
fix cyclic import in docs
awaelchli Feb 12, 2021
d966057
Apply suggestions from code review
Borda Feb 12, 2021
f636d9d
typer in accelerator.py
Borda Feb 12, 2021
5579ea7
typo
tchaton Feb 12, 2021
f5df88b
Apply suggestions from code review
Borda Feb 12, 2021
233694e
formatting
Borda Feb 12, 2021
a47644a
update on comments
tchaton Feb 12, 2021
80dacb6
update typo
tchaton Feb 12, 2021
99573eb
Update pytorch_lightning/trainer/properties.py
tchaton Feb 12, 2021
ab859d7
update
tchaton Feb 12, 2021
ad5742a
suggestion from code review
awaelchli Feb 12, 2021
5eaec98
suggestion from code review
awaelchli Feb 12, 2021
941cf77
Merge branch 'master' into accelerator-refactor-sharded
mergify[bot] Feb 12, 2021
8491c29
Merge branch 'master' into accelerator-refactor-sharded
mergify[bot] Feb 12, 2021
bd2b23d
Merge branch 'master' into accelerator-refactor-sharded
mergify[bot] Feb 12, 2021
45 changes: 12 additions & 33 deletions benchmarks/test_sharded_parity.py
@@ -15,7 +15,7 @@
import os
import platform
import time
from typing import Type, Union
from typing import Type

import pytest
import torch
@@ -32,10 +32,8 @@
@pytest.mark.skipif(platform.system() == "Windows", reason="Distributed training is not supported on Windows")
@pytest.mark.skipif(not _FAIRSCALE_AVAILABLE, reason="Fairscale is not available")
def test_ddp_sharded_plugin_correctness_one_gpu():
plugin_parity_test(
sharded_parity_test(
gpus=1,
accelerator='ddp_spawn',
plugin=DDPShardedPlugin(),
model_cls=SeedTrainLoaderModel,
)

@@ -45,11 +43,9 @@ def test_ddp_sharded_plugin_correctness_one_gpu():
@pytest.mark.skipif(platform.system() == "Windows", reason="Distributed training is not supported on Windows")
@pytest.mark.skipif(not _FAIRSCALE_AVAILABLE, reason="Fairscale is not available")
def test_ddp_sharded_plugin_correctness_amp_one_gpu():
plugin_parity_test(
sharded_parity_test(
gpus=1,
precision=16,
accelerator='ddp_spawn',
plugin=DDPShardedPlugin(),
model_cls=SeedTrainLoaderModel,
)

@@ -59,10 +55,8 @@ def test_ddp_sharded_plugin_correctness_amp_one_gpu():
@pytest.mark.skipif(platform.system() == "Windows", reason="Distributed training is not supported on Windows")
@pytest.mark.skipif(not _FAIRSCALE_AVAILABLE, reason="Fairscale is not available")
def test_ddp_sharded_plugin_correctness_multi_gpu():
plugin_parity_test(
sharded_parity_test(
gpus=2,
accelerator='ddp_spawn',
plugin=DDPShardedPlugin(),
model_cls=SeedTrainLoaderModel,
max_percent_speed_diff=0.25, # todo: Increase speed diff since only 2 GPUs sharding 2 optimizers
)
@@ -73,11 +67,9 @@ def test_ddp_sharded_plugin_correctness_multi_gpu():
@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="test requires multi-GPU machine")
@pytest.mark.skipif(not _FAIRSCALE_AVAILABLE, reason="Fairscale is not available")
def test_ddp_sharded_plugin_correctness_amp_multi_gpu():
plugin_parity_test(
sharded_parity_test(
gpus=2,
precision=16,
accelerator='ddp_spawn',
plugin=DDPShardedPlugin(),
model_cls=SeedTrainLoaderModel,
max_percent_speed_diff=0.25, # todo: Increase speed diff since only 2 GPUs sharding 2 optimizers
)
@@ -88,11 +80,9 @@ def test_ddp_sharded_plugin_correctness_amp_multi_gpu():
@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="test requires multi-GPU machine")
@pytest.mark.skipif(not _FAIRSCALE_AVAILABLE, reason="Fairscale is not available")
def test_ddp_string_sharded_plugin_correctness_amp_multi_gpu():
plugin_parity_test(
sharded_parity_test(
gpus=2,
precision=16,
accelerator='ddp_spawn',
plugin='ddp_sharded',
model_cls=SeedTrainLoaderModel,
max_percent_speed_diff=0.25, # todo: Increase speed diff since only 2 GPUs sharding 2 optimizers
)
@@ -105,11 +95,9 @@ def test_ddp_string_sharded_plugin_correctness_amp_multi_gpu():
)
@DDPLauncher.run("--accelerator ddp --gpus 2 --precision 32")
def test_ddp_sharded_plugin_correctness_multi_gpu_ddp(tmpdir, args=None):
plugin_parity_test(
sharded_parity_test(
gpus=args.gpus,
precision=args.precision,
accelerator=args.accelerator,
plugin=DDPShardedPlugin(),
model_cls=SeedTrainLoaderModel,
)

@@ -121,11 +109,9 @@ def test_ddp_sharded_plugin_correctness_multi_gpu_ddp(tmpdir, args=None):
)
@DDPLauncher.run("--accelerator ddp --gpus 2 --precision 16")
def test_ddp_sharded_plugin_correctness_amp_multi_gpu_ddp(tmpdir, args=None):
plugin_parity_test(
sharded_parity_test(
gpus=args.gpus,
precision=args.precision,
accelerator=args.accelerator,
plugin=DDPShardedPlugin(),
model_cls=SeedTrainLoaderModel,
)

@@ -138,10 +124,8 @@ def test_ddp_sharded_plugin_correctness_multi_gpu_multi_optim():
"""
Ensures same results using multiple optimizers across multiple GPUs
"""
plugin_parity_test(
plugin=DDPShardedPlugin(),
sharded_parity_test(
gpus=2,
accelerator='ddp_spawn',
model_cls=SeedTrainLoaderMultipleOptimizersModel,
max_percent_speed_diff=0.25, # todo: Increase speed diff since only 2 GPUs sharding 2 optimizers
)
@@ -155,10 +139,8 @@ def test_ddp_sharded_plugin_correctness_multi_gpu_multi_optim_manual(tmpdir):
"""
Ensures using multiple optimizers across multiple GPUs with manual optimization
"""
plugin_parity_test(
plugin=DDPShardedPlugin(),
sharded_parity_test(
gpus=2,
accelerator='ddp_spawn',
model_cls=SeedTrainLoaderManualModel,
max_percent_speed_diff=0.25, # todo: Increase speed diff since only 2 GPUs sharding 2 optimizers
)
@@ -273,9 +255,7 @@ def plugin_parity_test(

Args:
model_cls: Model class to use for test.
plugin: Plugin to parity test.
seed: Seed for generators. Note that this does not handle the seed for data-loading on multi-process.
accelerator: Accelerator type for test.
gpus: Number of GPUS to enable.
precision: Whether to use AMP or normal FP32 training.
max_percent_speed_diff: The maximum speed difference compared to normal DDP training.
@@ -293,7 +273,7 @@ def plugin_parity_test(
max_epochs=1,
gpus=gpus,
precision=precision,
accelerator=accelerator,
accelerator='ddp_spawn',
)

max_memory_ddp, ddp_time = record_ddp_fit_model_stats(trainer=trainer, model=ddp_model, use_cuda=use_cuda)
@@ -307,8 +287,7 @@ def plugin_parity_test(
max_epochs=1,
gpus=gpus,
precision=precision,
accelerator=accelerator,
plugins=[plugin],
accelerator='ddp_sharded_spawn',
)

max_memory_custom, custom_model_time = record_ddp_fit_model_stats(
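The benchmark changes above drop the explicit plugin/accelerator pair: the sharded run is now requested purely through the accelerator string ('ddp_sharded_spawn', with plain 'ddp_spawn' for the DDP baseline) instead of passing plugins=[DDPShardedPlugin()]. A minimal sketch of the corresponding user-facing Trainer call, assuming a multi-GPU machine with fairscale installed; fit_sharded and model_cls are illustrative names, not part of the benchmark:

import pytorch_lightning as pl


def fit_sharded(model_cls, gpus: int = 2, precision: int = 32):
    # sharded DDP selected by the accelerator string rather than by
    # accelerator='ddp_spawn' plus plugins=[DDPShardedPlugin()]
    model = model_cls()
    trainer = pl.Trainer(
        max_epochs=1,
        gpus=gpus,
        precision=precision,
        accelerator='ddp_sharded_spawn',
    )
    trainer.fit(model)
    return trainer
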
29 changes: 4 additions & 25 deletions pytorch_lightning/accelerators/__init__.py
@@ -1,25 +1,4 @@
# Copyright The PyTorch Lightning team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pytorch_lightning.accelerators.legacy.accelerator import Accelerator # noqa: F401
from pytorch_lightning.accelerators.legacy.cpu_accelerator import CPUAccelerator # noqa: F401
from pytorch_lightning.accelerators.legacy.ddp2_accelerator import DDP2Accelerator # noqa: F401
from pytorch_lightning.accelerators.legacy.ddp_accelerator import DDPAccelerator # noqa: F401
from pytorch_lightning.accelerators.legacy.ddp_cpu_hpc_accelerator import DDPCPUHPCAccelerator # noqa: F401
from pytorch_lightning.accelerators.legacy.ddp_cpu_spawn_accelerator import DDPCPUSpawnAccelerator # noqa: F401
from pytorch_lightning.accelerators.legacy.ddp_hpc_accelerator import DDPHPCAccelerator # noqa: F401
from pytorch_lightning.accelerators.legacy.ddp_spawn_accelerator import DDPSpawnAccelerator # noqa: F401
from pytorch_lightning.accelerators.legacy.dp_accelerator import DataParallelAccelerator # noqa: F401
from pytorch_lightning.accelerators.legacy.gpu_accelerator import GPUAccelerator # noqa: F401
from pytorch_lightning.accelerators.legacy.horovod_accelerator import HorovodAccelerator # noqa: F401
from pytorch_lightning.accelerators.legacy.tpu_accelerator import TPUAccelerator # noqa: F401
from pytorch_lightning.accelerators.accelerator import Accelerator
from pytorch_lightning.accelerators.cpu import CPUAccelerator
from pytorch_lightning.accelerators.gpu import GPUAccelerator
from pytorch_lightning.accelerators.tpu import TPUAccelerator
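
The trimmed __init__.py narrows what the package root re-exports: only the new device-centric Accelerator, CPUAccelerator, GPUAccelerator and TPUAccelerator remain, while the backend-specific classes stay reachable only through their pytorch_lightning.accelerators.legacy modules (paths taken from the removed import lines above). A small sketch of both import styles:

# New root-level imports: exactly the four classes re-exported in the diff
from pytorch_lightning.accelerators import Accelerator, CPUAccelerator, GPUAccelerator, TPUAccelerator

# Backend-specific accelerators are no longer re-exported at the root and
# need their full legacy path, for example:
from pytorch_lightning.accelerators.legacy.ddp_accelerator import DDPAccelerator
from pytorch_lightning.accelerators.legacy.horovod_accelerator import HorovodAccelerator
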
52 changes: 31 additions & 21 deletions pytorch_lightning/accelerators/accelerator.py
@@ -204,6 +204,23 @@ def validation_step_end(self, output):
"""
return output

def predict(self, args):
"""The prediction step.

Args:
args: the arguments for the models predict step. Can consist of the following:
batch (:class:`~torch.Tensor` | (:class:`~torch.Tensor`, ...) | [:class:`~torch.Tensor`, ...]):
The output of your :class:`~torch.utils.data.DataLoader`. A tensor, tuple or list.
batch_idx (int): Integer displaying index of this batch
optimizer_idx (int): When using multiple optimizers, this argument will also be present.
hiddens(:class:`~torch.Tensor`): Passed in if
:paramref:`~pytorch_lightning.trainer.trainer.Trainer.truncated_bptt_steps` > 0.

"""
batch = self.to_device(args[0])
args[0] = batch
return self.training_type_plugin.predict(*args)

def process_dataloader(
self, dataloader: Union[Iterable, torch.utils.data.DataLoader]
) -> Union[Iterable, torch.utils.data.DataLoader]:
@@ -244,45 +261,35 @@ def backward(
def optimizer_step(
self,
optimizer: torch.optim.Optimizer,
current_epoch: int,
batch_idx: int,
opt_idx: int,
lambda_closure: Callable,
**kwargs
):
"""performs the actual optimizer step.

Args:
optimizer: the optimizer performing the step
current_epoch: current training epoch
batch_idx: index of the current batch
opt_idx: index of the current optimizer
lambda_closure: closure calculating the loss value

"""
model_ref = self.lightning_module
is_lbfgs = isinstance(optimizer, torch.optim.LBFGS)
native_amp = (
isinstance(self.precision_plugin, MixedPrecisionPlugin) and self.precision_plugin.backend == AMPType.NATIVE
)

self.precision_plugin.pre_optimizer_step(optimizer, opt_idx)
self.training_type_plugin.pre_optimizer_step(optimizer, opt_idx)

# model hook
res = model_ref.optimizer_step(
epoch=current_epoch,
batch_idx=batch_idx,
optimizer=optimizer,
optimizer_idx=opt_idx,
optimizer_closure=lambda_closure,
on_tpu=False, # TPUAccelerator class sets this as True
using_native_amp=native_amp,
using_lbfgs=is_lbfgs,
)
optimizer.step(closure=lambda_closure, **kwargs)

self.precision_plugin.post_optimizer_step(optimizer, opt_idx)
self.training_type_plugin.post_optimizer_step(optimizer, opt_idx)
return res

if self.rpc_enabled and self.training_type_plugin.is_main_rpc_process:

# Initialize optimizer step on main process
self.training_type_plugin.worker_optimizer_step(
model=self.lightning_module,
opt_idx=opt_idx,
**kwargs
)

def optimizer_zero_grad(
self, current_epoch: int, batch_idx: int, optimizer: torch.optim.Optimizer, opt_idx: int
@@ -374,3 +381,6 @@ def optimizer_state(self, optimizer: Optimizer) -> dict:

def on_save(self, checkpoint):
return checkpoint

def barrier(self, name: Optional[str] = None) -> None:
self.training_type_plugin.barrier(name=name)
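
Taken together, the accelerator.py hunks show the delegation pattern this refactor is built around: optimizer_step no longer assembles epoch/batch arguments or special-cases LBFGS and native AMP itself, but brackets a plain optimizer.step(closure=...) with precision- and training-type-plugin hooks, and collectives such as barrier() are forwarded to the training-type plugin. A stripped-down sketch of that pattern, with placeholder plugin objects rather than the real Lightning classes:

from typing import Callable, Optional

import torch


class SketchAccelerator:
    """Illustrative stand-in for the refactored Accelerator, not the real class."""

    def __init__(self, precision_plugin, training_type_plugin) -> None:
        # both plugins are expected to expose pre/post_optimizer_step;
        # the training-type plugin additionally owns collectives such as barrier()
        self.precision_plugin = precision_plugin
        self.training_type_plugin = training_type_plugin

    def optimizer_step(self, optimizer: torch.optim.Optimizer, opt_idx: int,
                       lambda_closure: Callable, **kwargs) -> None:
        self.precision_plugin.pre_optimizer_step(optimizer, opt_idx)
        self.training_type_plugin.pre_optimizer_step(optimizer, opt_idx)
        optimizer.step(closure=lambda_closure, **kwargs)  # closure recomputes the loss when needed
        self.precision_plugin.post_optimizer_step(optimizer, opt_idx)
        self.training_type_plugin.post_optimizer_step(optimizer, opt_idx)

    def barrier(self, name: Optional[str] = None) -> None:
        self.training_type_plugin.barrier(name=name)

Keeping device-, precision- and distribution-specific behaviour inside plugins, with the accelerator only orchestrating them, is the design choice the rest of the PR follows.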