
Switch Apex with Pytorch #336

Merged: 19 commits from u_switch_apex_ddp_to_torch into NVIDIA:master on Feb 14, 2020

Conversation

@blisc (Collaborator) commented Feb 7, 2020

  • Switch DistributedDataParallel
  • Switch SyncBatchNorm
  • Address gradient_predivide_factor
  • Address gradient accumulation

This PR switches from apex's DistributedDataParallel to torch's DistributedDataParallel.
Warning: gradient_predivide_factor no longer works after this switch.
Warning: in multi-GPU runs, neural modules with no weights MUST inherit from NonTrainableNM.
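
For reference, a minimal sketch of what the apex-to-torch switch looks like in user code. This is illustrative only, not NeMo's actual implementation; the toy model and the LOCAL_RANK environment-variable handling are assumptions:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    # One process per GPU; the launcher is assumed to set LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model with a BatchNorm layer, standing in for a neural module.
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 16),
        torch.nn.BatchNorm1d(16),
    ).cuda()

    # apex.parallel.convert_syncbn_model(model) becomes:
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

    # apex.parallel.DistributedDataParallel(model) becomes:
    model = DistributedDataParallel(model, device_ids=[local_rank])

    batch = torch.randn(8, 16, device="cuda")

    # Gradient accumulation: skip the gradient all-reduce on intermediate steps.
    with model.no_sync():
        model(batch).sum().backward()
    model(batch).sum().backward()  # gradients sync on this final backward

torch's DDP always averages gradients over the world size and exposes no pre-division knob, which is why gradient_predivide_factor stops working after the switch.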

Signed-off-by: Jason <jasoli@nvidia.com>
…o_torch

Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 7, 2020

This pull request fixes 1 alert when merging 675b0fe into 18b528e - view on LGTM.com

fixed alerts:

  • 1 for Unused import

Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 7, 2020

This pull request fixes 1 alert when merging d535720 into 18b528e - view on LGTM.com

fixed alerts:

  • 1 for Unused import

Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 7, 2020

This pull request fixes 1 alert when merging a9be01a into 18b528e - view on LGTM.com

fixed alerts:

  • 1 for Unused import

@blisc (Collaborator, Author) commented Feb 7, 2020

For some reason, memory usage shoots up from 12 GB to 20 GB; I need to find out why.
Supposedly this is not an issue: the memory is allocated but unused.
It could be worked around via the following at the end of _eval(), but I don't think it's necessary:

if self.global_rank == 0:
    del values_dict  # drop the gathered evaluation results on the main process
del registered_e_tensors  # drop references to the per-rank evaluation tensors
torch.cuda.empty_cache()  # return cached GPU memory to the driver
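
A quick way to verify the "allocated but unused" explanation is to compare live-tensor usage against what the caching allocator holds (a hedged sketch; in older torch versions memory_reserved() was named memory_cached()):

    import torch

    allocated = torch.cuda.memory_allocated()  # bytes held by live tensors
    reserved = torch.cuda.memory_reserved()    # bytes held by the caching allocator
    print(f"allocated {allocated / 2**30:.2f} GiB, reserved {reserved / 2**30:.2f} GiB")

    # A large reserved-minus-allocated gap means the memory is cached for
    # reuse rather than leaked; empty_cache() hands it back to the driver.
    torch.cuda.empty_cache()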


lgtm-com bot commented Feb 7, 2020

This pull request introduces 1 alert and fixes 1 when merging 71d4bff into 4f299f4 - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused import


lgtm-com bot commented Feb 7, 2020

This pull request introduces 1 alert and fixes 1 when merging 64ccf26 into c6a3cdd - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused import

Signed-off-by: Jason <jasoli@nvidia.com>
@blisc changed the title from "[WIP] Switch Apex with Pytorch" to "Switch Apex with Pytorch" on Feb 8, 2020
@blisc marked this pull request as ready for review February 8, 2020 00:33

lgtm-com bot commented Feb 8, 2020

This pull request introduces 1 alert and fixes 1 when merging 7235317 into c6a3cdd - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused import

Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 8, 2020

This pull request fixes 1 alert when merging f627f44 into 54a8e9e - view on LGTM.com

fixed alerts:

  • 1 for Unused import

Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 8, 2020

This pull request fixes 1 alert when merging b85cb40 into 54a8e9e - view on LGTM.com

fixed alerts:

  • 1 for Unused import

…o_torch

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 13, 2020

This pull request fixes 1 alert when merging f1a57bb into 403238f - view on LGTM.com

fixed alerts:

  • 1 for Unused import

Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 13, 2020

This pull request fixes 1 alert when merging dabaea2 into 403238f - view on LGTM.com

fixed alerts:

  • 1 for Unused import

Signed-off-by: Jason <jasoli@nvidia.com>
…o_torch

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 13, 2020

This pull request fixes 2 alerts when merging a71da02 into f072029 - view on LGTM.com

fixed alerts:

  • 1 for Unused import
  • 1 for Wrong number of arguments in a class instantiation

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 13, 2020

This pull request fixes 2 alerts when merging 390c410 into f072029 - view on LGTM.com

fixed alerts:

  • 1 for Unused import
  • 1 for Wrong number of arguments in a class instantiation

Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 13, 2020

This pull request fixes 2 alerts when merging 8c26247 into f072029 - view on LGTM.com

fixed alerts:

  • 1 for Unused import
  • 1 for Wrong number of arguments in a class instantiation

@okuchaiev merged commit 3e04e09 into NVIDIA:master on Feb 14, 2020
@blisc deleted the u_switch_apex_ddp_to_torch branch February 14, 2020 21:16
dcurran90 pushed a commit to dcurran90/NeMo that referenced this pull request Oct 15, 2024
cli: use non-zero exit status for error scenarios