
Switch Apex with Pytorch #336

Merged: 19 commits from u_switch_apex_ddp_to_torch into NVIDIA:master on Feb 14, 2020

Conversation

@blisc (Collaborator) commented Feb 7, 2020

  • Switch DistributedDataParallel
  • Switch SyncBatchNorm
  • Address gradient_predivide_factor
  • Address gradient accumulation

This PR switches from apex's DistributedDataParallel to torch's DistributedDataParallel.
Warning: gradient_predivide_factor no longer works after this switch.
Warning: in multi-GPU runs, neural modules with no weights MUST inherit from NonTrainableNM.
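
For reference, a minimal sketch of what the apex-to-torch switch looks like in user code. This is illustrative only, not NeMo's actual implementation; the toy model and the LOCAL_RANK environment-variable handling are assumptions:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    # One process per GPU; the launcher is assumed to set LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model with a BatchNorm layer, standing in for a neural module.
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 16),
        torch.nn.BatchNorm1d(16),
    ).cuda()

    # apex.parallel.convert_syncbn_model(model) becomes:
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

    # apex.parallel.DistributedDataParallel(model) becomes:
    model = DistributedDataParallel(model, device_ids=[local_rank])

    batch = torch.randn(8, 16, device="cuda")

    # Gradient accumulation: skip the gradient all-reduce on intermediate steps.
    with model.no_sync():
        model(batch).sum().backward()
    model(batch).sum().backward()  # gradients sync on this final backward

torch's DDP always averages gradients over the world size and exposes no pre-division knob, which is why gradient_predivide_factor stops working after the switch.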

Signed-off-by: Jason <jasoli@nvidia.com>
…o_torch

Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 7, 2020

This pull request fixes 1 alert when merging 675b0fe into 18b528e - view on LGTM.com

fixed alerts:

  • 1 for Unused import

Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 7, 2020

This pull request fixes 1 alert when merging d535720 into 18b528e - view on LGTM.com

fixed alerts:

  • 1 for Unused import

Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 7, 2020

This pull request fixes 1 alert when merging a9be01a into 18b528e - view on LGTM.com

fixed alerts:

  • 1 for Unused import

@blisc (Collaborator, Author) commented Feb 7, 2020

For some reason, memory usage shoots up from 12 GB to 20 GB; I need to find out why.
Supposedly this is not an issue: the memory is allocated but unused.
It could be worked around via the following at the end of _eval(), but I don't think it's necessary:

if self.global_rank == 0:
    del values_dict  # drop the gathered evaluation results on the main process
del registered_e_tensors  # drop references to the per-rank evaluation tensors
torch.cuda.empty_cache()  # return cached GPU memory to the driver
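
A quick way to verify the "allocated but unused" explanation is to compare live-tensor usage against what the caching allocator holds (a hedged sketch; in older torch versions memory_reserved() was named memory_cached()):

    import torch

    allocated = torch.cuda.memory_allocated()  # bytes held by live tensors
    reserved = torch.cuda.memory_reserved()    # bytes held by the caching allocator
    print(f"allocated {allocated / 2**30:.2f} GiB, reserved {reserved / 2**30:.2f} GiB")

    # A large reserved-minus-allocated gap means the memory is cached for
    # reuse rather than leaked; empty_cache() hands it back to the driver.
    torch.cuda.empty_cache()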


lgtm-com bot commented Feb 7, 2020

This pull request introduces 1 alert and fixes 1 when merging 71d4bff into 4f299f4 - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused import


lgtm-com bot commented Feb 7, 2020

This pull request introduces 1 alert and fixes 1 when merging 64ccf26 into c6a3cdd - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused import

Signed-off-by: Jason <jasoli@nvidia.com>
@blisc changed the title from "[WIP] Switch Apex with Pytorch" to "Switch Apex with Pytorch" on Feb 8, 2020
@blisc marked this pull request as ready for review February 8, 2020 00:33

lgtm-com bot commented Feb 8, 2020

This pull request introduces 1 alert and fixes 1 when merging 7235317 into c6a3cdd - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused import

Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 8, 2020

This pull request fixes 1 alert when merging f627f44 into 54a8e9e - view on LGTM.com

fixed alerts:

  • 1 for Unused import

Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 8, 2020

This pull request fixes 1 alert when merging b85cb40 into 54a8e9e - view on LGTM.com

fixed alerts:

  • 1 for Unused import

…o_torch

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 13, 2020

This pull request fixes 1 alert when merging f1a57bb into 403238f - view on LGTM.com

fixed alerts:

  • 1 for Unused import

Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 13, 2020

This pull request fixes 1 alert when merging dabaea2 into 403238f - view on LGTM.com

fixed alerts:

  • 1 for Unused import

Signed-off-by: Jason <jasoli@nvidia.com>
…o_torch

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 13, 2020

This pull request fixes 2 alerts when merging a71da02 into f072029 - view on LGTM.com

fixed alerts:

  • 1 for Unused import
  • 1 for Wrong number of arguments in a class instantiation

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 13, 2020

This pull request fixes 2 alerts when merging 390c410 into f072029 - view on LGTM.com

fixed alerts:

  • 1 for Unused import
  • 1 for Wrong number of arguments in a class instantiation

Signed-off-by: Jason <jasoli@nvidia.com>

lgtm-com bot commented Feb 13, 2020

This pull request fixes 2 alerts when merging 8c26247 into f072029 - view on LGTM.com

fixed alerts:

  • 1 for Unused import
  • 1 for Wrong number of arguments in a class instantiation

@okuchaiev merged commit 3e04e09 into NVIDIA:master on Feb 14, 2020
@blisc deleted the u_switch_apex_ddp_to_torch branch February 14, 2020 21:16
dcurran90 pushed a commit to dcurran90/NeMo that referenced this pull request Oct 15, 2024
cli: use non-zero exit status for error scenarios