This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

FEATURE Bert fp16 norm: multi-tensor sum_sq #1115

Merged
merged 4 commits into dmlc:master on Jan 21, 2020

Conversation

@MoisesHer (Contributor) commented Jan 15, 2020

Description

Make use of the multi-tensor sum-of-squares operator when computing the gradient norm in BERT training.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Instead of sequentially computing the dot product of each tensor with itself and then summing the results, this PR uses the MXNet operator multi_sum_sq, which computes the sum of squares of several tensors in parallel (see the sketch after this list).
    If multiple contexts/devices are available, the tensors are distributed among them, and the partial results are then reduced into a single scalar.
  • Added test in test/test_utils.py (test_grad_global_norm)
  • Modified clip_grad_global_norm to use grad_global_norm
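
As a rough illustration, here is a minimal sketch of the fused-norm idea, assuming MXNet >= 1.6 (where mx.nd.multi_sum_sq is available) and all gradients on a single context. The function names mirror the PR's grad_global_norm/clip_grad_global_norm, but the bodies are simplified illustrations (no multi-device distribution, no fp16 loss scaling), not the merged code.

```python
# Minimal sketch, assuming MXNet >= 1.6 and all gradients on one context;
# the actual PR additionally handles multiple devices and fp16 loss scaling.
import math
import mxnet as mx

def grad_global_norm(parameters, ctx=mx.cpu()):
    """Global L2 norm of all gradients, computed with one fused kernel."""
    grads = [p.grad(ctx) for p in parameters if p.grad_req != 'null']
    # multi_sum_sq returns a 1-D array holding the sum of squares of each
    # input, replacing a Python loop of per-tensor dot products.
    sums = mx.nd.multi_sum_sq(*grads, num_arrays=len(grads))
    return math.sqrt(sums.sum().asscalar())

def clip_grad_global_norm(parameters, max_norm, ctx=mx.cpu()):
    """Rescale gradients in place so their global norm is at most max_norm."""
    total_norm = grad_global_norm(parameters, ctx)
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for p in parameters:
            if p.grad_req != 'null':
                grad = p.grad(ctx)
                grad *= scale  # in-place multiply on the NDArray
    return total_norm
```

Under these assumptions, a call such as clip_grad_global_norm(net.collect_params().values(), 1.0) replaces the old per-tensor loop with a single fused reduction.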

Comments

@MoisesHer MoisesHer requested a review from a team as a code owner January 15, 2020 21:58
codecov bot commented Jan 15, 2020

Codecov Report

Merging #1115 into master will decrease coverage by 0.04%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1115      +/-   ##
==========================================
- Coverage   88.25%   88.21%   -0.05%     
==========================================
  Files          67       66       -1     
  Lines        6275     6312      +37     
==========================================
+ Hits         5538     5568      +30     
- Misses        737      744       +7
Impacted Files Coverage Δ
src/gluonnlp/utils/parameter.py 87.09% <100%> (+3.16%) ⬆️
src/gluonnlp/utils/files.py 42.62% <0%> (-6.4%) ⬇️
src/gluonnlp/optimizer/bert_adam.py 87.32% <0%> (-5.86%) ⬇️
src/gluonnlp/model/train/language_model.py 88.51% <0%> (-5.27%) ⬇️
src/gluonnlp/optimizer/__init__.py 100% <0%> (ø) ⬆️
src/gluonnlp/utils/version.py 100% <0%> (ø) ⬆️
src/gluonnlp/optimizer/lamb.py
src/gluonnlp/base.py 89.65% <0%> (+3.44%) ⬆️
src/gluonnlp/data/utils.py 86.39% <0%> (+12.34%) ⬆️

@mli (Member) commented Jan 15, 2020

Job PR-1115/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1115/1/index.html

@eric-haibin-lin (Member) left a comment

Thanks. Could you also update https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/utils.py#L118 clip_global_norm in mxnet with the multi square sum op? Thanks
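
For reference, a hedged sketch of what that change to clip_global_norm in mxnet.gluon.utils could look like, again assuming mx.nd.multi_sum_sq is available and that all arrays live on one context; this illustrates the suggestion rather than any patch actually submitted to MXNet.

```python
# Hypothetical sketch of mxnet.gluon.utils.clip_global_norm rewritten to use
# the fused operator. Assumes all arrays share one context; the shipped
# utility also offers a check_isfinite option, omitted here for brevity.
import math
import mxnet as mx

def clip_global_norm(arrays, max_norm):
    """Rescale `arrays` in place so that their global L2 norm is <= max_norm."""
    assert len(arrays) > 0
    # One fused kernel call replaces the per-array dot-product loop.
    sums = mx.nd.multi_sum_sq(*arrays, num_arrays=len(arrays))
    total_norm = math.sqrt(sums.sum().asscalar())
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for arr in arrays:
            arr *= scale  # in-place multiply on each NDArray
    return total_norm
```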

@eric-haibin-lin eric-haibin-lin added the release focus Progress focus for release label Jan 15, 2020
@leezu (Contributor) commented Jan 16, 2020

@MoisesHer please fix lint

@mli (Member) commented Jan 16, 2020

Job PR-1115/2 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1115/2/index.html

@mli (Member) commented Jan 16, 2020

Job PR-1115/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1115/3/index.html

@eric-haibin-lin (Member) left a comment

One more comment; otherwise it looks good to me.

Review comment on src/gluonnlp/utils/parameter.py (outdated, resolved)
@mli (Member) commented Jan 20, 2020

Job PR-1115/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1115/4/index.html

@eric-haibin-lin eric-haibin-lin merged commit f19ace9 into dmlc:master Jan 21, 2020
Labels: release focus (Progress focus for release)
4 participants