Skip to content
This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[Usability] Unify BERT horovod and kvstore pre-training script #889

Merged
merged 32 commits into from
Sep 24, 2019

Conversation

eric-haibin-lin
Copy link
Member

@eric-haibin-lin eric-haibin-lin commented Aug 21, 2019

Description

  • changed per GPU batch size option to total batch size option
  • add comm_backend option which supports horovod and kvstore (to add byteps support soon)
  • changed short_seq_prob default value to 0

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@codecov
Copy link

codecov bot commented Aug 21, 2019

Codecov Report

❗ No coverage uploaded for pull request head (unify@9078e9f). Click here to learn what that means.
The diff coverage is n/a.

@codecov
Copy link

codecov bot commented Aug 21, 2019

Codecov Report

Merging #889 into master will increase coverage by 0.29%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #889      +/-   ##
==========================================
+ Coverage   89.95%   90.24%   +0.29%     
==========================================
  Files          67       67              
  Lines        6360     6438      +78     
==========================================
+ Hits         5721     5810      +89     
+ Misses        639      628      -11
Impacted Files Coverage Δ
src/gluonnlp/data/sampler.py 96.5% <100%> (+0.01%) ⬆️
src/gluonnlp/model/bert.py 84.95% <0%> (-14.51%) ⬇️
src/gluonnlp/model/train/language_model.py 95.33% <0%> (-1.83%) ⬇️
src/gluonnlp/model/language_model.py 98.54% <0%> (-1.46%) ⬇️
src/gluonnlp/data/transforms.py 81.45% <0%> (-0.15%) ⬇️
src/gluonnlp/data/stream.py 84.97% <0%> (ø) ⬆️
src/gluonnlp/model/train/cache.py 97.82% <0%> (+0.2%) ⬆️
src/gluonnlp/data/utils.py 76.33% <0%> (+2.29%) ⬆️
src/gluonnlp/model/sequence_sampler.py 91.63% <0%> (+17.07%) ⬆️

EC2 Default User added 2 commits August 21, 2019 05:19
@mli
Copy link
Member

mli commented Aug 21, 2019

Job PR-889/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/1/index.html

@mli
Copy link
Member

mli commented Aug 21, 2019

Job PR-889/2 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/2/index.html

@mli
Copy link
Member

mli commented Aug 21, 2019

Job PR-889/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/3/index.html

@mli
Copy link
Member

mli commented Aug 21, 2019

Job PR-889/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/4/index.html

@mli
Copy link
Member

mli commented Aug 21, 2019

Job PR-889/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/5/index.html

@mli
Copy link
Member

mli commented Sep 7, 2019

Job PR-889/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/7/index.html

next_sentence_label, segment_id, valid_length)
classified, decoded, ls1, ls2 = out
ls = ls1 + ls2
ls = ls / args.accumulate
if args.dtype == 'float16':
self._trainer.backward(ls)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is a little bit strange here. Is it true that when args.dtype == 'float16', _trainer is always expected? If so, maybe write if _trainer is not None: ?

Copy link
Contributor

@apeforest apeforest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also change the definition of --batch_size to be consistent with the one in Bert paper? Currently, it is batch_size * seq_length, which is very confusing for users. Since there is alreay max_seq_length in the arguments, would this multiplied with seq_length still needed here?

@mli
Copy link
Member

mli commented Sep 15, 2019

Job PR-889/8 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/8/index.html

@mli
Copy link
Member

mli commented Sep 22, 2019

Job PR-889/13 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/13/index.html

@mli
Copy link
Member

mli commented Sep 23, 2019

Job PR-889/15 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/15/index.html

@mli
Copy link
Member

mli commented Sep 23, 2019

Job PR-889/16 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/16/index.html

@eric-haibin-lin eric-haibin-lin changed the title [WIP] Unify BERT horovod and kvstore pre-training script [Refactor] Unify BERT horovod and kvstore pre-training script Sep 23, 2019
@eric-haibin-lin eric-haibin-lin changed the title [Refactor] Unify BERT horovod and kvstore pre-training script [Usability] Unify BERT horovod and kvstore pre-training script Sep 23, 2019
@eric-haibin-lin
Copy link
Member Author

@szha @leezu Why does renaming the title trigger CI?

@mli
Copy link
Member

mli commented Sep 23, 2019

Job PR-889/17 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/17/index.html

@leezu
Copy link
Contributor

leezu commented Sep 23, 2019

It shouldn't, as the PR number (889) doesn't change. None of the recent commits at https://github.com/jenkinsci/github-branch-source-plugin/commits/master seems to (intentionally) introduce this behaviour so I am unsure why it happened.

# Test only if horovod is present
import horovod.mxnet as hvd
except ImportError:
print("The test expects master branch of MXNet and Horovod. Skipped now.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we execute this test on our CI? The time-stamps / execution time for both master-gpu-integration and gpu-integration suggest that it is skipped for both.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test with 'device' option should be execute (it's now appearing in the list of slowest tests). I tried to add horovod dependency on CI (https://github.com/dmlc/gluon-nlp/pull/775/files) but I failed with multipled attempts. Unfortunately the log is gone and I cannot find the exact error msg..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the try / except block given that horovod is now working on CI? (based on your latest commit and cuda upgrade on CI)

@mli
Copy link
Member

mli commented Sep 24, 2019

Job PR-889/18 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/18/index.html

@mli
Copy link
Member

mli commented Sep 24, 2019

Job PR-889/19 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/19/index.html

@mli
Copy link
Member

mli commented Sep 24, 2019

Job PR-889/20 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/20/index.html

@mli
Copy link
Member

mli commented Sep 24, 2019

Job PR-889/21 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/21/index.html

@mli
Copy link
Member

mli commented Sep 24, 2019

Job PR-889/22 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/22/index.html

@mli
Copy link
Member

mli commented Sep 24, 2019

Job PR-889/23 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/23/index.html

@mli
Copy link
Member

mli commented Sep 24, 2019

Job PR-889/24 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/24/index.html

@eric-haibin-lin eric-haibin-lin merged commit 6e4ae87 into dmlc:master Sep 24, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants