[Usability] Unify BERT horovod and kvstore pre-training script #889

eric-haibin-lin · 2019-08-21T05:13:17Z

Description

Changed the per-GPU batch size option to a total batch size option.
Added a comm_backend option that supports horovod and kvstore (byteps support to be added soon).
Changed the default value of short_seq_prob to 0.
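
For readers unfamiliar with the two options, here is a minimal sketch of how a total batch size and a backend switch could be exposed. Only the option names --batch_size and comm_backend come from this thread; the defaults, the list of choices, and the helper per_worker_batch_size are assumptions made for illustration and are not taken from run_pretraining.py.

```python
import argparse


def build_parser():
    """Hypothetical parser: option names follow the PR text, everything else is assumed."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--batch_size', type=int, default=256,
                        help='Total batch size summed over all workers.')
    parser.add_argument('--comm_backend', type=str, default='device',
                        choices=['horovod', 'device', 'dist_sync_device'],
                        help='Communication backend (choices are illustrative).')
    return parser


def per_worker_batch_size(total_batch_size, num_workers):
    """Split a total batch size evenly across workers (assumed policy)."""
    if num_workers <= 0:
        raise ValueError('num_workers must be positive')
    if total_batch_size % num_workers != 0:
        raise ValueError('total batch size must be divisible by num_workers')
    return total_batch_size // num_workers


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(['--batch_size', '256', '--comm_backend', 'horovod'])
print(args.comm_backend, per_worker_batch_size(args.batch_size, num_workers=8))
```

With a total batch size, the same command line keeps the same effective batch when the number of workers changes; only the per-worker share differs.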

Checklist

Essentials

PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

codecov · 2019-08-21T05:13:19Z

Codecov Report

❗ No coverage uploaded for pull request head (unify@9078e9f).
The diff coverage is n/a.

codecov · 2019-08-21T05:13:19Z

Codecov Report

Merging #889 into master will increase coverage by 0.29%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #889      +/-   ##
==========================================
+ Coverage   89.95%   90.24%   +0.29%     
==========================================
  Files          67       67              
  Lines        6360     6438      +78     
==========================================
+ Hits         5721     5810      +89     
+ Misses        639      628      -11

Impacted Files	Coverage Δ
src/gluonnlp/data/sampler.py	`96.5% <100%> (+0.01%)`	⬆️
src/gluonnlp/model/bert.py	`84.95% <0%> (-14.51%)`	⬇️
src/gluonnlp/model/train/language_model.py	`95.33% <0%> (-1.83%)`	⬇️
src/gluonnlp/model/language_model.py	`98.54% <0%> (-1.46%)`	⬇️
src/gluonnlp/data/transforms.py	`81.45% <0%> (-0.15%)`	⬇️
src/gluonnlp/data/stream.py	`84.97% <0%> (ø)`	⬆️
src/gluonnlp/model/train/cache.py	`97.82% <0%> (+0.2%)`	⬆️
src/gluonnlp/data/utils.py	`76.33% <0%> (+2.29%)`	⬆️
src/gluonnlp/model/sequence_sampler.py	`91.63% <0%> (+17.07%)`	⬆️

mli · 2019-08-21T05:48:54Z

Job PR-889/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/1/index.html

mli · 2019-08-21T05:55:56Z

Job PR-889/2 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/2/index.html

mli · 2019-08-21T06:15:56Z

Job PR-889/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/3/index.html

mli · 2019-08-21T06:29:05Z

Job PR-889/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/4/index.html

mli · 2019-08-21T07:12:56Z

Job PR-889/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/5/index.html

mli · 2019-09-07T00:45:28Z

Job PR-889/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/7/index.html

apeforest · 2019-09-09T22:27:28Z

scripts/bert/run_pretraining.py

+                              next_sentence_label, segment_id, valid_length)
+            classified, decoded, ls1, ls2 = out
+            ls = ls1 + ls2
+            ls = ls / args.accumulate
        if args.dtype == 'float16':
            self._trainer.backward(ls)


This logic looks a little strange. Is _trainer always expected to be set when args.dtype == 'float16'? If so, maybe write if _trainer is not None: instead?
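
The suggestion can be sketched as follows: branch on whether a trainer object is present rather than on the dtype string. This is only an illustration under assumptions. RecordingTrainer, backward_step and plain_backward are hypothetical names, and plain floats stand in for the loss arrays in the diff.

```python
class RecordingTrainer:
    """Hypothetical stand-in for a float16 trainer; it only records losses."""

    def __init__(self):
        self.seen = []

    def backward(self, loss):
        self.seen.append(loss)


def backward_step(ls1, ls2, accumulate, trainer=None, plain_backward=None):
    """Combine the two losses, average over accumulation steps, then backprop.

    The branch checks the trainer object itself, so the code cannot reach
    trainer.backward when no trainer was constructed.
    """
    ls = (ls1 + ls2) / accumulate
    if trainer is not None:
        trainer.backward(ls)
    else:
        plain_backward(ls)
    return ls


# Usage: one call with a trainer, one call on the fallback path.
trainer = RecordingTrainer()
backward_step(2.0, 4.0, accumulate=2, trainer=trainer)
fallback_calls = []
backward_step(1.0, 1.0, accumulate=1, plain_backward=fallback_calls.append)
```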

scripts/bert/run_pretraining.py

apeforest

Can we also change the definition of --batch_size to be consistent with the one in the BERT paper? Currently it is batch_size * seq_length, which is very confusing for users. Since max_seq_length is already among the arguments, is multiplying by seq_length still needed here?
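
To make the two conventions concrete, here is a small sketch that converts between a batch size counted in sequences and one counted in tokens. Both function names are hypothetical and the conversion is plain arithmetic; it does not describe how the script computes its batches.

```python
def tokens_per_batch(num_sequences, max_seq_length):
    """Batch size in tokens when every sequence is padded to max_seq_length."""
    return num_sequences * max_seq_length


def sequences_per_batch(token_budget, max_seq_length):
    """Number of full-length sequences that fit in a token budget."""
    if max_seq_length <= 0:
        raise ValueError('max_seq_length must be positive')
    return token_budget // max_seq_length


# 256 sequences of length 512 correspond to 256 * 512 tokens per batch.
budget = tokens_per_batch(256, 512)
print(budget, sequences_per_batch(budget, 512))
```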

mli · 2019-09-15T04:56:33Z

Job PR-889/8 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/8/index.html

mli · 2019-09-22T22:31:26Z

Job PR-889/13 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/13/index.html

mli · 2019-09-23T00:40:33Z

Job PR-889/15 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/15/index.html

mli · 2019-09-23T01:15:34Z

Job PR-889/16 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/16/index.html

eric-haibin-lin · 2019-09-23T05:46:44Z

@szha @leezu Why does renaming the title trigger CI?

mli · 2019-09-23T05:47:05Z

Job PR-889/17 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/17/index.html

leezu · 2019-09-23T09:03:06Z

It shouldn't, as the PR number (889) doesn't change. None of the recent commits at https://github.com/jenkinsci/github-branch-source-plugin/commits/master seems to (intentionally) introduce this behaviour so I am unsure why it happened.

leezu · 2019-09-23T09:16:59Z

scripts/tests/test_scripts.py

+            # Test only if horovod is present
+            import horovod.mxnet as hvd
+        except ImportError:
+            print("The test expects master branch of MXNet and Horovod. Skipped now.")


Can we execute this test on our CI? The time-stamps / execution time for both master-gpu-integration and gpu-integration suggest that it is skipped for both.

The test with the 'device' option should be executed (it now appears in the list of slowest tests). I tried to add the horovod dependency on CI (https://github.com/dmlc/gluon-nlp/pull/775/files) but failed after multiple attempts. Unfortunately the log is gone and I cannot find the exact error message.

Can you remove the try / except block given that horovod is now working on CI? (based on your latest commit and cuda upgrade on CI)
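
One way to avoid a silent skip, shown here as a sketch rather than as the change that was merged, is to make the skip explicit so it shows up in the test report. The example uses only the standard library; the test class and its method names are hypothetical, and it checks only for the top-level horovod package.

```python
import importlib
import importlib.util
import io
import unittest

# True only when the top-level horovod package can be located.
HAS_HOROVOD = importlib.util.find_spec('horovod') is not None


class BackendSmokeTest(unittest.TestCase):
    """Hypothetical tests; real ones would launch the pre-training script."""

    @unittest.skipUnless(HAS_HOROVOD, 'horovod is not installed')
    def test_horovod_backend(self):
        # Runs only when horovod is present; otherwise reported as skipped.
        self.assertIsNotNone(importlib.import_module('horovod'))

    def test_device_backend(self):
        # Placeholder for the kvstore 'device' case, which needs no horovod.
        self.assertTrue(True)


def run_suite():
    """Run the tests quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BackendSmokeTest)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)


result = run_suite()
print('run:', result.testsRun, 'skipped:', len(result.skipped))
```

Unlike a try / except that prints and returns, a skipped test is counted separately from a passed one, so the CI summary shows whether the horovod path actually ran.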

mli · 2019-09-24T02:06:58Z

Job PR-889/18 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/18/index.html

mli · 2019-09-24T07:00:30Z

Job PR-889/19 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/19/index.html

mli · 2019-09-24T18:26:55Z

Job PR-889/20 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/20/index.html

mli · 2019-09-24T19:21:04Z

Job PR-889/21 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/21/index.html

mli · 2019-09-24T19:54:54Z

Job PR-889/22 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/22/index.html

mli · 2019-09-24T21:06:08Z

Job PR-889/23 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/23/index.html

mli · 2019-09-24T21:36:01Z

Job PR-889/24 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-889/24/index.html

EC2 Default User added 2 commits August 21, 2019 04:04

merge two scripts

10e5eed

reduce model code

9078e9f

eric-haibin-lin requested a review from szha as a code owner August 21, 2019 05:13

EC2 Default User added 2 commits August 21, 2019 05:19

fix lint

81bf5c2

rename

9f13271

fix norm clip scale

3a5700e

fix test

63a06b9

Ubuntu and others added 2 commits September 1, 2019 18:19

remove try catch

431a031

Update run_pretraining.py

c0d4ff3

apeforest reviewed Sep 9, 2019

apeforest reviewed Sep 9, 2019

apeforest reviewed Sep 10, 2019

apeforest reviewed Sep 10, 2019

fix bug

426579d

Ubuntu and others added 6 commits September 15, 2019 17:01

slightly more efficient version of lamb

cdc4b11

Update lamb.py

c8cbb87

Update fp16_utils.py

38e1587

Update lamb.py

13c4706

Update lamb.py

4151b1e

further refactoring

f304b58

Ubuntu added 3 commits September 22, 2019 23:47

fix import

9f05ca4

add missing files

57fd4e5

fix lint

903739b

eric-haibin-lin changed the title from "[WIP] Unify BERT horovod and kvstore pre-training script" to "[Refactor] Unify BERT horovod and kvstore pre-training script" Sep 23, 2019

eric-haibin-lin changed the title from "[Refactor] Unify BERT horovod and kvstore pre-training script" to "[Usability] Unify BERT horovod and kvstore pre-training script" Sep 23, 2019

leezu reviewed Sep 23, 2019

add hvd dependency

f09de12

Update test_scripts.py

ae7f095

leezu approved these changes Sep 24, 2019

Merge remote-tracking branch 'origin/better-lamb' into unify

6c2520c

rename dummy data option. Add option for no acc compute

7184481

add prefetcher

ed8c635

eric-haibin-lin added 4 commits September 24, 2019 20:27

revert lamb changes

7a01ad8

gRevert "revert lamb changes"

d4356c6

This reverts commit 7a01ad8.

fix args.dummy_data_len usage

8c4c758

Revert "gRevert "revert lamb changes""

bb07f5f

This reverts commit d4356c6.

eric-haibin-lin merged commit 6e4ae87 into dmlc:master Sep 24, 2019
