[Enhancement] BERT pre-training data generation from sentencepiece vocab #743

eric-haibin-lin · 2019-06-01T23:17:10Z

Description

Add SentencePiece vocabulary support
Address the race condition that occurs when Horovod is used (#726)
Add more documentation
Remove the fp32 casting in layer norm
Support commas in the dataset stream API
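
As a rough illustration of the last item, the sketch below expands a comma-separated string of glob patterns into one sorted file list. It assumes the comma is meant to separate several file patterns passed as a single string; `expand_patterns` is a hypothetical helper written for this note, not the gluonnlp stream API changed in this PR.

```python
import glob
import os
import tempfile


def expand_patterns(file_pattern):
    """Expand a comma-separated string of glob patterns into a sorted file list.

    Hypothetical helper for illustration only; not the gluonnlp implementation.
    """
    files = []
    for pattern in file_pattern.split(','):
        pattern = pattern.strip()
        if not pattern:
            continue  # tolerate trailing commas and empty entries
        files.extend(glob.glob(os.path.expanduser(pattern)))
    # De-duplicate (patterns may overlap) and sort for a deterministic order.
    return sorted(set(files))


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as root:
        # Create a few empty placeholder files in two directories.
        for name in ('train/part-0.npz', 'train/part-1.npz', 'extra/part-2.npz'):
            path = os.path.join(root, name)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            open(path, 'w').close()
        spec = ','.join([os.path.join(root, 'train', '*.npz'),
                         os.path.join(root, 'extra', '*.npz')])
        for path in expand_patterns(spec):
            print(os.path.relpath(path, root))
```

Sorting the result keeps the file order stable across workers, which matters when several processes are expected to see the same list.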

Checklist

Essentials

PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented

codecov · 2019-06-01T23:17:12Z

Codecov Report

❗ No coverage uploaded for pull request head (raw@abc4ab7).
The diff coverage is n/a.

codecov · 2019-06-01T23:17:12Z

Codecov Report

Merging #743 into master will decrease coverage by 0.09%.
The diff coverage is 66.66%.

@@            Coverage Diff            @@
##           master     #743     +/-   ##
=========================================
- Coverage   90.58%   90.49%   -0.1%     
=========================================
  Files          66       66             
  Lines        6121     6120      -1     
=========================================
- Hits         5545     5538      -7     
- Misses        576      582      +6

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/gluonnlp/model/block.py | `51.92% <ø> (ø)` | ⬆️ |
| src/gluonnlp/data/transforms.py | `78.18% <ø> (ø)` | ⬆️ |
| src/gluonnlp/model/utils.py | `76.72% <0%> (ø)` | ⬆️ |
| src/gluonnlp/data/stream.py | `89.61% <100%> (+0.11%)` | ⬆️ |
| src/gluonnlp/model/bert.py | `99.27% <50%> (+2.63%)` | ⬆️ |
| src/gluonnlp/data/utils.py | `74.82% <63.63%> (-1.43%)` | ⬇️ |
| src/gluonnlp/data/dataloader.py | `83.62% <0%> (-5.18%)` | ⬇️ |
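
The report quotes the drop as 0.09% in its summary and as -0.1% in the diff block. A small sketch that recomputes the percentages from the Hits and Lines rows shows how both figures can come from the same numbers. Truncating to two decimals is an assumption made here to match the printed values; it is not a documented description of how Codecov formats its reports.

```python
import math


def coverage_percent(hits, lines):
    """Return coverage as the percentage of lines that were hit."""
    return 100.0 * hits / lines


def truncate(value, digits):
    """Drop everything beyond `digits` decimal places without rounding."""
    factor = 10 ** digits
    return math.floor(value * factor) / factor


master = coverage_percent(5545, 6121)  # Hits / Lines on master
merged = coverage_percent(5538, 6120)  # Hits / Lines with #743 merged
drop = master - merged

print(truncate(master, 2))  # 90.58
print(truncate(merged, 2))  # 90.49
print(truncate(drop, 2))    # 0.09 (summary line)
print(round(drop, 1))       # 0.1  (diff block)
```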

mli · 2019-06-02T00:09:27Z

Job PR-743/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/1/index.html

eric-haibin-lin · 2019-06-02T00:57:42Z

@davisliang FYI

mli · 2019-06-02T01:24:22Z

Job PR-743/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/3/index.html

mli · 2019-06-02T02:01:36Z

Job PR-743/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/4/index.html

mli · 2019-06-02T19:03:52Z

Job PR-743/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/7/index.html

mli · 2019-06-02T20:05:05Z

Job PR-743/8 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/8/index.html

mli · 2019-06-03T00:42:22Z

Job PR-743/9 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/9/index.html

mli · 2019-06-03T01:46:59Z

Job PR-743/10 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/10/index.html

mli · 2019-06-03T20:42:53Z

Job PR-743/11 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/11/index.html

paperplanet

Nice work! Some comments.

scripts/bert/run_pretraining_hvd.py

scripts/bert/index.rst

scripts/bert/create_pretraining_data.py

mli · 2019-06-04T23:30:33Z

Job PR-743/12 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/12/index.html

mli · 2019-06-05T03:24:59Z

Job PR-743/13 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/13/index.html

scripts/bert/index.rst

paperplanet

Nice work! Looks good to me!

leezu · 2019-06-05T15:25:11Z

scripts/bert/index.rst

+
+Run pre-training with horovod on node0 and node1, with 8 GPUs each:
+
+    $ mpirun -np 16 -H node0:8,node1:8 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude docker0,lo --map-by ppr:4:socket -x NCCL_MIN_NRINGS=8 -x NCCL_DEBUG=WARNING -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 --tag-output python run_pretraining_hvd.py --batch_size 8192 --accumulate 1 --lr 1e-4 --data "/path/to/generated/samples/train/*.npz" --warmup_ratio 0.01 --num_steps 1000000 --log_interval=250 --ckpt_dir './ckpt' --ckpt_interval 25000 --num_buckets 10 --dtype float16 --use_avg_len --verbose


This line raises a warning when the documentation is generated: "Inline emphasis start-string without end-string." Warnings are treated as fatal in our CI setup.

Thanks. I split them into two separate blocks with code-block annotations.

leezu

Thanks!
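
For context, the warning most likely comes from the `*` in `*.npz`: outside a literal block, reStructuredText reads it as the start of inline emphasis. Placing the command inside a `code-block` directive keeps it literal. The sketch below shows one way to produce such a block; `rst_code_block` is a hypothetical helper written for this note, and the `console` language and the shortened command are illustrative choices, not the exact text committed to `scripts/bert/index.rst`.

```python
import textwrap


def rst_code_block(command, language='console', indent=4):
    """Wrap `command` in a reStructuredText code-block directive.

    Hypothetical helper for illustration. The directive body is literal text,
    so characters such as `*` are not parsed as inline markup.
    """
    body = textwrap.indent(command.strip('\n'), ' ' * indent)
    return '.. code-block:: {}\n\n{}\n'.format(language, body)


if __name__ == '__main__':
    print(rst_code_block(
        '$ python run_pretraining_hvd.py --data "/path/to/generated/samples/train/*.npz"'))
```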

mli · 2019-06-05T19:13:20Z

Job PR-743/15 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/15/index.html

…cab (dmlc#743)

* enable fp16 for ln. enable gelu
* manage processes
* manual killl
* support sentencepiece
* dont register signal handler
* add unigram sampling
* support comma for npz format
* fix file race condition
* switch to thread prefetcher
* fix pool.apply
* revert gelu support
* update documentation
* code cleanup
* avoid file download conflcit
* update doc
* fix bug
* bug fix
* more multi-processing
* fix lint
* update doc and fix lint
* fix test argument
* remove -test_bert_sentencepiece_sentences_transform()
* bug fix
* fix lint
* fix doc build

eric-haibin-lin and others added 16 commits May 22, 2019 12:11

enable fp16 for ln. enable gelu

071d410

manage processes

f134c01

manual killl

97e4d9a

Merge remote-tracking branch 'origin/master' into raw

1c38510

Merge remote-tracking branch 'haibin/gelu-pr' into raw

f0b0659

support sentencepiece

069a432

dont register signal handler

dd56748

add unigram sampling

e99d090

support comma for npz format

d7a4c05

fix file race condition

f4174ad

switch to thread prefetcher

28bc3e7

fix pool.apply

0abd720

revert gelu support

98cce41

update documentation

053a386

Merge remote-tracking branch 'upstream/master' into raw

3872150

code cleanup

abc4ab7

eric-haibin-lin requested a review from szha as a code owner June 1, 2019 23:17

eric-haibin-lin requested a review from vanewu June 1, 2019 23:23

avoid file download conflcit

7984f9d

EC2 Default User added 3 commits June 2, 2019 00:17

merge

35d42ba

update doc

3795ca9

fix bug

e16ce64

EC2 Default User added 2 commits June 2, 2019 04:57

bug fix

9d5a911

more multi-processing

faad804

eric-haibin-lin added 2 commits June 2, 2019 16:03

fix test argument

7b4ab2c

remove -test_bert_sentencepiece_sentences_transform()

ca254a7

paperplanet reviewed Jun 4, 2019

scripts/bert/run_pretraining_hvd.py

scripts/bert/index.rst (outdated)

scripts/bert/create_pretraining_data.py (outdated)

eric-haibin-lin mentioned this pull request Jun 4, 2019

[Enhancement] Enable fp16 layernorm and fused GELU #723 (Closed)

bug fix

d926de2

fix lint

597dff3

szha added the "release focus" label (Progress focus for release) Jun 5, 2019

eric-haibin-lin requested review from haven-jeon, hankcs and leezu June 5, 2019 04:46

szha reviewed Jun 5, 2019

scripts/bert/index.rst (outdated)

szha approved these changes Jun 5, 2019

fix conflict

f2b963d

paperplanet approved these changes Jun 5, 2019

leezu reviewed Jun 5, 2019

leezu approved these changes Jun 5, 2019

fix doc build

da1dfd0

eric-haibin-lin merged commit 4f89ca0 into dmlc:master Jun 5, 2019

eric-haibin-lin deleted the raw branch October 12, 2019 00:58
