This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[Enhancement] BERT pre-training data generation from sentencepiece vocab #743

Merged: 30 commits, Jun 5, 2019

eric-haibin-lin (Member) commented Jun 1, 2019

Description

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

codecov bot commented Jun 1, 2019

Codecov Report

❗ No coverage uploaded for pull request head (raw@abc4ab7).
The diff coverage is n/a.

codecov bot commented Jun 1, 2019

Codecov Report

Merging #743 into master will decrease coverage by 0.09%.
The diff coverage is 66.66%.

@@            Coverage Diff            @@
##           master     #743     +/-   ##
=========================================
- Coverage   90.58%   90.49%   -0.1%     
=========================================
  Files          66       66             
  Lines        6121     6120      -1     
=========================================
- Hits         5545     5538      -7     
- Misses        576      582      +6

Impacted Files                     Coverage Δ
src/gluonnlp/model/block.py        51.92% <ø> (ø) ⬆️
src/gluonnlp/data/transforms.py    78.18% <ø> (ø) ⬆️
src/gluonnlp/model/utils.py        76.72% <0%> (ø) ⬆️
src/gluonnlp/data/stream.py        89.61% <100%> (+0.11%) ⬆️
src/gluonnlp/model/bert.py         99.27% <50%> (+2.63%) ⬆️
src/gluonnlp/data/utils.py         74.82% <63.63%> (-1.43%) ⬇️
src/gluonnlp/data/dataloader.py    83.62% <0%> (-5.18%) ⬇️

mli (Member) commented Jun 2, 2019

Job PR-743/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/1/index.html

eric-haibin-lin (Member, Author) commented

@davisliang FYI

mli (Member) commented Jun 2, 2019

Job PR-743/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/3/index.html

mli (Member) commented Jun 2, 2019

Job PR-743/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/4/index.html

mli (Member) commented Jun 2, 2019

Job PR-743/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/7/index.html

mli (Member) commented Jun 2, 2019

Job PR-743/8 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/8/index.html

mli (Member) commented Jun 3, 2019

Job PR-743/9 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/9/index.html

mli (Member) commented Jun 3, 2019

Job PR-743/10 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/10/index.html

mli (Member) commented Jun 3, 2019

Job PR-743/11 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/11/index.html

paperplanet (Member) left a comment

Nice work! Some comments.

Review comment on scripts/bert/run_pretraining_hvd.py (resolved)
Review comment on scripts/bert/index.rst (outdated, resolved)
Review comment on scripts/bert/create_pretraining_data.py (outdated, resolved)

mli (Member) commented Jun 4, 2019

Job PR-743/12 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/12/index.html

mli (Member) commented Jun 5, 2019

Job PR-743/13 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/13/index.html

szha added the "release focus" label (Progress focus for release) on Jun 5, 2019
Review comment on scripts/bert/index.rst (outdated, resolved)
paperplanet (Member) left a comment

Nice work! Looks good to me!


Run pre-training with horovod on node0 and node1, with 8 GPUs each:

$ mpirun -np 16 -H node0:8,node1:8 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude docker0,lo --map-by ppr:4:socket -x NCCL_MIN_NRINGS=8 -x NCCL_DEBUG=WARNING -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 --tag-output python run_pretraining_hvd.py --batch_size 8192 --accumulate 1 --lr 1e-4 --data "/path/to/generated/samples/train/*.npz" --warmup_ratio 0.01 --num_steps 1000000 --log_interval=250 --ckpt_dir './ckpt' --ckpt_interval 25000 --num_buckets 10 --dtype float16 --use_avg_len --verbose
Contributor

This line raises a warning when the documentation is generated: "Inline emphasis start-string without end-string." (Warnings are treated as fatal in our CI setup.)

eric-haibin-lin (Member, Author)

Thanks. I split them into two separate blocks with code-block annotations.
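
For context on the warning discussed in this thread: reStructuredText treats `*` as the start of inline emphasis, so the likely trigger is the `*.npz` glob in the `--data` argument when the command sits in an ordinary paragraph. Text inside a `code-block` directive is taken literally and is not scanned for inline markup. The fragment below is only a sketch of that kind of fix, not the change actually merged in this PR; the `console` lexer name and the backslash line continuations are assumptions made for illustration.

```rst
Run pre-training with horovod on node0 and node1, with 8 GPUs each:

.. code-block:: console

    $ mpirun -np 16 -H node0:8,node1:8 -mca pml ob1 -mca btl ^openib \
          -mca btl_tcp_if_exclude docker0,lo --map-by ppr:4:socket \
          -x NCCL_MIN_NRINGS=8 -x NCCL_DEBUG=WARNING \
          -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 --tag-output \
          python run_pretraining_hvd.py --batch_size 8192 --accumulate 1 \
          --lr 1e-4 --data "/path/to/generated/samples/train/*.npz" \
          --warmup_ratio 0.01 --num_steps 1000000 --log_interval=250 \
          --ckpt_dir './ckpt' --ckpt_interval 25000 --num_buckets 10 \
          --dtype float16 --use_avg_len --verbose
```

The blank line after the directive and the consistent indentation of the command are required for the directive body to be parsed as literal content.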

leezu (Contributor) left a comment

Thanks!

mli (Member) commented Jun 5, 2019

Job PR-743/15 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-743/15/index.html

@eric-haibin-lin eric-haibin-lin merged commit 4f89ca0 into dmlc:master Jun 5, 2019
paperplanet pushed a commit to paperplanet/gluon-nlp that referenced this pull request Jun 9, 2019
…cab (dmlc#743)

* enable fp16 for ln. enable gelu

* manage processes

* manual kill

* support sentencepiece

* don't register signal handler

* add unigram sampling

* support comma for npz format

* fix file race condition

* switch to thread prefetcher

* fix pool.apply

* revert gelu support

* update documentation

* code cleanup

* avoid file download conflict

* update doc

* fix bug

* bug fix

* more multi-processing

* fix lint

* update doc and fix lint

* fix test argument

* remove -test_bert_sentencepiece_sentences_transform()

* bug fix

* fix lint

* fix doc build

Labels: release focus (Progress focus for release)
5 participants