-
Notifications
You must be signed in to change notification settings - Fork 538
[Usability] Unify BERT horovod and kvstore pre-training script #889
Conversation
Codecov Report
|
Codecov Report
@@ Coverage Diff @@
## master #889 +/- ##
==========================================
+ Coverage 89.95% 90.24% +0.29%
==========================================
Files 67 67
Lines 6360 6438 +78
==========================================
+ Hits 5721 5810 +89
+ Misses 639 628 -11
|
Job PR-889/1 is complete. |
Job PR-889/2 is complete. |
Job PR-889/3 is complete. |
Job PR-889/4 is complete. |
Job PR-889/5 is complete. |
Job PR-889/7 is complete. |
next_sentence_label, segment_id, valid_length) | ||
classified, decoded, ls1, ls2 = out | ||
ls = ls1 + ls2 | ||
ls = ls / args.accumulate | ||
if args.dtype == 'float16': | ||
self._trainer.backward(ls) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is a little bit strange here. Is it true that when args.dtype == 'float16', _trainer is always expected? If so, maybe write if _trainer is not None:
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also change the definition of --batch_size to be consistent with the one in Bert paper? Currently, it is batch_size * seq_length, which is very confusing for users. Since there is alreay max_seq_length in the arguments, would this multiplied with seq_length still needed here?
Job PR-889/8 is complete. |
Job PR-889/13 is complete. |
Job PR-889/15 is complete. |
Job PR-889/16 is complete. |
Job PR-889/17 is complete. |
It shouldn't, as the PR number (889) doesn't change. None of the recent commits at https://github.com/jenkinsci/github-branch-source-plugin/commits/master seems to (intentionally) introduce this behaviour so I am unsure why it happened. |
scripts/tests/test_scripts.py
Outdated
# Test only if horovod is present | ||
import horovod.mxnet as hvd | ||
except ImportError: | ||
print("The test expects master branch of MXNet and Horovod. Skipped now.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we execute this test on our CI? The time-stamps / execution time for both master-gpu-integration
and gpu-integration
suggest that it is skipped for both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test with 'device' option should be execute (it's now appearing in the list of slowest tests). I tried to add horovod dependency on CI (https://github.com/dmlc/gluon-nlp/pull/775/files) but I failed with multipled attempts. Unfortunately the log is gone and I cannot find the exact error msg..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove the try / except block given that horovod is now working on CI? (based on your latest commit and cuda upgrade on CI)
Job PR-889/18 is complete. |
Job PR-889/19 is complete. |
Job PR-889/20 is complete. |
Job PR-889/21 is complete. |
Job PR-889/22 is complete. |
Job PR-889/23 is complete. |
Job PR-889/24 is complete. |
Description
Checklist
Essentials
Changes
Comments