Add BERTTokenizer for pre-trained SentencePiece model #669
Conversation
Job PR-669/1 is complete.
Codecov Report
@@ Coverage Diff @@
## master #669 +/- ##
==========================================
- Coverage 63.74% 63.68% -0.06%
==========================================
Files 141 141
Lines 12902 12922 +20
==========================================
+ Hits 8224 8229 +5
- Misses 4678 4693 +15
Codecov Report
@@ Coverage Diff @@
## master #669 +/- ##
==========================================
+ Coverage 90.53% 90.65% +0.11%
==========================================
Files 65 65
Lines 5991 6067 +76
==========================================
+ Hits 5424 5500 +76
Misses 567 567
Thanks for the patch. How much does it differ from our sentence piece tokenizer? (I just realized that this class wasn't shown in the data API doc.)
I suggested this class so it can be used in BERTSentenceTransform.
Nice work! Would you mind adding a unit test for this?
def _tokenize_wordpiece(self, text):
    """Tokenizes a piece of text into its word pieces.

    This use Google's SentencePiece tokenizer model file
This use
-> This uses
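For background, `_tokenize_wordpiece` in this PR delegates to a SentencePiece model rather than implementing BERT's original greedy wordpiece matching itself. For illustration only, the classic greedy longest-match-first wordpiece algorithm can be sketched as follows (the function name and toy vocab are hypothetical, not part of this PR):

```python
def wordpiece_tokenize(word, vocab, unk='[UNK]'):
    # Greedy longest-match-first segmentation, as in the original BERT
    # wordpiece tokenizer. Continuation pieces carry the '##' prefix.
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            # No prefix of the remainder is in the vocab: emit [UNK].
            return [unk]
        pieces.append(cur)
        start = end
    return pieces
```

For example, with the toy vocab `{'un', '##aff', '##able'}`, the word `'unaffable'` splits into `['un', '##aff', '##able']`.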
Nice work! Would you mind adding a unit test for this?
I am working on it.
@eric-haibin-lin How can I upload a test SentencePiece model for the unit test, like test_sentencepiece_tokenizer_subword_regularization?
@haven-jeon I can help with that
I uploaded the file to apache-mxnet/gluon/dataset/vocab/test-682b5d15.bpe
Thanks, I am working on it.
Job PR-669/6 is complete.
In GluonNLP-py2-cpu-unittest:
We need to add a unicode conversion function that works on both Py2 and Py3.
Would it work if you just change the test case to:
It still fails because the Vocab tokens are not unicode on Py2. It needs to check the vocab tokens. BTW, could you give me a guide to running the unit tests locally?
The steps: you will need anaconda or miniconda for creating the environment.
I see. Then maybe we need to add some util.to_unicode() and apply it to the output of sp_tokenizer.tokens?
Thanks.
Yes, I'm applying a unicode conversion function to the vocab tokens.
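A minimal sketch of such a conversion applied to the vocab tokens (helper names are illustrative, not the PR's actual API):

```python
def _to_unicode(token, encoding='utf-8'):
    # Byte strings (Py2 `str` / Py3 `bytes`) are decoded; text
    # strings pass through unchanged.
    return token.decode(encoding) if isinstance(token, bytes) else token

def unicode_vocab_tokens(tokens):
    # Normalize every vocab token to unicode so Py2 comparisons
    # against SentencePiece output do not fail on byte strings.
    return [_to_unicode(t) for t in tokens]
```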
Yes, there's the corresponding Jenkins file for it too.
Job PR-669/7 is complete.
Job PR-669/8 is complete.
Job PR-669/9 is complete.
Job PR-669/10 is complete.
@haven-jeon I made some changes to make sure the vocab is consistent with the one from the sentencepiece file. Please let me know if you think the change is reasonable.
Job PR-669/12 is complete.
@eric-haibin-lin thanks; I was not aware this PR is blocked by #393. Due to gluon-nlp/src/gluonnlp/_constants.py (lines 33 to 35 in c80b771), we need to first regenerate some of our files on S3. I'm working on it.
        unicode string
    """
    py_version = sys.version_info[0]
    if py_version == 3:
The Python version won't change during the process, so the version check can be moved outside the function so that it only happens once.
Job PR-669/13 is complete.
# segment id
assert all(processed[2] == np.array([0] * max_len, dtype='int32'))

test_bert_sentencepiece_sentences_transform()
this call shouldn't have been added as it makes the pytest marker not function properly in some cases. I will remove it.
please do
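For context, the `# segment id` assertion above expects a single-sentence input to produce all-zero segment ids up to `max_len`. That convention can be sketched as follows (hypothetical helper, not the PR's code; the real logic lives in BERTSentenceTransform):

```python
def bert_segment_ids(len_a, max_len, len_b=0):
    # [CLS] + sentence A + [SEP] get segment id 0; an optional
    # sentence B + [SEP] gets segment id 1; padding positions are 0.
    ids = [0] * (len_a + 2)
    if len_b:
        ids += [1] * (len_b + 1)
    return ids + [0] * (max_len - len(ids))
```

A single sentence of 3 tokens with `max_len=8` yields `[0] * 8`, which is what the test asserts.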
* add BERTSPTokenizer
* update doc
* fix doc
* add BERTVocab.from_sentencepiece() add test case
* fix doctest
* fix lint error
* fix doctest and lint error
* update for unicode processing(py2)
* remove comments
* disable lint error
* fix flake8 error
* make vocab consistent
* remove six. make convert2unicode private
Description
#653
Sample code to use.
Checklist
Essentials
Changes
Comments