
Add BERTTokenizer for pre-trained SentencePiece model #669

Merged
eric-haibin-lin merged 19 commits into dmlc:master on May 23, 2019

Conversation

haven-jeon
Member

Description

#653

  • BERTTokenizer needs to work with a SentencePiece tokenizer model when we build BERT from pre-training.
  • The SWIG object (the SentencePiece object) should be usable with multiprocessing (see the sketch below).
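Because the SentencePiece model lives behind a SWIG object that cannot be pickled directly, one common way to make such a tokenizer usable with multiprocessing is to drop the SWIG handle on pickling and reload the model on unpickling. A minimal sketch of that pattern (illustrative only, not necessarily how this PR implements it):

import sentencepiece as spm

class PicklableSPTokenizer(object):
    """Illustrative tokenizer that reloads its SentencePiece model after
    unpickling, since the underlying SWIG object itself cannot be pickled."""

    def __init__(self, model_path):
        self._model_path = model_path
        self._sp = spm.SentencePieceProcessor()
        self._sp.Load(model_path)

    def __call__(self, text):
        # split raw text into subword pieces
        return self._sp.EncodeAsPieces(text)

    def __getstate__(self):
        state = self.__dict__.copy()
        state['_sp'] = None          # drop the unpicklable SWIG object
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._sp = spm.SentencePieceProcessor()
        self._sp.Load(self._model_path)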

Sample usage:

import gluonnlp as nlp

# use the vocabulary from the pre-trained model for tokenization
# (`vocabulary` is a BERTVocab built from the SentencePiece model)
bert_tokenizer = nlp.data.BERTSPTokenizer("test-0690baed.bpe", vocabulary, lower=True)
# maximum sequence length
max_len = 128
# the labels for the two classes
all_labels = ["0", "1"]
# whether to transform the data as sentence pairs;
# for single-sentence classification, set pair=False
pair = True
# `dataset` here refers to the module providing BERTDatasetTransform
# (e.g. the BERT fine-tuning example), and `data_train_raw` is the raw dataset
transform = dataset.BERTDatasetTransform(bert_tokenizer, max_len,
                                         labels=all_labels,
                                         label_dtype='int32',
                                         pad=True,
                                         pair=pair)
data_train = data_train_raw.transform(transform)
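For reference, a sketch of calling the tokenizer directly (the subword pieces shown are made up; the actual output depends on the SentencePiece model file):

tokens = bert_tokenizer('gluonnlp is awesome')
print(tokens)                   # e.g. ['▁gluon', 'nlp', '▁is', '▁awesome']
token_ids = vocabulary[tokens]  # look up subword ids in the BERT vocabulary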

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@haven-jeon haven-jeon requested a review from szha as a code owner April 20, 2019 03:03
@mli
Member

mli commented Apr 20, 2019

Job PR-669/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/1/index.html

@codecov

codecov bot commented Apr 20, 2019

Codecov Report

Merging #669 into master will decrease coverage by 0.05%.
The diff coverage is 25%.

@@            Coverage Diff             @@
##           master     #669      +/-   ##
==========================================
- Coverage   63.74%   63.68%   -0.06%     
==========================================
  Files         141      141              
  Lines       12902    12922      +20     
==========================================
+ Hits         8224     8229       +5     
- Misses       4678     4693      +15
Flag Coverage Δ
#PR669 63.68% <25%> (?)
#master ?
#notserial 42.6% <25%> (-0.03%) ⬇️
#py2 63.43% <25%> (-0.06%) ⬇️
#py3 63.56% <25%> (-0.06%) ⬇️
#serial 49.18% <25%> (-0.04%) ⬇️

@codecov

codecov bot commented Apr 20, 2019

Codecov Report

Merging #669 into master will increase coverage by 0.11%.
The diff coverage is 73.38%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #669      +/-   ##
==========================================
+ Coverage   90.53%   90.65%   +0.11%     
==========================================
  Files          65       65              
  Lines        5991     6067      +76     
==========================================
+ Hits         5424     5500      +76     
  Misses        567      567
Impacted Files Coverage Δ
src/gluonnlp/data/transforms.py 78.18% <47.05%> (+0.82%) ⬆️
src/gluonnlp/data/utils.py 78.35% <77.27%> (-1.48%) ⬇️
src/gluonnlp/vocab/bert.py 85.04% <98.03%> (+10.04%) ⬆️
src/gluonnlp/data/dataloader.py 88.79% <0%> (+5.17%) ⬆️

Member

@szha szha left a comment

Thanks for the patch. How much does it differ from our sentence piece tokenizer? (I just realized that this class wasn't shown in the data API doc.)

@haven-jeon
Member Author

Thanks for the patch. How much does it differ from our sentence piece tokenizer? (I just realized that this class wasn't shown in the data API doc.)

I suggested this class for use in BERTSentenceTransform, which requires the BERTTokenizer interface.
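Roughly, that interface amounts to a callable that maps raw text to subword tokens and carries the vocabulary used to convert them to ids. A hedged sketch with illustrative names (not the actual class in this PR):

class IllustrativeSPTokenizer(object):
    """Sketch of what a BERTTokenizer-style tokenizer looks like from the
    transform's point of view: callable on a string, returns subword tokens,
    and exposes the BERT vocabulary."""

    def __init__(self, sp_model, vocab, lower=True):
        self._sp_model = sp_model   # a loaded sentencepiece.SentencePieceProcessor
        self.vocab = vocab          # BERTVocab with [CLS]/[SEP]/padding tokens
        self._lower = lower

    def __call__(self, sample):
        text = sample.lower() if self._lower else sample
        return self._sp_model.EncodeAsPieces(text)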

Member

@eric-haibin-lin eric-haibin-lin left a comment

Nice work! Would you mind adding a unit test for this?

src/gluonnlp/data/transforms.py (outdated)
def _tokenize_wordpiece(self, text):
"""Tokenizes a piece of text into its word pieces.

This use Google's SentencePiece tokenizer model file
Member

@eric-haibin-lin eric-haibin-lin May 9, 2019

This use -> This uses

Member Author

@haven-jeon haven-jeon May 12, 2019

Nice work! Would you mind adding a unit test for this?

I am working on it.

Member Author

@eric-haibin-lin How can I upload a test SentencePiece model for the unit test, similar to test_sentencepiece_tokenizer_subword_regularization?

Member

@haven-jeon I can help with that

Member

I uploaded the file to apache-mxnet/gluon/dataset/vocab/test-682b5d15.bpe

Member Author

Thanks, I am working on it.

@eric-haibin-lin eric-haibin-lin added the release focus label (Progress focus for release) on May 10, 2019
@mli
Member

mli commented May 15, 2019

Job PR-669/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/6/index.html

@szha
Member

szha commented May 15, 2019

In GluonNLP-py2-cpu-unittest:

_________________ test_bert_sentencepiece_sentences_transform __________________
[gw1] linux2 -- Python 2.7.15 /var/lib/jenkins/workspace/gluon-nlp-cpu-py2/conda/cpu/py2/bin/python

    @pytest.mark.remote_required
    def test_bert_sentencepiece_sentences_transform():
        url = 'http://repo.mxnet.io/gluon/dataset/vocab/test-682b5d15.bpe'
        f = download(url, overwrite=True)
        bert_vocab = BERTVocab.from_sentencepiece(f)
        bert_tokenizer = t.BERTSPTokenizer(f, bert_vocab, lower=True)
        max_len = 36
        data_train_raw = SimpleDataset(
            [['This is a very awesome, life-changing sentence.']])
        transform = t.BERTSentenceTransform(bert_tokenizer,
                                            max_len,
                                            pad=True,
                                            pair=False)
        try:
            data_train = data_train_raw.transform(transform)
        except ImportError:
            warnings.warn(
                "Sentencepiece not installed, skip test_bert_sentencepiece_sentences_transform()."
            )
            return
>       processed = list(data_train)[0]

tests/unittest/test_transforms.py:286: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
conda/cpu/py2/lib/python2.7/site-packages/mxnet/gluon/data/dataset.py:125: in __getitem__
    return self._fn(item)
src/gluonnlp/data/transforms.py:1216: in __call__
    tokens_a = self._tokenizer(text_a)
src/gluonnlp/data/transforms.py:996: in __call__
    return self._tokenizer(sample)
src/gluonnlp/data/transforms.py:1000: in _tokenizer
    for token in self.basic_tokenizer(text):
src/gluonnlp/data/transforms.py:803: in __call__
    return self._tokenize(sample)
src/gluonnlp/data/transforms.py:807: in _tokenize
    text = self._clean_text(text)
src/gluonnlp/data/transforms.py:832: in _clean_text
    if cp in (0, 0xfffd) or self._is_control(char):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <gluonnlp.data.transforms.BERTBasicTokenizer object at 0x7ff5c46e3d10>
char = 'T'

    def _is_control(self, char):
        """Checks whether `chars` is a control character."""
        # These are technically control characters but we count them as whitespace
        # characters.
        if char in ['\t', '\n', '\r']:
            return False
>       cat = unicodedata.category(char)
E       TypeError: category() argument 1 must be unicode, not str

src/gluonnlp/data/transforms.py:846: TypeError
----------------------------- Captured stdout call -----------------------------
Downloading test-682b5d15.bpe from http://repo.mxnet.io/gluon/dataset/vocab/test-682b5d15.bpe...

@haven-jeon
Member Author

In GluonNLP-py2-cpu-unittest: … TypeError: category() argument 1 must be unicode, not str (full traceback quoted above)

We need to add a unicode conversion function that works on both Py2 and Py3.

src/gluonnlp/data/transforms.py (outdated)
src/gluonnlp/vocab/bert.py (outdated)
src/gluonnlp/vocab/bert.py (outdated)
@eric-haibin-lin
Member

Would it work if you just change the test case to:
[[u'This is a very awesome, life-changing sentence.']])?

@haven-jeon
Member Author

haven-jeon commented May 19, 2019

Would it work if you just change the test case to:
[[u'This is a very awesome, life-changing sentence.']])?

It still fails because the Vocab tokens are not unicode on Py2. We need to check the vocab tokens.

BTW, could you give me a guide to running the unit tests locally?

@szha
Member

szha commented May 19, 2019

The steps of GluonNLP-py2-cpu-unittest are in https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py2_cpu_unittest#L53-L56
The function called is here: https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/build_steps.groovy#L41-L61

You will need anaconda or miniconda for creating the environment.

@eric-haibin-lin
Member

eric-haibin-lin commented May 19, 2019

I see. Then maybe we need to add some util.to_unicode():

if py2:
    return s.decode('utf-8')
else:
    return str(s)

and apply this to the output of sp_tokenizer.tokens?
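A minimal sketch of such a helper, with hypothetical names (not the final API added by the PR):

import sys

_IS_PY2 = sys.version_info[0] == 2

def to_unicode(s):
    """Return `s` as a unicode string on both Py2 and Py3."""
    if _IS_PY2 and isinstance(s, bytes):
        return s.decode('utf-8')
    return s

# e.g. applied when collecting the SentencePiece vocabulary:
# tokens = [to_unicode(t) for t in sp_tokenizer.tokens]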

@haven-jeon
Member Author

haven-jeon commented May 19, 2019

The steps of GluonNLP-py2-cpu-unittest are in https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py2_cpu_unittest#L53-L56
The function called is here: https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/build_steps.groovy#L41-L61

You will need anaconda or miniconda for creating the environment.

Thanks.

I see. Then maybe we need to add some util.to_unicode():

if py2:
    return s.decode('utf-8')
else:
    return str(s)

and apply this to the output of sp_tokenizer.tokens?

Yes, I'm applying a unicode conversion function to the vocab tokens.

@szha
Member

szha commented May 19, 2019

Thanks. Do the Py3 tests need another env?

Yes, there's the corresponding Jenkins file for it too.

@mli
Member

mli commented May 19, 2019

Job PR-669/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/7/index.html

@mli
Member

mli commented May 19, 2019

Job PR-669/8 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/8/index.html

@mli
Member

mli commented May 19, 2019

Job PR-669/9 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/9/index.html

@mli
Member

mli commented May 19, 2019

Job PR-669/10 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/10/index.html

@eric-haibin-lin
Member

@haven-jeon I made some changes to make sure the vocab is consistent with the one from the sentencepiece file. Please let me know if you think the change is reasonable.
@leezu I need to manually construct token2idx and idx2token for the vocab due to the existing inflexible special-token registration in the Vocab class (special tokens are always inserted at the beginning, which makes the vocab inconsistent with the original one). The code could be simplified considerably once the flexible special-token registration functionality is done, I guess.
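A rough sketch of the idea, building the index maps directly from the SentencePiece model so that token ids stay consistent with the original file (illustrative only, not the exact code in this PR):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('test-682b5d15.bpe')

# preserve the original SentencePiece ordering instead of letting Vocab
# re-insert special tokens at the beginning
idx_to_token = [sp.IdToPiece(i) for i in range(sp.GetPieceSize())]
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}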

@mli
Member

mli commented May 22, 2019

Job PR-669/12 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/12/index.html

@leezu
Contributor

leezu commented May 22, 2019

@eric-haibin-lin thanks; I was not aware this PR is blocked by #393.

Due to

UNK_IDX = 0 # This should not be changed as long as serialized token
# embeddings redistributed on S3 contain an unknown token.
# Blame this code change and see commit for more context.

we need to first regenerate some of our files on S3. I'm working on it.

src/gluonnlp/data/utils.py (outdated)
src/gluonnlp/data/utils.py (outdated)
unicode string
"""
py_version = sys.version_info[0]
if py_version == 3:
Member

The Python version won't change during the process, so the version check can be moved outside the function so that it only happens once.
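For example, the check can be hoisted to module level so it runs once at import time (a sketch with hypothetical names):

import sys

_IS_PY3 = sys.version_info[0] == 3  # evaluated once when the module is imported

def _convert_to_unicode(text):
    """Return `text` as a unicode string on both Py2 and Py3."""
    if _IS_PY3:
        return text
    return text.decode('utf-8') if isinstance(text, bytes) else text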

@mli
Member

mli commented May 23, 2019

Job PR-669/13 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/13/index.html

@eric-haibin-lin eric-haibin-lin merged commit 277bd0d into dmlc:master May 23, 2019
# segment id
assert all(processed[2] == np.array([0] * max_len, dtype='int32'))

test_bert_sentencepiece_sentences_transform()
Member

@szha szha Jun 2, 2019

This call shouldn't have been added, as it makes the pytest marker not function properly in some cases. I will remove it.

Member

please do

paperplanet pushed a commit to paperplanet/gluon-nlp that referenced this pull request Jun 9, 2019
* add BERTSPTokenizer

* update doc

* fix doc

* add BERTVocab.from_sentencepiece()
add test case

* fix doctest

* fix lint error

* fix doctest and lint error

* update for unicode processing(py2)

* remove comments

* disable lint error

* fix flake8 error

* make vocab consistent

* remove six. make convert2unicode private