
Add BERTTokenizer for pre-trained SentencePiece model #669

Merged
eric-haibin-lin merged 19 commits into dmlc:master on May 23, 2019

Conversation

haven-jeon
Member

Description

#653

  • BERTTokenizer needs to work with a SentencePiece tokenizer model when we build BERT from pre-training.
  • The SWIG object (the SentencePiece object) should be usable with multiprocessing (see the sketch below).
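Because the SentencePiece model lives behind a SWIG object that cannot be pickled directly, one common way to make such a tokenizer usable with multiprocessing is to drop the SWIG handle on pickling and reload the model on unpickling. A minimal sketch of that pattern (illustrative only, not necessarily how this PR implements it):

import sentencepiece as spm

class PicklableSPTokenizer(object):
    """Illustrative tokenizer that reloads its SentencePiece model after
    unpickling, since the underlying SWIG object itself cannot be pickled."""

    def __init__(self, model_path):
        self._model_path = model_path
        self._sp = spm.SentencePieceProcessor()
        self._sp.Load(model_path)

    def __call__(self, text):
        # split raw text into subword pieces
        return self._sp.EncodeAsPieces(text)

    def __getstate__(self):
        state = self.__dict__.copy()
        state['_sp'] = None          # drop the unpicklable SWIG object
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._sp = spm.SentencePieceProcessor()
        self._sp.Load(self._model_path)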

Sample usage:

import gluonnlp as nlp

# use the vocabulary from the pre-trained model for tokenization
# (`vocabulary` is a BERTVocab built from the SentencePiece model)
bert_tokenizer = nlp.data.BERTSPTokenizer("test-0690baed.bpe", vocabulary, lower=True)
# maximum sequence length
max_len = 128
# the labels for the two classes
all_labels = ["0", "1"]
# whether to transform the data as sentence pairs;
# for single-sentence classification, set pair=False
pair = True
# `dataset` here refers to the module providing BERTDatasetTransform
# (e.g. the BERT fine-tuning example), and `data_train_raw` is the raw dataset
transform = dataset.BERTDatasetTransform(bert_tokenizer, max_len,
                                         labels=all_labels,
                                         label_dtype='int32',
                                         pad=True,
                                         pair=pair)
data_train = data_train_raw.transform(transform)
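For reference, a sketch of calling the tokenizer directly (the subword pieces shown are made up; the actual output depends on the SentencePiece model file):

tokens = bert_tokenizer('gluonnlp is awesome')
print(tokens)                   # e.g. ['▁gluon', 'nlp', '▁is', '▁awesome']
token_ids = vocabulary[tokens]  # look up subword ids in the BERT vocabulary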

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@haven-jeon haven-jeon requested a review from szha as a code owner April 20, 2019 03:03
@mli
Member

mli commented Apr 20, 2019

Job PR-669/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/1/index.html

@codecov

codecov bot commented Apr 20, 2019

Codecov Report

Merging #669 into master will decrease coverage by 0.05%.
The diff coverage is 25%.

@@            Coverage Diff             @@
##           master     #669      +/-   ##
==========================================
- Coverage   63.74%   63.68%   -0.06%     
==========================================
  Files         141      141              
  Lines       12902    12922      +20     
==========================================
+ Hits         8224     8229       +5     
- Misses       4678     4693      +15
Flag Coverage Δ
#PR669 63.68% <25%> (?)
#master ?
#notserial 42.6% <25%> (-0.03%) ⬇️
#py2 63.43% <25%> (-0.06%) ⬇️
#py3 63.56% <25%> (-0.06%) ⬇️
#serial 49.18% <25%> (-0.04%) ⬇️

@codecov

codecov bot commented Apr 20, 2019

Codecov Report

Merging #669 into master will increase coverage by 0.11%.
The diff coverage is 73.38%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #669      +/-   ##
==========================================
+ Coverage   90.53%   90.65%   +0.11%     
==========================================
  Files          65       65              
  Lines        5991     6067      +76     
==========================================
+ Hits         5424     5500      +76     
  Misses        567      567
Impacted Files Coverage Δ
src/gluonnlp/data/transforms.py 78.18% <47.05%> (+0.82%) ⬆️
src/gluonnlp/data/utils.py 78.35% <77.27%> (-1.48%) ⬇️
src/gluonnlp/vocab/bert.py 85.04% <98.03%> (+10.04%) ⬆️
src/gluonnlp/data/dataloader.py 88.79% <0%> (+5.17%) ⬆️

Member

@szha szha left a comment

Thanks for the patch. How much does it differ from our sentence piece tokenizer? (I just realized that this class wasn't shown in the data API doc.)

@haven-jeon
Member Author

Thanks for the patch. How much does it differ from our sentence piece tokenizer? (I just realized that this class wasn't shown in the data API doc.)

I suggested this class for use in BERTSentenceTransform, which requires the BERTTokenizer interface.
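Roughly, that interface amounts to a callable that maps raw text to subword tokens and carries the vocabulary used to convert them to ids. A hedged sketch with illustrative names (not the actual class in this PR):

class IllustrativeSPTokenizer(object):
    """Sketch of what a BERTTokenizer-style tokenizer looks like from the
    transform's point of view: callable on a string, returns subword tokens,
    and exposes the BERT vocabulary."""

    def __init__(self, sp_model, vocab, lower=True):
        self._sp_model = sp_model   # a loaded sentencepiece.SentencePieceProcessor
        self.vocab = vocab          # BERTVocab with [CLS]/[SEP]/padding tokens
        self._lower = lower

    def __call__(self, sample):
        text = sample.lower() if self._lower else sample
        return self._sp_model.EncodeAsPieces(text)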

Member

@eric-haibin-lin eric-haibin-lin left a comment

Nice work! Would you mind adding a unit test for this?

src/gluonnlp/data/transforms.py (outdated)
def _tokenize_wordpiece(self, text):
"""Tokenizes a piece of text into its word pieces.

This use Google's SentencePiece tokenizer model file
Member

@eric-haibin-lin eric-haibin-lin May 9, 2019

This use -> This uses

Member Author

@haven-jeon haven-jeon May 12, 2019

Nice work! Would you mind adding a unit test for this?

I am working on it.

Member Author

@eric-haibin-lin How can I upload a test SentencePiece model for the unit test, similar to test_sentencepiece_tokenizer_subword_regularization?

Member

@haven-jeon I can help with that

Member

I uploaded the file to apache-mxnet/gluon/dataset/vocab/test-682b5d15.bpe

Member Author

Thanks, I am working on it.

@eric-haibin-lin eric-haibin-lin added the release focus label (Progress focus for release) on May 10, 2019
@mli
Member

mli commented May 15, 2019

Job PR-669/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/6/index.html

@szha
Member

szha commented May 15, 2019

In GluonNLP-py2-cpu-unittest:

_________________ test_bert_sentencepiece_sentences_transform __________________
[gw1] linux2 -- Python 2.7.15 /var/lib/jenkins/workspace/gluon-nlp-cpu-py2/conda/cpu/py2/bin/python

    @pytest.mark.remote_required
    def test_bert_sentencepiece_sentences_transform():
        url = 'http://repo.mxnet.io/gluon/dataset/vocab/test-682b5d15.bpe'
        f = download(url, overwrite=True)
        bert_vocab = BERTVocab.from_sentencepiece(f)
        bert_tokenizer = t.BERTSPTokenizer(f, bert_vocab, lower=True)
        max_len = 36
        data_train_raw = SimpleDataset(
            [['This is a very awesome, life-changing sentence.']])
        transform = t.BERTSentenceTransform(bert_tokenizer,
                                            max_len,
                                            pad=True,
                                            pair=False)
        try:
            data_train = data_train_raw.transform(transform)
        except ImportError:
            warnings.warn(
                "Sentencepiece not installed, skip test_bert_sentencepiece_sentences_transform()."
            )
            return
>       processed = list(data_train)[0]

tests/unittest/test_transforms.py:286: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
conda/cpu/py2/lib/python2.7/site-packages/mxnet/gluon/data/dataset.py:125: in __getitem__
    return self._fn(item)
src/gluonnlp/data/transforms.py:1216: in __call__
    tokens_a = self._tokenizer(text_a)
src/gluonnlp/data/transforms.py:996: in __call__
    return self._tokenizer(sample)
src/gluonnlp/data/transforms.py:1000: in _tokenizer
    for token in self.basic_tokenizer(text):
src/gluonnlp/data/transforms.py:803: in __call__
    return self._tokenize(sample)
src/gluonnlp/data/transforms.py:807: in _tokenize
    text = self._clean_text(text)
src/gluonnlp/data/transforms.py:832: in _clean_text
    if cp in (0, 0xfffd) or self._is_control(char):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <gluonnlp.data.transforms.BERTBasicTokenizer object at 0x7ff5c46e3d10>
char = 'T'

    def _is_control(self, char):
        """Checks whether `chars` is a control character."""
        # These are technically control characters but we count them as whitespace
        # characters.
        if char in ['\t', '\n', '\r']:
            return False
>       cat = unicodedata.category(char)
E       TypeError: category() argument 1 must be unicode, not str

src/gluonnlp/data/transforms.py:846: TypeError
----------------------------- Captured stdout call -----------------------------
Downloading test-682b5d15.bpe from http://repo.mxnet.io/gluon/dataset/vocab/test-682b5d15.bpe...

@haven-jeon
Member Author

In GluonNLP-py2-cpu-unittest: … TypeError: category() argument 1 must be unicode, not str (full traceback quoted above)

We need to add a unicode conversion function that works on both Py2 and Py3.

src/gluonnlp/data/transforms.py (outdated)
src/gluonnlp/vocab/bert.py (outdated)
src/gluonnlp/vocab/bert.py (outdated)
@eric-haibin-lin
Member

Would it work if you just change the test case to:
[[u'This is a very awesome, life-changing sentence.']])?

@haven-jeon
Member Author

haven-jeon commented May 19, 2019

Would it work if you just change the test case to:
[[u'This is a very awesome, life-changing sentence.']])?

It still fails because the Vocab tokens are not unicode on Py2. We need to check the vocab tokens.

BTW, could you give me a guide to running the unit tests locally?

@szha
Member

szha commented May 19, 2019

The steps of GluonNLP-py2-cpu-unittest are in https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py2_cpu_unittest#L53-L56
The function called is here: https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/build_steps.groovy#L41-L61

You will need anaconda or miniconda for creating the environment.

@eric-haibin-lin
Member

eric-haibin-lin commented May 19, 2019

I see. Then maybe we need to add some util.to_unicode():

if py2:
    return s.decode('utf-8')
else:
    return str(s)

and apply this to the output of sp_tokenizer.tokens?
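A minimal sketch of such a helper, with hypothetical names (not the final API added by the PR):

import sys

_IS_PY2 = sys.version_info[0] == 2

def to_unicode(s):
    """Return `s` as a unicode string on both Py2 and Py3."""
    if _IS_PY2 and isinstance(s, bytes):
        return s.decode('utf-8')
    return s

# e.g. applied when collecting the SentencePiece vocabulary:
# tokens = [to_unicode(t) for t in sp_tokenizer.tokens]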

@haven-jeon
Member Author

haven-jeon commented May 19, 2019

The steps of GluonNLP-py2-cpu-unittest are in https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py2_cpu_unittest#L53-L56
The function called is here: https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/build_steps.groovy#L41-L61

You will need anaconda or miniconda for creating the environment.

Thanks.

I see. Then maybe we need to add some util.to_unicode():

if py2:
    return s.decode('utf-8')
else:
    return str(s)

and apply this to the output of sp_tokenizer.tokens?

Yes, I'm applying a unicode conversion function to the vocab tokens.

@szha
Member

szha commented May 19, 2019

Thanks. Do the Py3 tests need another env?

Yes, there's the corresponding Jenkins file for it too.

@mli
Member

mli commented May 19, 2019

Job PR-669/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/7/index.html

@mli
Member

mli commented May 19, 2019

Job PR-669/8 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/8/index.html

@mli
Member

mli commented May 19, 2019

Job PR-669/9 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/9/index.html

@mli
Member

mli commented May 19, 2019

Job PR-669/10 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/10/index.html

@eric-haibin-lin
Member

@haven-jeon I made some changes to make sure the vocab is consistent with the one from the sentencepiece file. Please let me know if you think the change is reasonable.
@leezu I need to manually construct token2idx and idx2token for the vocab due to the existing inflexible special-token registration in the Vocab class (special tokens are always inserted at the beginning, which makes the vocab inconsistent with the original one). The code could be simplified considerably once the flexible special-token registration functionality is done, I guess.
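A rough sketch of the idea, building the index maps directly from the SentencePiece model so that token ids stay consistent with the original file (illustrative only, not the exact code in this PR):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('test-682b5d15.bpe')

# preserve the original SentencePiece ordering instead of letting Vocab
# re-insert special tokens at the beginning
idx_to_token = [sp.IdToPiece(i) for i in range(sp.GetPieceSize())]
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}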

@mli
Member

mli commented May 22, 2019

Job PR-669/12 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/12/index.html

@leezu
Contributor

leezu commented May 22, 2019

@eric-haibin-lin thanks; I was not aware this PR is blocked by #393.

Due to

UNK_IDX = 0 # This should not be changed as long as serialized token
# embeddings redistributed on S3 contain an unknown token.
# Blame this code change and see commit for more context.

we need to first regenerate some of our files on S3. I'm working on it.

src/gluonnlp/data/utils.py (outdated)
src/gluonnlp/data/utils.py (outdated)
unicode string
"""
py_version = sys.version_info[0]
if py_version == 3:
Member

The Python version won't change during the process, so the version check can be moved outside the function so that it only happens once.
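For example, the check can be hoisted to module level so it runs once at import time (a sketch with hypothetical names):

import sys

_IS_PY3 = sys.version_info[0] == 3  # evaluated once when the module is imported

def _convert_to_unicode(text):
    """Return `text` as a unicode string on both Py2 and Py3."""
    if _IS_PY3:
        return text
    return text.decode('utf-8') if isinstance(text, bytes) else text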

@mli
Member

mli commented May 23, 2019

Job PR-669/13 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-669/13/index.html

@eric-haibin-lin eric-haibin-lin merged commit 277bd0d into dmlc:master May 23, 2019
# segment id
assert all(processed[2] == np.array([0] * max_len, dtype='int32'))

test_bert_sentencepiece_sentences_transform()
Member

@szha szha Jun 2, 2019

This call shouldn't have been added, as it makes the pytest marker not function properly in some cases. I will remove it.

Member

please do

paperplanet pushed a commit to paperplanet/gluon-nlp that referenced this pull request Jun 9, 2019
* add BERTSPTokenizer

* update doc

* fix doc

* add BERTVocab.from_sentencepiece()
add test case

* fix doctest

* fix lint error

* fix doctest and lint error

* update for unicode processing(py2)

* remove comments

* disable lint error

* fix flake8 error

* make vocab consistent

* remove six. make convert2unicode private