Add notebook for text classification with BERT and text classification tokenization caching #382

ericharper · 2020-02-19T00:23:37Z

Add notebooks detailing text classification with BERT using NeMo.

text_classification_with_bert_inference.ipynb

We also add a script that converts the text classification tokenization to hdf5: preproc_text_data.py

We then add preproc versions of both the text classification dataset and the text classification data layer.

The text classification preproc data layer is similar to the BERT pretraining preproc data layer.

We found it more efficient to cache the tokenization when training. We were able to get better training times and better scaling when using multiple GPUs.
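To illustrate the caching idea described above, here is a minimal sketch, not the code from this PR. It swaps HDF5 for a plain pickle file so it runs with only the standard library, and `whitespace_tokenize`, `load_or_tokenize`, and the cache file naming are hypothetical names introduced for the example. The `use_cache` parameter only mirrors the flag name mentioned in the commit messages; its real behavior in NeMo may differ.

```python
import hashlib
import os
import pickle
import tempfile


def whitespace_tokenize(text):
    # Hypothetical stand-in for a real BERT tokenizer, to keep the sketch dependency-free.
    return text.lower().split()


def load_or_tokenize(sentences, cache_dir, max_seq_length, use_cache=True):
    """Tokenize sentences, reusing a cached copy on disk when one exists.

    Returns (tokens, cache_hit).
    """
    # Key the cache on the inputs and on max_seq_length so that changing
    # either one produces a new cache file instead of a stale hit.
    digest = hashlib.sha256()
    digest.update(str(max_seq_length).encode("utf-8"))
    for sentence in sentences:
        digest.update(b"\0")
        digest.update(sentence.encode("utf-8"))
    cache_path = os.path.join(cache_dir, f"tokens_{digest.hexdigest()[:16]}.pkl")

    if use_cache and os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f), True

    # Cache miss: tokenize once, truncate, and store the result for later runs.
    tokens = [whitespace_tokenize(s)[:max_seq_length] for s in sentences]
    if use_cache:
        with open(cache_path, "wb") as f:
            pickle.dump(tokens, f)
    return tokens, False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        data = ["A good movie", "bad"]
        _, first_hit = load_or_tokenize(data, cache_dir, max_seq_length=2)
        _, second_hit = load_or_tokenize(data, cache_dir, max_seq_length=2)
        print(first_hit, second_hit)
```

With multiple GPU workers, each process would read the same cached file rather than re-tokenizing the dataset, which is one plausible reason caching helps scaling.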

The notebooks also visualize the BERT embeddings before and after fine-tuning by applying t-SNE.

The inference notebooks allow the user to see how well the fine-tuned models are classifying their data.

We also include a script to download the SST-2 dataset for use in the notebook.

We changed the text classification dataset to use the user-specified max seq length.
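As a rough illustration of what honoring a user-specified max seq length involves, the sketch below truncates or pads a list of token ids to a fixed length and builds the matching attention mask. The function name `pad_or_truncate` and the `pad_id=0` default are assumptions made for this example, and the sketch ignores special tokens such as `[CLS]` and `[SEP]`, which a real BERT dataset has to preserve.

```python
def pad_or_truncate(token_ids, max_seq_length, pad_id=0):
    """Return (input_ids, input_mask), each exactly max_seq_length long."""
    if max_seq_length <= 0:
        raise ValueError("max_seq_length must be positive")
    # Truncate sequences that are too long.
    ids = list(token_ids[:max_seq_length])
    # Real tokens get mask 1, padding positions get mask 0.
    mask = [1] * len(ids)
    padding = max_seq_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding
```

A fixed length lets every example in a batch share the same shape, at the cost of dropping tokens beyond the limit.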

lgtm-com · 2020-02-19T00:34:24Z

This pull request introduces 3 alerts when merging 0bc3c7a into abaac1b - view on LGTM.com

new alerts:

3 for Unused import

okuchaiev · 2020-02-19T00:39:32Z

can you please sign your commits first? The DCO bot's output shows how to do it: https://github.com/NVIDIA/NeMo/pull/382/checks?check_run_id=454165747

ericharper · 2020-02-19T01:19:15Z

Okay, I was able to sign off on the commits.

lgtm-com · 2020-02-19T17:52:34Z

This pull request introduces 3 alerts when merging 3e2a987 into abaac1b - view on LGTM.com

new alerts:

3 for Unused import

okuchaiev · 2020-02-21T20:56:13Z

@ericharper there were some changes with how NLP collection is organized (and I think, for now, we are done with changes there).
Could you please rebase your PR on the latest master?

…ded text_classification_with_bert notebooks
changed name text_classification to text_classification_with_bert
Delete text_classification_inference.ipynb
Delete text_classification.ipynb
added preproc text data layer and dataset and text classification notebook
changed name text_classification to text_classification_with_bert
renamed text_classification_notebook
renamed text_classification_inference
add use_cache flag to BertTextClassificationDataset and BertTextClassificationDatalayer and update text_classification_with_bert script and notebooks
removed preproc_text_data.py

Signed-off-by: ericharper <complex451@gmail.com>

ericharper · 2020-02-22T20:29:15Z

Rebased to master.

Signed-off-by: ericharper <complex451@gmail.com>

okuchaiev · 2020-02-24T18:32:20Z

looks fine to me, @ekmb can you please have a look too? @ericharper is this still WIP?

ericharper · 2020-02-24T18:33:33Z

finished on my end, but happy to make more changes if needed

nemo/collections/nlp/data/datasets/text_classification_dataset.py

ekmb · 2020-02-25T01:30:23Z

@ericharper Thank you for your contribution!

ekmb changed the title ~~Add notebook for text classification with BERT and text classification tokenization caching~~ [WIP]Add notebook for text classification with BERT and text classification tokenization caching Feb 19, 2020

ekmb self-requested a review February 19, 2020 17:11

ran python setup.py style --fix

381321d

Signed-off-by: ericharper <complex451@gmail.com>

ekmb reviewed Feb 24, 2020

nemo/collections/nlp/data/datasets/text_classification_dataset.py

ekmb changed the title ~~[WIP]Add notebook for text classification with BERT and text classification tokenization caching~~ Add notebook for text classification with BERT and text classification tokenization caching Feb 25, 2020

ekmb approved these changes Feb 25, 2020

ekmb merged commit b53e349 into NVIDIA:master Feb 25, 2020
