Add notebook for text classification with BERT and text classification tokenization caching #382

Merged 2 commits on Feb 25, 2020

Conversation

ericharper
Collaborator

Add notebooks detailing text classification with BERT using NeMo.

text_classification_with_bert.ipynb

text_classification_with_bert_inference.ipynb

We also add a script, preproc_text_data.py, that converts the text classification tokenization to HDF5.

We then modify the text classification dataset and the text classification data layer so that each also has a preproc version.

The text classification preproc data layer is similar to the BERT pretraining preproc data layer.

Caching the tokenization made training more efficient: we saw shorter training times and better scaling across multiple GPUs.
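
A minimal sketch of the caching idea, not the NeMo implementation: tokenize once, write the result to disk, and reload it on later runs instead of tokenizing again. The whitespace tokenizer, the `load_or_tokenize` helper, and the use of `pickle` in place of HDF5 are all assumptions made to keep the example dependency-free.

```python
import hashlib
import os
import pickle
import tempfile


def toy_tokenize(text):
    """Stand-in tokenizer (assumption): lowercase whitespace split."""
    return text.lower().split()


def load_or_tokenize(texts, cache_dir, tokenize=toy_tokenize):
    """Return (tokenized_texts, cache_hit), caching on disk by content hash."""
    # Key the cache on the raw text so that changed data gets a new cache file.
    digest = hashlib.sha256("\n".join(texts).encode("utf-8")).hexdigest()
    cache_path = os.path.join(cache_dir, f"tokens_{digest[:16]}.pkl")

    # Cache hit: skip tokenization entirely.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f), True

    # Cache miss: tokenize once and store the result for later runs.
    tokens = [tokenize(t) for t in texts]
    with open(cache_path, "wb") as f:
        pickle.dump(tokens, f)
    return tokens, False


cache_dir = tempfile.mkdtemp()
texts = ["A charming film", "Dull and overlong"]
first, first_hit = load_or_tokenize(texts, cache_dir)     # tokenizes and writes
second, second_hit = load_or_tokenize(texts, cache_dir)   # reads from the cache
```

In a multi-GPU run, one process would typically build the cache while the others wait for the file to exist; that coordination is left out of this sketch.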

The notebooks also visualize the BERT embeddings before and after fine-tuning by applying t-SNE.

The inference notebook lets the user see how well the fine-tuned model classifies their data.

We also include a script to download the SST-2 dataset for use in the notebook.

We changed the text classification dataset to use the user-specified maximum sequence length.
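
For illustration, here is one common way a maximum sequence length is applied to BERT inputs: truncate the token IDs, add the special tokens, then pad. This is a sketch under assumptions, not the code in this PR. The `pad_or_truncate` name is hypothetical, and the default IDs (101, 102, 0) assume the `bert-base-uncased` vocabulary.

```python
def pad_or_truncate(token_ids, max_seq_length, cls_id=101, sep_id=102, pad_id=0):
    """Fit token_ids into exactly max_seq_length positions.

    The special-token IDs are assumptions (bert-base-uncased values).
    Returns (input_ids, attention_mask), both of length max_seq_length.
    """
    if max_seq_length < 2:
        raise ValueError("max_seq_length must leave room for [CLS] and [SEP]")

    # Reserve two positions for [CLS] and [SEP], then truncate the rest.
    body = token_ids[: max_seq_length - 2]
    input_ids = [cls_id] + body + [sep_id]

    # Real tokens get mask 1; padding positions get mask 0.
    attention_mask = [1] * len(input_ids)
    padding = max_seq_length - len(input_ids)
    input_ids = input_ids + [pad_id] * padding
    attention_mask = attention_mask + [0] * padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([7, 8, 9], max_seq_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1010)), max_seq_length=6)
```

A shorter maximum length reduces memory use and compute per batch, at the cost of dropping tokens from long inputs.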

@lgtm-com

lgtm-com bot commented Feb 19, 2020

This pull request introduces 3 alerts when merging 0bc3c7a into abaac1b - view on LGTM.com

new alerts:

  • 3 for Unused import

@okuchaiev
Member

Can you please sign your commits first? The DCO bot's output shows how to do it: https://github.com/NVIDIA/NeMo/pull/382/checks?check_run_id=454165747

@ericharper
Collaborator Author

Okay, I was able to sign off on the commits.

@ekmb changed the title from "Add notebook for text classification with BERT and text classification tokenization caching" to "[WIP]Add notebook for text classification with BERT and text classification tokenization caching" on Feb 19, 2020
@ekmb self-requested a review on February 19, 2020 17:11
@lgtm-com

lgtm-com bot commented Feb 19, 2020

This pull request introduces 3 alerts when merging 3e2a987 into abaac1b - view on LGTM.com

new alerts:

  • 3 for Unused import

@okuchaiev
Member

@ericharper there were some changes to how the NLP collection is organized (and I think, for now, we are done with changes there).
Could you please rebase your PR on the latest master?

…ded text_classification_with_bert notebooks

Signed-off-by: ericharper <complex451@gmail.com>

changed name text_classification to text_classification_with_bert

Signed-off-by: ericharper <complex451@gmail.com>

Delete text_classification_inference.ipynb

Signed-off-by: ericharper <complex451@gmail.com>

Delete text_classification.ipynb

Signed-off-by: ericharper <complex451@gmail.com>

added preproc text data layer and dataset and text classification notebook

changed name text_classification to text_classification_with_bert

Signed-off-by: ericharper <complex451@gmail.com>

renamed text_classification_notebook

Signed-off-by: ericharper <complex451@gmail.com>

renamed text_classification_inference

Signed-off-by: ericharper <complex451@gmail.com>

add use_cache flag to BertTextClassificationDataset and BertTextClassificationDatalayer and update text_classification_with_bert script and notebooks

Signed-off-by: ericharper <complex451@gmail.com>

removed preproc_text_data.py

Signed-off-by: ericharper <complex451@gmail.com>
@ericharper
Collaborator Author

Rebased to master.

Signed-off-by: ericharper <complex451@gmail.com>
@okuchaiev
Member

Looks fine to me. @ekmb, can you please have a look too? @ericharper, is this still WIP?

@ericharper
Collaborator Author

finished on my end, but happy to make more changes if needed

@ekmb changed the title from "[WIP]Add notebook for text classification with BERT and text classification tokenization caching" to "Add notebook for text classification with BERT and text classification tokenization caching" on Feb 25, 2020
@ekmb merged commit b53e349 into NVIDIA:master on Feb 25, 2020
@ekmb
Collaborator

ekmb commented Feb 25, 2020

@ericharper Thank you for your contribution!

dcurran90 pushed a commit to dcurran90/NeMo that referenced this pull request Oct 15, 2024
* combine lab-troubleshooting.md and output optimization doc

Signed-off-by: jaideepr97 <jaideep.r97@gmail.com>

* Update README.md

Co-authored-by: Kelly Brown <86735520+kelbrown20@users.noreply.github.com>
Signed-off-by: Jaideep Rao <jaideep.r97@gmail.com>

* Update TROUBLESHOOTING.md

Co-authored-by: Kelly Brown <86735520+kelbrown20@users.noreply.github.com>
Signed-off-by: Jaideep Rao <jaideep.r97@gmail.com>

* Update TROUBLESHOOTING.md

Co-authored-by: Kelly Brown <86735520+kelbrown20@users.noreply.github.com>
Signed-off-by: Jaideep Rao <jaideep.r97@gmail.com>

* Update TROUBLESHOOTING.md

Co-authored-by: Kelly Brown <86735520+kelbrown20@users.noreply.github.com>
Signed-off-by: Jaideep Rao <jaideep.r97@gmail.com>

* Update TROUBLESHOOTING.md

Co-authored-by: Kelly Brown <86735520+kelbrown20@users.noreply.github.com>
Signed-off-by: Jaideep Rao <jaideep.r97@gmail.com>

* Update TROUBLESHOOTING.md

Co-authored-by: Kelly Brown <86735520+kelbrown20@users.noreply.github.com>
Signed-off-by: Jaideep Rao <jaideep.r97@gmail.com>

* Update TROUBLESHOOTING.md

Co-authored-by: Kelly Brown <86735520+kelbrown20@users.noreply.github.com>
Signed-off-by: Jaideep Rao <jaideep.r97@gmail.com>

* address review comments

Signed-off-by: jaideepr97 <jaideep.r97@gmail.com>

* Update TROUBLESHOOTING.md

Co-authored-by: Kelly Brown <86735520+kelbrown20@users.noreply.github.com>
Signed-off-by: Jaideep Rao <jaideep.r97@gmail.com>

* Update TROUBLESHOOTING.md

Co-authored-by: Kelly Brown <86735520+kelbrown20@users.noreply.github.com>
Signed-off-by: Jaideep Rao <jaideep.r97@gmail.com>

* address review comment

Signed-off-by: jaideepr97 <jaideep.r97@gmail.com>

* update rouge threshold value in example

Signed-off-by: jaideepr97 <jaideep.r97@gmail.com>

---------

Signed-off-by: jaideepr97 <jaideep.r97@gmail.com>
Signed-off-by: Jaideep Rao <jaideep.r97@gmail.com>
Co-authored-by: Kelly Brown <86735520+kelbrown20@users.noreply.github.com>
dcurran90 pushed a commit to dcurran90/NeMo that referenced this pull request Oct 15, 2024
PR NVIDIA#382 moved the content from lab-troubleshoot.md into TROUBLESHOOTING.md
but forgot to delete the lab-troubleshoot.md file. This cleans that up.

Signed-off-by: Ben Browning <bbrownin@redhat.com>

4 participants