Add notebook for text classification with BERT and text classification tokenization caching #382
This pull request introduces 3 alerts when merging 0bc3c7a into abaac1b (view on LGTM.com).
Can you please sign your commits first? The DCO bot's output shows how it can be done: https://github.com/NVIDIA/NeMo/pull/382/checks?check_run_id=454165747
Okay, I was able to sign off on the commits.
This pull request introduces 3 alerts when merging 3e2a987 into abaac1b (view on LGTM.com).
@ericharper there were some changes to how the NLP collection is organized (and I think, for now, we are done with changes there).
…ded text_classification_with_bert notebooks Signed-off-by: ericharper <complex451@gmail.com>
changed name text_classification to text_classification_with_bert Signed-off-by: ericharper <complex451@gmail.com>
Delete text_classification_inference.ipynb Signed-off-by: ericharper <complex451@gmail.com>
Delete text_classification.ipynb Signed-off-by: ericharper <complex451@gmail.com>
added preproc text data layer and dataset and text classification notebook changed name text_classification to text_classification_with_bert Signed-off-by: ericharper <complex451@gmail.com>
renamed text_classification_notebook Signed-off-by: ericharper <complex451@gmail.com>
renamed text_classification_inference Signed-off-by: ericharper <complex451@gmail.com>
add use_cache flag to BertTextClassificationDataset and BertTextClassificationDatalayer and update text_classification_with_bert script and notebooks Signed-off-by: ericharper <complex451@gmail.com>
removed preproc_text_data.py Signed-off-by: ericharper <complex451@gmail.com>
Rebased to master.
Signed-off-by: ericharper <complex451@gmail.com>
Looks fine to me. @ekmb can you please have a look too? @ericharper is this still WIP?
Finished on my end, but happy to make more changes if needed.
@ericharper Thank you for your contribution!
Add notebooks detailing text classification with BERT using NeMo.
text_classification_with_bert.ipynb
text_classification_with_bert_inference.ipynb
We also add a script, preproc_text_data.py, that converts the tokenized text classification data to HDF5.
We then modify the text classification dataset and the text classification data layer so that each also has a preproc version.
The text classification preproc data layer is similar to the BERT pretraining preproc data layer.
We found it more efficient to cache the tokenization when training: training times improved, and training scaled better across multiple GPUs.
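The caching idea can be illustrated with a minimal sketch. This is not the code from this pull request: `toy_tokenize` and `load_tokenized` are hypothetical names, the whitespace tokenizer stands in for a real BERT tokenizer, and `pickle` stands in for HDF5 so the example needs only the standard library. The `use_cache` argument mirrors the flag named in the commit messages, but its behaviour here is an assumption.

```python
import hashlib
import os
import pickle
import tempfile


def toy_tokenize(text):
    """Hypothetical stand-in for a BERT tokenizer: lowercase whitespace split."""
    return text.lower().split()


def load_tokenized(sentences, cache_dir, use_cache=True, tokenize=toy_tokenize):
    """Return (tokens, cache_hit), reusing an on-disk cache when allowed."""
    # Key the cache file on the input text so that changed data is re-tokenized.
    digest = hashlib.sha256("\n".join(sentences).encode("utf-8")).hexdigest()
    cache_path = os.path.join(cache_dir, f"tokens_{digest[:16]}.pkl")

    if use_cache and os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f), True  # cache hit: tokenization is skipped

    tokens = [tokenize(s) for s in sentences]
    if use_cache:
        with open(cache_path, "wb") as f:
            pickle.dump(tokens, f)
    return tokens, False  # cache miss: tokenized from scratch


with tempfile.TemporaryDirectory() as tmp:
    data = ["A charming film", "Dull and lifeless"]
    first, hit_first = load_tokenized(data, tmp)    # tokenizes and writes the cache
    second, hit_second = load_tokenized(data, tmp)  # reads the cache back
```

In this sketch the second call returns the same tokens without running the tokenizer, which is the saving the description refers to when the same data is read repeatedly during training.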
The notebooks also visualize the BERT embeddings before and after fine-tuning by applying t-SNE.
The inference notebooks allow the user to see how well the fine-tuned models are classifying their data.
We also include a script to download the SST-2 dataset for use in the notebook.
We changed the text classification dataset to use the user-specified maximum sequence length.
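As a rough illustration of what honouring a user-specified maximum sequence length involves, the sketch below truncates or pads a list of token ids to a fixed length. `fit_to_max_seq_length` is a hypothetical helper, not the dataset code in this pull request; the default ids (101, 102, 0) are the `[CLS]`, `[SEP]`, and `[PAD]` ids of the bert-base-uncased vocabulary, and keeping the head of an over-long sequence is an assumption.

```python
def fit_to_max_seq_length(token_ids, max_seq_length, cls_id=101, sep_id=102, pad_id=0):
    """Truncate or pad token ids to exactly max_seq_length, with an attention mask."""
    if max_seq_length < 2:
        raise ValueError("max_seq_length must leave room for [CLS] and [SEP]")

    # Reserve two positions for the [CLS] and [SEP] special tokens.
    body = token_ids[: max_seq_length - 2]
    input_ids = [cls_id] + body + [sep_id]
    input_mask = [1] * len(input_ids)  # 1 marks real tokens

    # Pad on the right; padded positions get mask value 0.
    padding = max_seq_length - len(input_ids)
    input_ids += [pad_id] * padding
    input_mask += [0] * padding
    return input_ids, input_mask


long_ids, long_mask = fit_to_max_seq_length([7, 8, 9, 10, 11], max_seq_length=6)
short_ids, short_mask = fit_to_max_seq_length([7, 8], max_seq_length=6)
```

Here the five-token input loses its last token to fit in six positions, while the two-token input is padded with two trailing pad ids.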