[Tutorial] Token classification tutorial for USPTO claims text with HF AutoTrain #5375

bikash119 · 2024-08-02T16:58:14Z

Description

A tutorial on how to annotate data with Argilla and then use the annotated dataset to train a model with Hugging Face AutoTrain.

Closes #<issue_number>

Type of change

Documentation update

How Has This Been Tested

I added relevant documentation
I followed the style guidelines of this project
I did a self-review of my code
I made corresponding changes to the documentation

review-notebook-app · 2024-08-02T16:58:19Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

for more information, see https://pre-commit.ci

davidberenstein1957

Hi, thanks for this PR. It looks very advanced. I believe the text is still split up by individual letters. Would you be able to fix that?
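The thread does not say what caused the letter-by-letter text. Purely as an illustration, one common way such output arises in Python is joining a plain string with spaces, which iterates over its characters rather than its words. The helper name `join_tokens` and the sample text are hypothetical and not taken from the notebook.

```python
from typing import Iterable


def join_tokens(tokens: Iterable[str]) -> str:
    """Join tokens with single spaces.

    If a plain string is passed instead of a list of tokens, Python
    iterates over its characters, so every letter ends up separated.
    """
    return " ".join(tokens)


# Passing the raw string: each character is treated as a token.
spaced = join_tokens("claim")

# Passing a list of words: the text stays intact.
intact = join_tokens("A claim".split())

print(spaced)
print(intact)
```

Checking whether a value is a `str` or a list of tokens before joining is usually enough to avoid this kind of display problem.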

bikash119 · 2024-08-05T17:11:20Z

Thank you @davidberenstein1957 for the review comments. I have modified the images and reran the notebook to display updated images.

davidberenstein1957 · 2024-08-07T07:24:12Z

Hi @bikash119, could you also add an overview of how you can run inference with the model and log the results back into Argilla?

bikash119 · 2024-08-08T07:05:34Z

Thank you for the suggestion, @davidberenstein1957. I have updated the notebook to generate predictions and push them back to the Argilla dataset. Please share your feedback.
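As a rough sketch of the step described above, and not the notebook's actual code, the snippet below converts entity dictionaries of the kind the transformers token-classification pipeline returns with an aggregation strategy (keys `entity_group`, `score`, `word`, `start`, `end`) into simple span dictionaries. The function name `to_span_values`, the `COMPONENT` label, the sample text, and the scores are all hypothetical. The `start`/`end`/`label` output shape is an assumption about what an Argilla span question accepts and should be checked against the SDK version in use.

```python
from typing import Any, Dict, List


def to_span_values(
    entities: List[Dict[str, Any]], min_score: float = 0.0
) -> List[Dict[str, Any]]:
    """Turn aggregated pipeline entities into start/end/label span dicts.

    Entities scoring below ``min_score`` are dropped; the rest are
    returned sorted by their start offset.
    """
    spans = [
        {
            "start": int(entity["start"]),
            "end": int(entity["end"]),
            "label": entity["entity_group"],
        }
        for entity in entities
        if entity["score"] >= min_score
    ]
    return sorted(spans, key=lambda span: span["start"])


# Hypothetical text and model output, hand-written for illustration.
text = "A device comprising a sensor coupled to a housing."
entities = [
    {"entity_group": "COMPONENT", "score": 0.98, "word": "sensor", "start": 22, "end": 28},
    {"entity_group": "COMPONENT", "score": 0.41, "word": "housing", "start": 42, "end": 49},
]

# Keep only confident predictions before logging them anywhere.
spans = to_span_values(entities, min_score=0.5)
print(spans)
```

Filtering by score before logging keeps low-confidence predictions out of the annotation view; the threshold of 0.5 here is arbitrary.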

davidberenstein1957 · 2024-08-08T08:01:21Z

Hi @bikash119, took some time to review again.

"Create a Dataset with Argilla Python SDK" does not contain an organized imports section. Also, we might run all installs and organize all of the imports at the top of the notebook separately :)
Sometimes there is a bit too much output being printed, for example the "ERROR: pip's dependency resolver does " and (None, None), which makes it a bit messy.
I think we can also reduce some clutter by not having too much commented-out code, for example the flow to switch from a dataset to a list, and/or comments like "# Load model directly".
You don't need a print statement to display a value at the end of a notebook cell.
Sometimes you seem to use an indent with 4 spaces, and sometimes only 2.
"Step 7: Push dataset to Huggingface Hub" pushes the dataset and loads it again directly after, which seems redundant. It also reprints the exact same data obtained before, so I think we can simplify that a bit.
It would be nice to add some type hinting to functions.
Instead of from transformers import AutoTokenizer, AutoModelForTokenClassification, we might directly use the pipeline through from transformers import pipeline and load the model through there.
Some sections like "Using AutoTrain UI" and "Model Fine-tuning using AutoTrain" are very nicely documented, but others might have some more context added, not too much but just a bit to guide the story :)

Overall it is looking very nice! When we are done, we can post the blog on https://huggingface.co/blog and socials, and add a reference to it from our docs.

bikash119 · 2024-08-09T05:14:05Z

Hi @davidberenstein1957,
Please let me know your feedback.

> "Create a Dataset with Argilla Python SDK" does not contain an organized imports section. Also, we might run all installs and organize all of the imports at the top of the notebook separately :)
> Sometimes there is a bit too much output being printed, for example the "ERROR: pip's dependency resolver does " and (None, None), which makes it a bit messy.
> I think we can also reduce some clutter by not having too much commented-out code, for example the flow to switch from a dataset to a list, and/or comments like "# Load model directly".
> You don't need a print statement to display a value at the end of a notebook cell.

Added a DEBUG flag for the print statements. Let me know if this looks good; otherwise I will remove them. I wanted to keep them so that readers can understand what each step is intended to do.

> Sometimes you seem to use an indent with 4 spaces, and sometimes only 2.

Used 4 spaces consistently.

> "Step 7: Push dataset to Huggingface Hub" pushes the dataset and loads it again directly after, which seems redundant. It also reprints the exact same data obtained before, so I think we can simplify that a bit.

> It would be nice to add some type hinting to functions.

Added docstrings for most of the functions.

> Instead of from transformers import AutoTokenizer, AutoModelForTokenClassification, we might directly use the pipeline through from transformers import pipeline and load the model through there.

> Some sections like "Using AutoTrain UI" and "Model Fine-tuning using AutoTrain" are very nicely documented, but others might have some more context added, not too much but just a bit to guide the story :)

Could you please help me with some pointers? I will add them. I feel I can add a few points for Argilla, but I am unable to come up with pointers to get started.

Thank you @davidberenstein1957 for the encouragement and guidance. I have learnt a lot in the process.

bikash119 · 2024-08-13T15:24:14Z

Hi @davidberenstein1957, as we discussed during our meeting, I added context to:

the configure dataset step
the need for filter queries
the inference step
inserting predicted data into the Argilla dataset

I hope this aligns with our discussion points.

davidberenstein1957 · 2024-08-16T12:44:59Z

Hi @bikash119, the text looks nice.

I would not use the DEBUG statements everywhere, but just print the outputs in the cells where you feel that is needed. Also, you don't need a print statement when you want to output a variable at the end of a cell. You can simply remove it.

print(my_variable)  # explicit print: always shown

my_other_variable  # bare expression that is not the last line: not shown
my_variable  # bare expression on the last line of the cell: shown
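The behaviour in the snippet above comes from Jupyter (through IPython) displaying, by default, only the value of the final expression in a cell. The sketch below mimics that rule with the standard library `ast` module. `run_cell` is a hypothetical helper written for illustration and is not how IPython is actually implemented.

```python
import ast
from typing import Any, Dict, Optional


def run_cell(source: str, namespace: Optional[Dict[str, Any]] = None) -> Any:
    """Run a cell and return the value of its last expression, if any.

    All statements are executed, but only a trailing bare expression
    produces a result, mirroring the default notebook display rule.
    """
    namespace = {} if namespace is None else namespace
    tree = ast.parse(source)
    if not tree.body:
        return None

    *leading, last = tree.body
    # Execute everything before the last statement; values are discarded.
    exec(compile(ast.Module(body=leading, type_ignores=[]), "<cell>", "exec"), namespace)

    if isinstance(last, ast.Expr):
        # A trailing bare expression is evaluated and its value returned.
        return eval(compile(ast.Expression(body=last.value), "<cell>", "eval"), namespace)

    # A trailing statement (e.g. an assignment) is executed with no result.
    exec(compile(ast.Module(body=[last], type_ignores=[]), "<cell>", "exec"), namespace)
    return None


cell = "my_variable = 1\nmy_other_variable = 2\nmy_other_variable\nmy_variable"
result = run_cell(cell)  # only the last line's value comes back
print(result)
```

A cell that ends in an assignment returns nothing here, which matches why such a cell shows no output in a notebook.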

davidberenstein1957 · 2024-08-19T08:18:46Z

We don't need to update the Dockerfile anymore:

> Update the Dockerfile:
> Go to https://huggingface.co/spaces///blob/main/Dockerfile
> Change FROM argilla/argilla-quickstart:v1.29.0 to FROM argilla/argilla-quickstart:v2.0.0rc2

In general a redirect to https://docs.argilla.io/dev/getting_started/how-to-configure-argilla-on-huggingface/ might also be nice.

bikash119 · 2024-08-26T14:58:00Z

Hi @davidberenstein1957,
I hope you're doing well! Just a friendly reminder that this PR is waiting for your review. Your input is valuable here, and I'd love to hear your thoughts. Let me know if you have any questions or need more context. Thanks!

davidberenstein1957

Hi, I think the blog looks great. Would you be able to request to join this organization: https://huggingface.co/blog-explorers? We can then copy the blog over to https://huggingface.co/blog and publish it there :)

bikash119 · 2024-08-27T16:24:03Z

Thanks @davidberenstein1957. Request submitted. I will wait for acceptance and get back to you.

davidberenstein1957 · 2024-10-17T07:24:11Z

@bikash119, closing this because it was published here: https://huggingface.co/blog/bikashpatra/legal-data-token-classification-fine-tuning

token classification tutorial for USPTO claims text with HF AutoTrain

6ddfc56

[pre-commit.ci] auto fixes from pre-commit.com hooks

f9a45c2

for more information, see https://pre-commit.ci

bikash119 changed the title ~~token classification tutorial for USPTO claims text with HF AutoTrain~~ [Tutorial] Token classification tutorial for USPTO claims text with HF AutoTrain Aug 5, 2024

davidberenstein1957 reviewed Aug 5, 2024

View reviewed changes

bikash119 and others added 3 commits August 5, 2024 21:48

Updated images / screenshots of the text to be annotated to remove sp…

68efce5

…ace between characters in a word

Executed the notebook to display the updated images

edf62a5

[pre-commit.ci] auto fixes from pre-commit.com hooks

d6f2b04

for more information, see https://pre-commit.ci

bikash119 and others added 2 commits August 8, 2024 12:25

Use predictions from the fine tuned model to perform prediction on pe…

198e101

…nding records and pushing the records back to Argilla

[pre-commit.ci] auto fixes from pre-commit.com hooks

7cd27d8

for more information, see https://pre-commit.ci

bikash119 referenced this pull request in bikash119/argilla_autotrain Aug 9, 2024

Modified the notebook with comments from @david : https://github.com/…

b6b9682

…argilla-io/argilla/pull/5375\#issuecomment-2275196408

bikash119 and others added 2 commits August 9, 2024 10:27

Modified the notebook based on review feedback from @davidberenstein1957

ffa0102

: https://github.com/argilla-io/argilla/pull/5375\#issuecomment-2275196408

[pre-commit.ci] auto fixes from pre-commit.com hooks

90721a3

for more information, see https://pre-commit.ci

bikash119 and others added 8 commits August 9, 2024 11:09

Uncommnted autotrain-advanced installation using pip

33b3dcc

[pre-commit.ci] auto fixes from pre-commit.com hooks

5a76398

for more information, see https://pre-commit.ci

updated the inference code to use transformers pipeline api

ec8fb13

[pre-commit.ci] auto fixes from pre-commit.com hooks

553e733

for more information, see https://pre-commit.ci

Added conclusion section

c013226

[pre-commit.ci] auto fixes from pre-commit.com hooks

b6f4b69

for more information, see https://pre-commit.ci

Added a few text cells to brief about steps in subsequent code cells

41d846d

[pre-commit.ci] auto fixes from pre-commit.com hooks

66a4dd0

for more information, see https://pre-commit.ci

Merge branch 'main' into argilla_with_autotrain

069e350

davidberenstein1957 reviewed Aug 27, 2024

View reviewed changes

bikash119 and others added 25 commits September 4, 2024 18:10

Merge branch 'main' into argilla_with_autotrain

25d2c72

Merge remote-tracking branch 'upstream/main' into argilla_with_autotrain

bb46356

notebook converted to markdown

39e2a3a

[pre-commit.ci] auto fixes from pre-commit.com hooks

e2cdc23

for more information, see https://pre-commit.ci

notebook converted to markdown using quarto

0745732

Revert "notebook converted to markdown using quarto"

4df4e77

This reverts commit 0745732.

Merge branch 'argilla_with_autotrain' of github.com:bikash119/argilla…

d1d6eba

… into argilla_with_autotrain

notebook converted to markdown using quarto

d005033

[pre-commit.ci] auto fixes from pre-commit.com hooks

67de7fd

for more information, see https://pre-commit.ci

Update token_classification_tutorial.md

25f1c19

Modified the markdown to get rid of colab style.

updated the notebook to have proper indexing and recreated the markdo…

4cc3e6e

…wn file

[pre-commit.ci] auto fixes from pre-commit.com hooks

1c78072

for more information, see https://pre-commit.ci

Removed colab style

77af323

For some weird reason the colab styles are getting added to the notebook. Will check this later.

added a few left out indexes

8b5b66b

[pre-commit.ci] auto fixes from pre-commit.com hooks

8b7fcbd

for more information, see https://pre-commit.ci

added acknowledgment section

88557ff

Merge branch 'argilla_with_autotrain' of github.com:bikash119/argilla…

558b2b9

… into argilla_with_autotrain

minor fixes

eacfb08

[pre-commit.ci] auto fixes from pre-commit.com hooks

21ed63e

for more information, see https://pre-commit.ci

minor fixes

b89a65d

Merge branch 'argilla_with_autotrain' of github.com:bikash119/argilla…

abc551f

… into argilla_with_autotrain

Update token_classification_tutorial.md

ef223c3

colab style removal

[pre-commit.ci] auto fixes from pre-commit.com hooks

b360896

for more information, see https://pre-commit.ci

updated

0ce72f7

Merge branch 'argilla_with_autotrain' of github.com:bikash119/argilla…

fbf93e1

… into argilla_with_autotrain

davidberenstein1957 closed this Oct 17, 2024
