
Add fast tokenizer for BARTpho #17254

Closed
wants to merge 70 commits into from

Conversation

datquocnguyen
Contributor

This PR is to add a "fast" BARTpho tokenizer (backed by HuggingFace's tokenizers library).
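For illustration, a minimal usage sketch (assuming the fast tokenizer files are published alongside the existing vinai/bartpho-syllable checkpoint; this snippet is not part of the PR diff):

from transformers import AutoTokenizer

# Request the fast (Rust-backed) tokenizer; transformers falls back to the
# slow Python tokenizer when no fast implementation is available.
tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable", use_fast=True)
print(tokenizer.is_fast)  # True once the fast BARTpho tokenizer can be loaded
print(tokenizer("Chúng tôi là những nghiên cứu viên.")["input_ids"])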

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@datquocnguyen
Contributor Author

Following #13788, I have now added a "fast" version of the BartphoTokenizer.
@sgugger, @LysandreJik, @patil-suraj, @SaulLu and @patrickvonplaten: could you please have a look and provide your feedback? Thanks.

@patil-suraj requested a review from @SaulLu on May 16, 2022 at 12:22
Collaborator

@sgugger left a comment


Thanks a lot for your PR! As mentioned by @patil-suraj already, we don't rely on inheritance in Transformers: each object should be fully defined in its own configuration/modeling/tokenizer file (there are some instances of subclasses for older models, but this will be cleaned up in the future).

So you should revert your changes in the slow tokenizer file so it does not inherit from XLMRobertaTokenizer, and make the fast version independent of XLM-RoBERTa as well.
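For context, a rough sketch of the requested structure (the attribute values, constructor signature and file names here are illustrative assumptions, not this PR's actual diff): the fast tokenizer is defined directly on top of PreTrainedTokenizerFast rather than subclassing XLMRobertaTokenizerFast.

from transformers import PreTrainedTokenizerFast

class BartphoTokenizerFast(PreTrainedTokenizerFast):
    """Self-contained fast tokenizer: no inheritance from XLMRobertaTokenizerFast."""

    # Assumed file names for the sentencepiece model and the serialized fast tokenizer.
    vocab_files_names = {"vocab_file": "sentencepiece.bpe.model", "tokenizer_file": "tokenizer.json"}
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(self, vocab_file=None, tokenizer_file=None, bos_token="<s>", eos_token="</s>",
                 sep_token="</s>", cls_token="<s>", unk_token="<unk>", pad_token="<pad>",
                 mask_token="<mask>", **kwargs):
        # All special tokens are declared here rather than inherited from XLM-RoBERTa.
        super().__init__(
            vocab_file,
            tokenizer_file=tokenizer_file,
            bos_token=bos_token,
            eos_token=eos_token,
            sep_token=sep_token,
            cls_token=cls_token,
            unk_token=unk_token,
            pad_token=pad_token,
            mask_token=mask_token,
            **kwargs,
        )
        self.vocab_file = vocab_file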

src/transformers/models/bartpho/tokenization_bartpho.py (two review threads, resolved)
@datquocnguyen
Contributor Author

Hi @patil-suraj and @sgugger, I have revised the slow and fast BartphoTokenizer variants to satisfy your requirements.
Please have a look and give feedback. Thanks.
cc: @SaulLu @LysandreJik

Collaborator

@sgugger left a comment


Thanks, but it looks like the changes in the slow tokenizer are breaking, which we can't really do.

datquocnguyen and others added 2 commits May 17, 2022 21:43
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Contributor

@SaulLu left a comment


Thank you very much for your contribution!

I think I personally lack context on what motivated the changes to the Python version of BartphoTokenizer. In particular, I understand that you changed the spm model uploaded to the vinai/bartpho-syllable repo on the Hub (before it had a vocabulary of 250,000 tokens and now it has one of 40,003 tokens).

Additionally, those changes are breaking for the slow tokenizer, and we generally try to avoid breaking changes in transformers (cc @LysandreJik, @sgugger and @patil-suraj) 😄

src/transformers/models/bartpho/tokenization_bartpho.py (two review threads, resolved)
src/transformers/convert_slow_tokenizer.py (one review thread, resolved)
@datquocnguyen
Contributor Author

Please note that the failing checks are caused by test_modeling_wav2vec2_conformer.py and are not related to our BartphoTokenizer. @SaulLu

@patrickvonplaten
Contributor

Please note that the failing checks are caused by test_modeling_wav2vec2_conformer.py and are not related to our BartphoTokenizer. @SaulLu

@SaulLu fixed the wav2vec2_conformer tests on master

@sgugger
Collaborator

sgugger commented May 18, 2022

@datquocnguyen We can't merge anything that has any breaking change on the existing tokenizer, as I said before.

@datquocnguyen
Contributor Author

@sgugger Ah, I now see your point. I initially thought the code would be much nicer if I also pushed a new version of the slow tokenizer, but that breaks existing code.

In any case, I think the fast tokenizer will work without changing the original code of the slow tokenizer (as I have already developed the fast_tokenizer_file). I just need a bit of time to roll the slow tokenizer back to its original version.

(cc @SaulLu , @LysandreJik , @patil-suraj and @patrickvonplaten )

@datquocnguyen changed the title from "Add BartphoTokenizerFast" to "Add fast tokenizers for BARTpho, PhoBERT and BERTweet" on May 22, 2022
…BERT_BERTweet

Update test_tokenization_bartpho.py
…BERT_BERTweet

Merge pull request #29 from huggingface/main
Update the latest commits
@datquocnguyen
Contributor Author

Hi @SaulLu @LysandreJik, I am wondering about the status/progress of the "sharing a custom tokenizer" feature on the Hub. Is there anything I can help with? This feature would make it easier to use BERTweet, PhoBERT, BARTpho and the like with their custom fast tokenizers. Thank you.

@LysandreJik
Member

The custom tokenizer should now work correctly! @ArthurZucker, if you have a spare cycle, could you look into supporting the tokenizers added here by @datquocnguyen with code on the hub using the custom tokenizers?

A guide showing how to do so is available here. Thanks!

@datquocnguyen
Contributor Author

Hi @LysandreJik @ArthurZucker @SaulLu, I followed the guide and can confirm that it works. For example, the following piece of code correctly loads a BertTweetTokenizerFast fast tokenizer:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-uncased", trust_remote_code=True, revision="ddfcf0409600519d6f8907531a65151f39be5c01")
print(tokenizer.__class__.__name__)

The current issue is that the example scripts do not yet include the trust_remote_code option, so they produce errors. For example:

Traceback (most recent call last):
  File "run_ner.py", line 630, in <module>
    main()
  File "run_ner.py", line 358, in main
    add_prefix_space=True,
  File "/home/sonla/workspace/transformers/src/transformers/models/auto/tokenization_auto.py", line 587, in from_pretrained
    f"Loading {pretrained_model_name_or_path} requires you to execute the tokenizer file in that"
ValueError: Loading /home/sonla/workspace/BERTweet/bertweet-covid19-base-uncased requires you to execute the tokenizer file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.

To handle this issue, I had to modify run_ner.py to include the trust_remote_code option and pass it to the tokenizer loading call. The modified run_ner.py then runs properly, as before.

I am wondering whether there is a faster approach to handling this issue without modifying each of the examples? Thanks.
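For reference, a sketch of that kind of modification to the tokenizer-loading call in run_ner.py (model_args comes from the script's existing ModelArguments; hard-coding trust_remote_code=True here is only for illustration, and the surrounding code in run_ner.py may differ slightly):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=True,
    revision=model_args.model_revision,
    add_prefix_space=True,
    trust_remote_code=True,  # added so custom tokenizer code on the Hub can be loaded
)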

@LysandreJik
Member

Oh great question @datquocnguyen, and thanks for taking care of the implementation! Really cool to see it works well.

@sgugger, what do you think regarding the examples? Should we add a TrainingArgument to enable specifying models with remote code? WDYT?

@sgugger
Collaborator

sgugger commented Oct 20, 2022

It should be one of the ModelArguments defined in the example (where the rest of the args, like revision etc., live), but yes, I don't see why not!

@datquocnguyen
Contributor Author

Shouldn't the ModelArguments have separate trust_remote_code_model and trust_remote_code_tokenizer options for model and tokenizer loading, respectively? For example:

tokenizer_name_or_path = model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path
if config.model_type in {"bloom", "gpt2", "roberta"}:
    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_name_or_path,
        cache_dir=model_args.cache_dir,
        use_fast=True,
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code_tokenizer,  # for the tokenizer
        use_auth_token=True if model_args.use_auth_token else None,
        add_prefix_space=True,
    )
else:
    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_name_or_path,
        cache_dir=model_args.cache_dir,
        use_fast=True,
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code_tokenizer,  # for the tokenizer
        use_auth_token=True if model_args.use_auth_token else None,
    )

model = AutoModelForTokenClassification.from_pretrained(
    model_args.model_name_or_path,
    from_tf=bool(".ckpt" in model_args.model_name_or_path),
    config=config,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
    trust_remote_code=model_args.trust_remote_code_model,  # for the model
    use_auth_token=True if model_args.use_auth_token else None,
    ignore_mismatched_sizes=model_args.ignore_mismatched_sizes,
)

@sgugger
Collaborator

sgugger commented Oct 24, 2022

No, one is enough. Users that want more fine-grained control can just modify the examples to suit their needs.
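A minimal sketch of what that single option could look like in the example's ModelArguments dataclass (the field name and help text are assumptions here, not necessarily the version that eventually shipped):

from dataclasses import dataclass, field

@dataclass
class ModelArguments:
    # ... existing fields such as model_name_or_path, model_revision, cache_dir ...
    trust_remote_code: bool = field(
        default=False,
        metadata={
            "help": "Allow custom model/tokenizer code defined on the Hub to be executed locally. "
            "Only enable this for repositories you trust."
        },
    )

The same flag would then be passed as trust_remote_code=model_args.trust_remote_code to both the AutoTokenizer.from_pretrained and AutoModelForTokenClassification.from_pretrained calls.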

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


7 participants