Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix single letter stop strings #31448

Merged
merged 5 commits into from
Jun 18, 2024
Merged

Conversation

Rocketknight1
Copy link
Member

@Rocketknight1 Rocketknight1 commented Jun 17, 2024

Our StopStringCriteria has a bug when all stop strings are one letter long. This results in a case where no stop strings have any valid_positions, as this variable only tracks internal positions, and this causes a max() to fail later.

This code fixes the issue by ensuring that max() takes a minimum value of 1. It also adds a test for single-letter stop strings!

Fixes #31435

cc @amyeroberts @zucchini-nlp @gante

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts
Copy link
Collaborator

Our StopStringCriteria has a bug when all stop strings are one letter long. This results in a case where no stop strings have any valid_positions, as this variable only tracks internal positions, and this causes a max() to fail later.

@Rocketknight1 Could you explain this a bit more? Why does this result in no stop strings having any valid_positions? i.e. what does "this variable only tracks internal positions" mean?

Copy link
Member

@gante gante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

src/transformers/generation/stopping_criteria.py Outdated Show resolved Hide resolved
tests/generation/test_stopping_criteria.py Outdated Show resolved Hide resolved
@Rocketknight1
Copy link
Member Author

@amyeroberts Sorry for the confusion! The way this works is that we precompute two things for every (token, stop_string) pair: The length of the token's overlap with the end of the stop string, which we use to begin matches, and the positions inside the stop string where the token can match that don't overlap with the end.

The max() in the code was implicitly assuming that at least one (token, stop_string) pair would have a valid non-end position (a position inside the string that the token matches that does not overlap with the end of the stop string), but there are cases when this isn't true, and the most common one is when all the stop strings are a single character, which can only have end-matches.

Copy link

@lapp0 lapp0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for quickly addressing!

Code changes look good, but I cannot test due to related bug:

E               ValueError: There are one or more stop strings, either in the arguments to `generate` or in the model's generation config, but we could not locate a tokenizer. When generating with stop strings, you must pass the model's tokenizer to the `tokenizer` argument of `generate`.

Can we get rid of the second tokenizer = kwargs.pop("tokenizer", None) as well in this PR?

https://github.com/huggingface/transformers/blob/eed9ed67987/src/transformers/generation/utils.py#L1643

@gante
Copy link
Member

gante commented Jun 17, 2024

@lapp0 that issue was fixed today (#31427), if you rebase on main and apply this PR on top of it it should work 🤗

Copy link

@lapp0 lapp0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @Rocketknight1 LGTM

Thanks so much for the tokenizer popping fix @gante

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing!

tests/generation/test_stopping_criteria.py Outdated Show resolved Hide resolved
@Rocketknight1 Rocketknight1 force-pushed the fix_single_letter_stop_strings branch from fe4bcc9 to 2c27f3d Compare June 18, 2024 12:53
@Rocketknight1 Rocketknight1 merged commit 28316d0 into main Jun 18, 2024
22 checks passed
@Rocketknight1 Rocketknight1 deleted the fix_single_letter_stop_strings branch June 18, 2024 13:07
itazap pushed a commit that referenced this pull request Jun 18, 2024
* Fix single letter stop strings

* Change the 0 to a 1 to avoid potential empty vector headaches later

* Restructure for clarity

* Update tests/generation/test_stopping_criteria.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Add the unsqueeze

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
itazap pushed a commit that referenced this pull request Jun 20, 2024
* Fix single letter stop strings

* Change the 0 to a 1 to avoid potential empty vector headaches later

* Restructure for clarity

* Update tests/generation/test_stopping_criteria.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Add the unsqueeze

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants