
Fix missing spaces in Tokenizer.convert_token_to_string #280

Merged

Conversation

brandonwillard
Contributor

Closes #273

@brandonwillard brandonwillard marked this pull request as ready for review September 15, 2023 20:43
@brandonwillard brandonwillard added the transformers Linked to the `transformers` integration label Sep 15, 2023
@brandonwillard
Contributor Author

This probably isn't a general enough fix, but it should be a workable stand-in for Llama models right now, so I'm going to merge this. @AL-377, still feel free to put in a PR for another approach, especially if you think it will cover more cases.
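
For context, the fix special-cases SentencePiece-style Llama tokens when a single token is converted back to text. A minimal sketch of the approach, assuming the SPIECE_UNDERLINE constant from transformers; the merged code may differ in its details:

from transformers.models.llama.tokenization_llama import SPIECE_UNDERLINE  # "▁"

def convert_token_to_string(self, token: str) -> str:
    string = self.tokenizer.convert_tokens_to_string([token])

    # Llama's SentencePiece vocabulary marks a leading space with "▁",
    # which HF drops when a token is decoded in isolation; re-add it.
    # "<0x20>" is the byte-fallback token for a plain space.
    if token.startswith(SPIECE_UNDERLINE) or token == "<0x20>":
        return " " + string

    return string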

@brandonwillard brandonwillard merged commit 702bbe7 into outlines-dev:main Sep 15, 2023
4 checks passed
@brandonwillard brandonwillard deleted the fix-llama-tokenizer-spaces branch September 15, 2023 20:51
@AL-377
Contributor

AL-377 commented Sep 16, 2023

OK, thanks for your effort! I have tried the CodeLlama model and the white space is still there; I will keep trying.

@brandonwillard
Contributor Author

> OK, thanks for your effort! I have tried the CodeLlama model and the white space is still there; I will keep trying.

Which HF tokenizer class is it using?

@AL-377
Contributor

AL-377 commented Sep 16, 2023

> Which HF tokenizer class is it using?

transformers.models.code_llama.tokenization_code_llama_fast.CodeLlamaTokenizerFast

@brandonwillard
Contributor Author

> transformers.models.code_llama.tokenization_code_llama_fast.CodeLlamaTokenizerFast

Thanks! We just need to add CodeLlamaTokenizerFast to the isinstance check in this PR.

@AL-377
Contributor

AL-377 commented Sep 16, 2023

> Thanks! We just need to add CodeLlamaTokenizerFast to the isinstance check in this PR.

Actually, I have already tried that change:

# Imports needed for the added CodeLlama classes:
from transformers import (
    CodeLlamaTokenizer,
    CodeLlamaTokenizerFast,
    LlamaTokenizer,
    LlamaTokenizerFast,
)

self.is_sentencepiece = isinstance(
    self.tokenizer,
    (LlamaTokenizer, LlamaTokenizerFast, CodeLlamaTokenizer, CodeLlamaTokenizerFast),
)

And I got the error below:
/outlines/text/generate/regex.py:118 in create_proposal

  115
  116     sequence = self.model.tokenizer.decode(readable_tokens)
  117
❱ 118     ((_, state_seq),) = find_partial_matches(
  119         self.regex_fsm,
  120         "".join(sequence),
  121         start_state=last_fsm_state,
ValueError: not enough values to unpack (expected 1, got 0)

My test.py is just the same as the example at https://github.com/outlines-dev/outlines#efficient-json-generation-following-a-pydantic-model
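
For reference, a condensed, paraphrased sketch of that README example under outlines' API at the time (models.transformers plus text.generate.json); the schema fields are abbreviated here and the checkpoint name is illustrative:

from pydantic import BaseModel, constr

import outlines.models as models
import outlines.text.generate as generate

class Character(BaseModel):
    name: constr(max_length=10)
    age: int

# Illustrative checkpoint, standing in for the model used in the test;
# the README example itself uses a different one.
model = models.transformers("codellama/CodeLlama-7b-hf")
sequence = generate.json(model, Character)("Give me a character description")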

@brandonwillard
Contributor Author

> ValueError: not enough values to unpack (expected 1, got 0)

Yeah, that looks like a different issue with the Llama models altogether. I'll check it out.

@brandonwillard
Contributor Author

brandonwillard commented Sep 16, 2023

That issue is due to tokens that correspond to empty strings (see #273 (comment)). The general issue of empty strings and our regex sampling logic might go beyond this Llama-specific tokenization issue, though. I address it directly in #272, which replaces and simplifies all the regex logic.
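
To make the failure mode concrete, here is a hypothetical, self-contained illustration (not the library's code): tokens that decode to empty strings leave nothing for the partial-match search to consume, so it yields no (match, state-sequence) pairs, and the single-result unpacking at regex.py:118 raises exactly the error above.

# Stand-in for find_partial_matches(...) when every readable token
# decodes to "" and the joined candidate string is empty:
def find_matches(string):
    return ()

matches = find_matches("".join(["", ""]))  # no partial matches

# The call site assumes exactly one (match, state_seq) result, so
# unpacking an empty sequence reproduces the traceback:
((_, state_seq),) = matches
# ValueError: not enough values to unpack (expected 1, got 0)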

Labels
bug, transformers (Linked to the `transformers` integration)

Successfully merging this pull request may close these issues:
Llama Models decoding produces white spaces between characters