
Fix missing spaces in Tokenizer.convert_token_to_string #280

Merged

Conversation

brandonwillard
Contributor

Closes #273

@brandonwillard brandonwillard marked this pull request as ready for review September 15, 2023 20:43
@brandonwillard brandonwillard added the transformers Linked to the `transformers` integration label Sep 15, 2023
@brandonwillard
Contributor Author

This probably isn't a general enough fix, but it should be a workable stand-in for Llama models right now, so I'm going to merge this. @AL-377, still feel free to put in a PR for another approach, especially if you think it will cover more cases.
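
For context, the fix special-cases SentencePiece-style Llama tokens when a single token is converted back to text. A minimal sketch of the approach, assuming the SPIECE_UNDERLINE constant from transformers; the merged code may differ in its details:

from transformers.models.llama.tokenization_llama import SPIECE_UNDERLINE  # "▁"

def convert_token_to_string(self, token: str) -> str:
    string = self.tokenizer.convert_tokens_to_string([token])

    # Llama's SentencePiece vocabulary marks a leading space with "▁",
    # which HF drops when a token is decoded in isolation; re-add it.
    # "<0x20>" is the byte-fallback token for a plain space.
    if token.startswith(SPIECE_UNDERLINE) or token == "<0x20>":
        return " " + string

    return string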

@brandonwillard brandonwillard merged commit 702bbe7 into outlines-dev:main Sep 15, 2023
4 checks passed
@brandonwillard brandonwillard deleted the fix-llama-tokenizer-spaces branch September 15, 2023 20:51
@AL-377
Contributor

AL-377 commented Sep 16, 2023

OK, thanks for your effort! I have tried the CodeLlama model and the white space is still there; I will keep trying.

@brandonwillard
Contributor Author

> OK, thanks for your effort! I have tried the CodeLlama model and the white space is still there; I will keep trying.

Which HF tokenizer class is it using?

@AL-377
Contributor

AL-377 commented Sep 16, 2023

> Which HF tokenizer class is it using?

transformers.models.code_llama.tokenization_code_llama_fast.CodeLlamaTokenizerFast

@brandonwillard
Contributor Author

> transformers.models.code_llama.tokenization_code_llama_fast.CodeLlamaTokenizerFast

Thanks! We just need to add CodeLlamaTokenizerFast to the isinstance check in this PR.

@AL-377
Contributor

AL-377 commented Sep 16, 2023

> Thanks! We just need to add CodeLlamaTokenizerFast to the isinstance check in this PR.

Actually, I have already tried that change:

# Imports needed for the added CodeLlama classes:
from transformers import (
    CodeLlamaTokenizer,
    CodeLlamaTokenizerFast,
    LlamaTokenizer,
    LlamaTokenizerFast,
)

self.is_sentencepiece = isinstance(
    self.tokenizer,
    (LlamaTokenizer, LlamaTokenizerFast, CodeLlamaTokenizer, CodeLlamaTokenizerFast),
)

And I got the error below:
/outlines/text/generate/regex.py:118 in create_proposal

  115
  116     sequence = self.model.tokenizer.decode(readable_tokens)
  117
❱ 118     ((_, state_seq),) = find_partial_matches(
  119         self.regex_fsm,
  120         "".join(sequence),
  121         start_state=last_fsm_state,
ValueError: not enough values to unpack (expected 1, got 0)

My test.py is just the same as the example at https://github.com/outlines-dev/outlines#efficient-json-generation-following-a-pydantic-model
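
For reference, a condensed, paraphrased sketch of that README example under outlines' API at the time (models.transformers plus text.generate.json); the schema fields are abbreviated here and the checkpoint name is illustrative:

from pydantic import BaseModel, constr

import outlines.models as models
import outlines.text.generate as generate

class Character(BaseModel):
    name: constr(max_length=10)
    age: int

# Illustrative checkpoint, standing in for the model used in the test;
# the README example itself uses a different one.
model = models.transformers("codellama/CodeLlama-7b-hf")
sequence = generate.json(model, Character)("Give me a character description")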

@brandonwillard
Contributor Author

> ValueError: not enough values to unpack (expected 1, got 0)

Yeah, that looks like a different issue with the Llama models altogether. I'll check it out.

@brandonwillard
Contributor Author

brandonwillard commented Sep 16, 2023

That issue is due to tokens that correspond to empty strings (see #273 (comment)). The general issue of empty strings and our regex sampling logic might go beyond this Llama-specific tokenization issue, though. I address it directly in #272, which replaces and simplifies all the regex logic.
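
To make the failure mode concrete, here is a hypothetical, self-contained illustration (not the library's code): tokens that decode to empty strings leave nothing for the partial-match search to consume, so it yields no (match, state-sequence) pairs, and the single-result unpacking at regex.py:118 raises exactly the error above.

# Stand-in for find_partial_matches(...) when every readable token
# decodes to "" and the joined candidate string is empty:
def find_matches(string):
    return ()

matches = find_matches("".join(["", ""]))  # no partial matches

# The call site assumes exactly one (match, state_seq) result, so
# unpacking an empty sequence reproduces the traceback:
((_, state_seq),) = matches
# ValueError: not enough values to unpack (expected 1, got 0)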

Labels
bug, transformers (Linked to the `transformers` integration)

Successfully merging this pull request may close these issues:
Llama Models decoding produces white spaces between characters