
Fix whitespace and control character handling in JSON guidance #283

Merged

Conversation

brandonwillard (Contributor) commented Sep 16, 2023

This PR allows whitespace characters according to the JSON grammar and disallows control characters in strings.

It also cleans up the project dependencies and pins beartype to a version below 0.16.0, because that release appears to be out of sync with pytorch, which depends on it.
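
To illustrate what this means in practice, here is a minimal sketch (illustrative only, not the actual patch in this PR; the pattern names are made up for the example):

import re

# JSON whitespace between structural tokens is limited to exactly these four
# characters: space, tab, line feed, and carriage return.
JSON_WS = r"[ \t\n\r]*"

# A JSON string may not contain unescaped '"', '\', or control characters
# (U+0000-U+001F); escapes are limited to \" \\ \/ \b \f \n \r \t and \uXXXX.
JSON_STRING = r'"(?:[^"\\\x00-\x1f]|\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4}))*"'

assert re.fullmatch(JSON_STRING, '"hello\\nworld"')         # escaped newline: allowed
assert re.fullmatch(JSON_STRING, '"hello\nworld"') is None  # raw control character: rejected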

brandonwillard added the bug, enhancement, and structured generation labels on Sep 16, 2023
brandonwillard self-assigned this on Sep 16, 2023
brandonwillard force-pushed the fix-json-whitespace-ctrl-chars branch 6 times, most recently from f9cd7e6 to f1e925f, on September 16, 2023 at 06:45
brandonwillard merged commit ff4ebb3 into outlines-dev:main on Sep 16, 2023
4 checks passed
brandonwillard deleted the fix-json-whitespace-ctrl-chars branch on September 16, 2023 at 07:10
AL-377 (Contributor) commented Sep 16, 2023

I have tried CodeLlama with the latest version of outlines, and it seems that the whitespace issue is fixed. But sometimes it raises an error like the following:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/jeeves/pr/test.py:43 in <module>                                                           │
│                                                                                                  │
│   40                                                                                             │
│   41                                                                                             │
│   42 generator = generate.json(model, Character)                                                 │
│ ❱ 43 sequence = generator("Give me a character description")                                     │
│   44 print(sequence)                                                                             │
│   45                                                                                             │
│   46                                                                                             │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/torch/autograd/grad_mode.py:27 in decorate_context       │
│                                                                                                  │
│    24 │   │   @functools.wraps(func)                                                             │
│    25 │   │   def decorate_context(*args, **kwargs):                                             │
│    26 │   │   │   with self.clone():                                                             │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                                               │
│    28 │   │   return cast(F, decorate_context)                                                   │
│    29 │                                                                                          │
│    30 │   def _wrap_generator(self, func):                                                       │
│                                                                                                  │
│ /home/jeeves/pr/outlines/outlines/text/generate/sequence.py:225 in __call__                      │
│                                                                                                  │
│   222 │   │   │   if torch.all(is_finished) or num_generated_tokens == self.max_tokens:          │
│   223 │   │   │   │   break                                                                      │
│   224 │   │   │                                                                                  │
│ ❱ 225 │   │   │   updated_token_ids, _ = self.step(                                              │
│   226 │   │   │   │   rng,                                                                       │
│   227 │   │   │   │   num_prompt_tokens,                                                         │
│   228 │   │   │   │   token_ids[~is_finished],                                                   │
│                                                                                                  │
│ /home/jeeves/pr/outlines/outlines/text/generate/sequence.py:82 in step                           │
│                                                                                                  │
│    79 │   │   """                                                                                │
│    80 │   │   num_input_dims = token_ids.ndim                                                    │
│    81 │   │   probs = self.model(token_ids, attention_mask)                                      │
│ ❱  82 │   │   probs = self.create_proposal(token_ids[:, num_prompt_tokens:], probs)              │
│    83 │   │   probs = torch.nn.functional.softmax(probs, dim=-1)                                 │
│    84 │   │                                                                                      │
│    85 │   │   # Sample `samples`-many new tokens.                                                │
│                                                                                                  │
│ /home/jeeves/pr/outlines/outlines/text/generate/regex.py:118 in create_proposal                  │
│                                                                                                  │
│   115 │   │   │   │   │                                                                          │
│   116 │   │   │   │   │   sequence = self.model.tokenizer.decode(readable_tokens)                │
│   117 │   │   │   │   │                                                                          │
│ ❱ 118 │   │   │   │   │   ((_, state_seq),) = find_partial_matches(                              │
│   119 │   │   │   │   │   │   self.regex_fsm,                                                    │
│   120 │   │   │   │   │   │   "".join(sequence),                                                 │
│   121 │   │   │   │   │   │   start_state=last_fsm_state, 

brandonwillard (Contributor, Author) commented

[quotes AL-377's comment and traceback above]

Yeah, see #280 (comment). You can try it again with the changes in #272, and you shouldn't see that issue; however, if you do, or see another issue, don't hesitate to report it!

AL-377 (Contributor) commented Sep 20, 2023

[quotes the traceback and brandonwillard's reply above]

Sorry, but the problem remains:

│ /home/jeeves/pr/outlines/outlines/text/generate/regex.py:118 in create_proposal                  │
│                                                                                                  │
│   115 │   │   │   │   │                                                                          │
│   116 │   │   │   │   │   sequence = self.model.tokenizer.decode(readable_tokens)                │
│   117 │   │   │   │   │                                                                          │
│ ❱ 118 │   │   │   │   │   ((_, state_seq),) = find_partial_matches(                              │
│   119 │   │   │   │   │   │   self.regex_fsm,                                                    │
│   120 │   │   │   │   │   │   "".join(sequence),                                                 │
│   121 │   │   │   │   │   │   start_state=last_fsm_state,                                        │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: not enough values to unpack (expected 1, got 0)
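
For reference, that ValueError is simply what one-element unpacking raises when find_partial_matches returns no matches at all; a minimal reproduction (the empty result is an assumption here):

# The destructuring at regex.py:118 expects find_partial_matches to return
# exactly one two-element match, so an empty result fails immediately.
matches = []  # assume find_partial_matches found no partial match
((_, state_seq),) = matches  # ValueError: not enough values to unpack (expected 1, got 0)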

AL-377 (Contributor) commented Sep 20, 2023

[quotes the earlier traceback, reply, and ValueError above]

And my main.py is as below:

from enum import Enum
from pydantic import BaseModel, constr

import sys
sys.path.append('outlines')

import outlines.models as models
import outlines.text.generate as generate

import torch

class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"

class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"

class Character(BaseModel):
    name: constr(max_length=10)
    age: int
    armor: Armor
    weapon: Weapon
    strength: int

model_path = "/mnt/data/user/models/CodeLlama-7b-hf"

model = models.transformers(model_path, device="cuda")

generator = generate.json(model, Character)
sequence = generator("Give me a character description")
print(sequence)

sequence = generator("Give me an interesting character description")
print(sequence)

brandonwillard (Contributor, Author) commented

Those tracebacks are from the current code in main and/or a previous release. You'll need to install a copy of the #272 branch (e.g. pip install git+https://github.com/brandonwillard/outlines.git@numba-fsa-implementation).

AL-377 (Contributor) commented Sep 21, 2023


Yes, I have tried the #272 branch, and the problem remains the same. I am not sure why it raises "not enough values to unpack (expected 1, got 0)".

brandonwillard (Contributor, Author) commented


If find_partial_matches is showing up in your traces, then the changes in #272 aren't being used, because that function was removed from the regex implementation.
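
One way to check which code is actually being imported (a hedged sketch; the module path is taken from the tracebacks above, and the attribute test assumes the old implementation exposes find_partial_matches in that module, as the tracebacks suggest):

# Print where the regex module is loaded from, and whether the pre-#272
# helper is still present in it.
import outlines.text.generate.regex as regex_mod
print(regex_mod.__file__)                           # which checkout is loaded
print(hasattr(regex_mod, "find_partial_matches"))   # True -> old code path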
