Error in outlines.generate.choice: create_states_mapping throws ValueError: not enough values to unpack (expected 3, got 2) #585

Closed
dnhkng opened this issue Jan 25, 2024 · 9 comments
Labels
bug structured generation Linked to structured generation

Comments

@dnhkng
Contributor

dnhkng commented Jan 25, 2024

Describe the issue as clearly as possible:

When I try the examples on the GitHub front page, some of them do not work in a fresh conda environment.

Steps/code to reproduce the bug:

import outlines

model = outlines.models.transformers("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="cuda")

prompt = "1+1="
answer = outlines.generate.format(model, int)(prompt)

prompt = "sqrt(2)="

generator = outlines.generate.format(model, float)
answer = generator(prompt)

# answer is '2', a string, not a float!
# even worse:

model = outlines.models.transformers("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="cuda")


prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?

Review: This restaurant is just awesome!
"""

generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)

Expected result:

Either "Positive" or "Negative"

Error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 10
      1 model = outlines.models.transformers("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="cuda")
      4 prompt = """You are a sentiment-labelling assistant.
      5 Is the following review positive or negative?
      6 
      7 Review: This restaurant is just awesome!
      8 """
---> 10 generator = outlines.generate.choice(model, ["Positive", "Negative"])
     11 answer = generator(prompt)

File ~/miniforge3/envs/frankenmerge/lib/python3.10/site-packages/outlines/generate/api.py:412, in choice(model, choices, max_tokens, sampler)
    405 def choice(
    406     model,
    407     choices: List[str],
    408     max_tokens: Optional[int] = None,
    409     sampler: Sampler = multinomial,
    410 ):
    411     regex_str = r"(" + r"|".join(choices) + r")"
--> 412     return regex(model, regex_str, max_tokens, sampler)

File ~/miniforge3/envs/frankenmerge/lib/python3.10/site-packages/outlines/generate/api.py:373, in regex(model, regex_str, max_tokens, sampler)
    367 def regex(
    368     model,
    369     regex_str: str,
    370     max_tokens: Optional[int] = None,
    371     sampler: Sampler = multinomial,
    372 ):
--> 373     fsm = RegexFSM(regex_str, model.tokenizer)
    375     device = model.device
    376     generator = SequenceGenerator(fsm, model, sampler, device, max_tokens=max_tokens)

File ~/miniforge3/envs/frankenmerge/lib/python3.10/site-packages/outlines/fsm/fsm.py:136, in RegexFSM.__init__(self, regex_string, tokenizer)
    131     final_states = regex_fsm.finals | {
    132         -1
    133     }  # Include the EOS token in final states
    134     return states_to_token_maps, empty_token_ids, final_states
--> 136 (
    137     self.states_to_token_maps,
    138     self.empty_token_ids,
    139     self.final_states,
    140 ) = create_states_mapping(
    141     regex_string, tuple(sorted(tokenizer.vocabulary.items()))
    142 )
    143 self.num_tokens_generated = 0
    144 self.vocabulary = tokenizer.vocabulary.values()

ValueError: not enough values to unpack (expected 3, got 2)

Outlines/Python version information:

Version information

``` (command output here) ```

Context for the issue:

This is a very weird bug! If I run the "outlines.generate.format" code, very occasionally, I also get the "outlines.generate.choice" method to run too! But 99% of the time, I get this error.

I did some digging, and added some debug code:

        x = create_states_mapping(
            regex_string, tuple(sorted(tokenizer.vocabulary.items()))
        )
        print("Output tuple:")
        for item in x:
            print(f'{item=}')
        self.states_to_token_maps, self.empty_token_ids, self.final_states = x

When I run the working code, I see:

Output tuple:
item={0: {59: 3, 52: 3, 29945: 3, 54: 3, 29947: 3, 56: 3, 60: 3, 29896: 3, 58: 3, 29929: 3, 29953: 3, 48: 1, 46: 1, 29974: 1, 29899: 1, 29955: 3, 55: 3, 53: 3, 29946: 3, 29906: 3, 51: 2, 57: 3, 29941: 3, 29900: 2}, 1: {59: 3, 52: 3, 29945: 3, 54: 3, 29947: 3, 56: 3, 60: 3, 29896: 3, 58: 3, 29929: 3, 29953: 3, 29955: 3, 55: 3, 53: 3, 29946: 3, 29906: 3, 51: 2, 57: 3, 29941: 3, 29900: 2}, 2: {29872: 5, 72: 5, 29889: 4, 2: 2, 104: 5, 49: 4, 29923: 5}, 3: {59: 3, 29889: 4, 52: 3, 29945: 3, 49: 4, 72: 5, 104: 5, 54: 3, 29947: 3, 56: 3, 60: 3, 29896: 3, 58: 3, 29929: 3, 51: 3, 29953: 3, 29923: 5, 29872: 5, 29900: 3, 29955: 3, 55: 3, 53: 3, 29946: 3, 29906: 3, 2: 3, 57: 3, 29941: 3}, 4: {29955: 8, 55: 8, 29946: 8, 29906: 8, 57: 8, 29941: 8, 59: 8, 52: 8, 29945: 8, 54: 8, 29947: 8, 58: 8, 56: 8, 60: 8, 29896: 8, 29929: 8, 51: 8, 29953: 8, 53: 8, 29900: 8}, 5: {29899: 6, 48: 6, 29974: 6, 46: 6}, 6: {54: 7, 29947: 7, 56: 7, 29896: 7, 58: 7, 29929: 7, 60: 7, 29953: 7, 51: 7, 29900: 7, 29955: 7, 53: 7, 29946: 7, 55: 7, 29906: 7, 57: 7, 29941: 7, 59: 7, 52: 7, 29945: 7}, 7: {54: 7, 29947: 7, 56: 7, 29896: 7, 58: 7, 29929: 7, 60: 7, 29953: 7, 51: 7, 29900: 7, 29955: 7, 53: 7, 29946: 7, 55: 7, 29906: 7, 2: 7, 57: 7, 29941: 7, 59: 7, 52: 7, 29945: 7}, 8: {29955: 8, 55: 8, 29946: 8, 29906: 8, 2: 8, 57: 8, 29941: 8, 72: 5, 59: 8, 29923: 5, 52: 8, 29945: 8, 29872: 5, 54: 8, 29947: 8, 58: 8, 56: 8, 60: 8, 29896: 8, 29929: 8, 51: 8, 29953: 8, 104: 5, 53: 8, 29900: 8}}
item=set()
item=frozenset({2, 3, 7, 8, -1})
'1'

But the buggy code produces:

Output tuple:
item={0: {29925: 2, 81: 1, 29940: 1, 9135: 4, 83: 2, 8139: 10, 9837: 3}, 1: {387: 11, 2442: 5, 29872: 10, 104: 10}, 2: {8156: 5, 114: 3, 29877: 3, 359: 4}, 3: {29879: 4, 1039: 5, 118: 4}, 4: {29875: 5, 4812: 7, 3321: 9, 108: 5, 277: 6}, 5: {2034: 7, 119: 6, 29873: 6}, 6: {440: 8, 573: 9, 29875: 7, 108: 7}, 7: {29894: 8, 345: 9, 121: 8}, 8: {29872: 9, 104: 9}, 10: {106: 11, 3249: 5, 29887: 11, 28818: 6}, 11: {1230: 9, 271: 6, 2219: 7, 1926: 8, 29874: 5, 100: 5}}
item=set()
.
.
.
File ~/miniforge3/envs/frankenmerge/lib/python3.10/site-packages/outlines/fsm/fsm.py:149, in RegexFSM.__init__(self, regex_string, tokenizer)
    147 for item in x:
    148     print(f'{item=}')
--> 149 self.states_to_token_maps, self.empty_token_ids, self.final_states = x
    150 self.num_tokens_generated = 0
    151 self.vocabulary = tokenizer.vocabulary.values()

ValueError: not enough values to unpack (expected 3, got 2)

So the function "create_states_mapping" is not returning the frozenset, and the tuple has only 2 of the 3 items to unpack!

@dnhkng dnhkng added the bug label Jan 25, 2024
@lapp0
Contributor

lapp0 commented Jan 25, 2024

Could you please include your version info in the version section? There was a change recently which may have fixed this

python -c "from outlines import _version; print(_version.version)"
python -c "import sys; print('Python', sys.version)"
pip freeze

There's a good chance upgrading to latest (unreleased) 0.0.25 would fix this

pip install outlines git+https://github.com/outlines-dev/outlines

@dnhkng
Contributor Author

dnhkng commented Jan 25, 2024

I was on 0.0.24
I can confirm that 0.0.25 fixes the issue with "outlines.generate.choice"...

But now "outlines.generate.format" throws the same kind of error!

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 3
      1 prompt = "sqrt(2)="
----> 3 generator = outlines.generate.format(model, float)
      4 answer = generator(prompt)

File ~/miniforge3/envs/frankenmerge/lib/python3.10/site-packages/outlines/generate/api.py:396, in format(model, python_type, max_tokens, sampler)
    392 def format(
    393     model, python_type, max_tokens: Optional[int] = None, sampler: Sampler = multinomial
    394 ):
    395     regex_str = python_types_to_regex(python_type)
--> 396     return regex(model, regex_str, max_tokens, sampler)

File ~/miniforge3/envs/frankenmerge/lib/python3.10/site-packages/outlines/generate/api.py:370, in regex(model, regex_str, max_tokens, sampler)
    364 def regex(
    365     model,
    366     regex_str: str,
    367     max_tokens: Optional[int] = None,
    368     sampler: Sampler = multinomial,
    369 ):
--> 370     fsm = RegexFSM(regex_str, model.tokenizer)
    372     device = model.device
    373     generator = SequenceGenerator(fsm, model, sampler, device, max_tokens=max_tokens)

File ~/miniforge3/envs/frankenmerge/lib/python3.10/site-packages/outlines/fsm/fsm.py:120, in RegexFSM.__init__(self, regex_string, tokenizer)
    114         raise ValueError(
    115             "The vocabulary does not allow us to build a sequence that matches the input regex"
    116         )
    118     return states_to_token_maps, empty_token_ids
--> 120 self.states_to_token_maps, self.empty_token_ids = create_states_mapping(
    121     regex_string, tuple(sorted(tokenizer.vocabulary.items()))
    122 )
    123 self.vocabulary = tokenizer.vocabulary.values()
    124 self.eos_token_id = tokenizer.eos_token_id

ValueError: too many values to unpack (expected 2)

This is very weird, as I see that "create_states_mapping" should return only two objects: states_to_token_maps and empty_token_ids. But when I print what is returned, I see it's 3 objects:
({0: {59: 3, 52: 3, 29945: 3, 54: 3, 29947: 3, 56: 3, 60: 3, 29896: 3, 58: 3, 29929: 3, 29953: 3, 48: 1, 46: 1, 29974: 1, 29899: 1, 29955: 3, 55: 3, 53: 3, 29946: 3, 29906: 3, 51: 2, 57: 3, 29941: 3, 29900: 2}, 1: {59: 3, 52: 3, 29945: 3, 54: 3, 29947: 3, 56: 3, 60: 3, 29896: 3, 58: 3, 29929: 3, 29953: 3, 29955: 3, 55: 3, 53: 3, 29946: 3, 29906: 3, 51: 2, 57: 3, 29941: 3, 29900: 2}, 2: {29872: 5, 72: 5, 29889: 4, 2: 2, 104: 5, 49: 4, 29923: 5}, 3: {59: 3, 29889: 4, 52: 3, 29945: 3, 49: 4, 72: 5, 104: 5, 54: 3, 29947: 3, 56: 3, 60: 3, 29896: 3, 58: 3, 29929: 3, 51: 3, 29953: 3, 29923: 5, 29872: 5, 29900: 3, 29955: 3, 55: 3, 53: 3, 29946: 3, 29906: 3, 2: 3, 57: 3, 29941: 3}, 4: {29955: 8, 55: 8, 29946: 8, 29906: 8, 57: 8, 29941: 8, 59: 8, 52: 8, 29945: 8, 54: 8, 29947: 8, 58: 8, 56: 8, 60: 8, 29896: 8, 29929: 8, 51: 8, 29953: 8, 53: 8, 29900: 8}, 5: {29899: 6, 48: 6, 29974: 6, 46: 6}, 6: {54: 7, 29947: 7, 56: 7, 29896: 7, 58: 7, 29929: 7, 60: 7, 29953: 7, 51: 7, 29900: 7, 29955: 7, 53: 7, 29946: 7, 55: 7, 29906: 7, 57: 7, 29941: 7, 59: 7, 52: 7, 29945: 7}, 7: {54: 7, 29947: 7, 56: 7, 29896: 7, 58: 7, 29929: 7, 60: 7, 29953: 7, 51: 7, 29900: 7, 29955: 7, 53: 7, 29946: 7, 55: 7, 29906: 7, 2: 7, 57: 7, 29941: 7, 59: 7, 52: 7, 29945: 7}, 8: {29955: 8, 55: 8, 29946: 8, 29906: 8, 2: 8, 57: 8, 29941: 8, 72: 5, 59: 8, 29923: 5, 52: 8, 29945: 8, 29872: 5, 54: 8, 29947: 8, 58: 8, 56: 8, 60: 8, 29896: 8, 29929: 8, 51: 8, 29953: 8, 104: 5, 53: 8, 29900: 8}}, set(), frozenset({2, 3, 7, 8, -1}))

Running the outlines.generate.choice method returns 2 objects correctly, the dictionary, and the set.

@dnhkng
Contributor Author

dnhkng commented Jan 25, 2024

Maybe found a quick fix: commenting out the @cache seems to fix this!

i.e.

class RegexFSM(FSM):
    """FSM to generate text that is in the language of a regular expression."""

    def __init__(self, regex_string: str, tokenizer: "Tokenizer"):
        # @cache()
        def create_states_mapping(

Not sure how this affects performance though, but:

prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?

Review: This restaurant is just awesome!
"""

generator = outlines.generate.choice(model, ["Positive", "Negative"])

for i in range(100):
    answer = generator(prompt)

With cache commented out:
4.38 s ± 429 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With cache:
4.06 s ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
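
For anyone trying to picture how a cache could produce both errors above, here is a minimal, self-contained sketch. It does not use Outlines' actual caching code: `disk_cache` is a hypothetical toy decorator that persists results to disk keyed only by function name and arguments, and the two `create_states_mapping` definitions are stand-ins for an old and a new release that return different numbers of values.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def disk_cache(cache_dir: Path):
    """Toy persistent cache keyed only by function name and arguments (hypothetical)."""

    def decorator(fn):
        def wrapper(*args):
            digest = hashlib.sha256(repr((fn.__name__, args)).encode()).hexdigest()
            path = cache_dir / f"{digest}.pkl"
            if path.exists():
                # A hit returns whatever an earlier version of fn stored.
                return pickle.loads(path.read_bytes())
            result = fn(*args)
            path.write_bytes(pickle.dumps(result))
            return result

        return wrapper

    return decorator


cache_dir = Path(tempfile.mkdtemp())


# Stand-in for the "old release": two return values.
@disk_cache(cache_dir)
def create_states_mapping(regex_string):
    return {0: {1: 1}}, set()


create_states_mapping("(Positive|Negative)")  # populates the on-disk cache


# Stand-in for the "new release": same name and arguments, three return values.
@disk_cache(cache_dir)
def create_states_mapping(regex_string):
    return {0: {1: 1}}, set(), frozenset({1, -1})


try:
    maps, empty_ids, finals = create_states_mapping("(Positive|Negative)")
    error = None
except ValueError as exc:
    error = str(exc)

# A regex that was never cached is computed by the new code and unpacks fine.
fresh = create_states_mapping("[0-9]+")

print(error)
print(len(fresh))
```

Under these assumptions, a regex that was cached by the old code fails to unpack, while a regex that was never cached works, which would be consistent with the error appearing for some calls and not others.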

@lapp0
Contributor

lapp0 commented Jan 25, 2024

#566 should have fixed this. It invalidates the cache if the version is upgraded.

Can you confirm which version you are running via python -c "from outlines import _version; print(_version.version)"? (This command includes the git revision, which is helpful for me.)

@dnhkng
Contributor Author

dnhkng commented Jan 25, 2024

I'm on: 0.0.25.dev15+g0cd9608

@lapp0
Contributor

lapp0 commented Jan 25, 2024

I found the source of the issue. The Outlines cache is cleared when the version changes. However, installing from git via pip doesn't seem to set the version the same way that running pip install . from the repo directory does.

root@C.8986380:~$ pip install outlines git+https://github.com/outlines-dev/outlines -q
root@C.8986380:~$ python3 -c "from outlines._version import __version__ as outlines_version; print(outlines_version)"
0.0.24

We need to ensure that the version reported by from outlines._version import __version__ is distinct even when Outlines is installed via pip. Thanks for helping us discover this!
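
To illustrate why an unchanged version string defeats this kind of invalidation, here is a rough sketch of version-keyed cache clearing. It is not the code from #566: the function name `invalidate_if_version_changed`, the `VERSION` stamp file, and the demo directory are all assumptions made up for the example.

```python
import shutil
import tempfile
from pathlib import Path


def invalidate_if_version_changed(cache_dir: Path, current_version: str) -> bool:
    """Clear cache_dir when its stored version stamp differs from current_version.

    Returns True if the cache was cleared. The stamp file name is hypothetical.
    """
    stamp = cache_dir / "VERSION"
    previous = stamp.read_text() if stamp.exists() else None
    cleared = False
    if previous is not None and previous != current_version:
        shutil.rmtree(cache_dir)
        cleared = True
    cache_dir.mkdir(parents=True, exist_ok=True)
    stamp.write_text(current_version)
    return cleared


# Throwaway directory so the demo never touches a real cache.
cache_dir = Path(tempfile.mkdtemp()) / "outlines-demo"
invalidate_if_version_changed(cache_dir, "0.0.24")
(cache_dir / "entry.pkl").write_bytes(b"stale")

# New code, but the reported version is still 0.0.24: nothing is cleared.
same = invalidate_if_version_changed(cache_dir, "0.0.24")
survived = (cache_dir / "entry.pkl").exists()

# A distinct version string triggers the clear.
changed = invalidate_if_version_changed(cache_dir, "0.0.25.dev15+g0cd9608")
gone = not (cache_dir / "entry.pkl").exists()
```

In this sketch the stale entry survives as long as the reported version stays at 0.0.24, which matches the symptom seen with the git install shown above.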


@dnhkng as a temporary fix I recommend running rm -rf ~/.cache/outlines
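
If a shell isn't handy (for example inside a notebook), roughly the same cleanup can be done from Python. This is only a sketch: `clear_cache` is a helper name invented here, and the demo runs against a temporary directory rather than the real cache.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache(cache_dir: Path) -> None:
    """Delete cache_dir and everything under it; do nothing if it is absent."""
    shutil.rmtree(cache_dir, ignore_errors=True)


# Demonstration on a throwaway directory so nothing real is deleted.
demo_dir = Path(tempfile.mkdtemp()) / "outlines"
demo_dir.mkdir()
(demo_dir / "cache.db").write_bytes(b"stale entry")
clear_cache(demo_dir)

# For the cache location mentioned above, the call would be:
# clear_cache(Path.home() / ".cache" / "outlines")
```

Restart the Python process or notebook kernel afterwards so no stale state is held in memory.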

@lapp0
Contributor

lapp0 commented Jan 25, 2024

Best route forward IMO:

  • Upgrade outlines to 0.0.25 to incorporate the cache invalidation fix in an official release (@rlouf)
  • Diagnose the setuptools_scm issue (after a brief review, I think it's upstream), and in the meantime recommend installing prereleases via git clone <>, pip install .

@rlouf rlouf added the structured generation Linked to structured generation label Jan 26, 2024
@lapp0
Contributor

lapp0 commented Jan 26, 2024

Works in my environment. @dnhkng could you please confirm your reproduction code no longer fails in your conda environment if you run

rm -rf ~/.cache/outlines
pip install outlines==0.0.24
python3 your_script_in_original_post.py
pip install outlines==0.0.25
python3 your_script_in_original_post.py

@dnhkng
Contributor Author

dnhkng commented Jan 26, 2024

Looks ok now!

I'm still getting strings instead of floats, but I've raised a separate issue for that.

@dnhkng dnhkng closed this as completed Jan 26, 2024

3 participants