Stop generation with `Continuation` when a specific string was generated #187

rlouf · 2023-07-13T11:21:02Z

Closes #151

arunpatro · 2023-07-14T20:18:39Z

tests/text/generate/test_integration_transfomers.py

    )
    assert isinstance(sequence, str)

-    prompts = ["Write a short sentence", "And another one"]
+    prompts = ["Write a short sentence ", "And another one "]


I am just curious: why did you add trailing whitespace to the prompts? According to guidance, it's preferable to end prompts without a trailing space or newline, because frequent tokens already come with a space before the word.

Good question, that came intuitively with GPT2 and numbers. I'll look at the vocabulary directly to see if that's actually the right thing to do.

I came back to this and looked at the vocabulary of the GPT2 tokenizer. It is true that most of the tokens begin with a space.
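A quick way to make this kind of check concrete is to count the tokens that carry the leading-space marker. This is a minimal sketch, assuming GPT-2's byte-level BPE convention in which a leading space is encoded as `Ġ`. The helper name `leading_space_fraction` and the toy vocabulary are hypothetical. With the `transformers` library, the real vocabulary could be passed in via `tokenizer.get_vocab()`.

```python
def leading_space_fraction(vocab, marker="Ġ"):
    """Return the fraction of tokens in `vocab` that start with `marker`."""
    if not vocab:
        return 0.0
    with_space = sum(1 for token in vocab if token.startswith(marker))
    return with_space / len(vocab)


# Toy vocabulary for illustration only; this is NOT the real GPT-2 vocabulary.
toy_vocab = {"Ġthe": 0, "Ġart": 1, "art": 2, "Ġone": 3}
fraction = leading_space_fraction(toy_vocab)
```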

Your point highlights something that we need to be very careful about, and which might be incorrectly implemented in outlines.

It should not affect this PR since we're partially matching on text, so "/n" will match " /n". However, the regex `[a-z]{3}` will allow "art" to be generated, but not " art". This could make it impossible to generate what would otherwise be the most probable completion.
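The difference between the two matching modes can be sketched with the standard library alone. This illustrates the behaviour described above and is not the code in this PR; `should_stop` is a hypothetical helper.

```python
import re


def should_stop(generated_text, stop_strings):
    """Partial match: True as soon as any stop string occurs in the decoded text."""
    return any(stop in generated_text for stop in stop_strings)


# A substring check on decoded text is indifferent to a preceding space.
stops_without_space = should_stop("A short sentence/n", ["/n"])
stops_with_space = should_stop("A short sentence /n", ["/n"])

# A full regex match is not: the leading space is a character the pattern
# does not allow, so " art" is rejected even though "art" is accepted.
pattern = re.compile(r"[a-z]{3}")
accepts_art = pattern.fullmatch("art") is not None
accepts_space_art = pattern.fullmatch(" art") is not None
```

If the most probable token after the prompt is the space-prefixed one, a constraint like this rules it out, which is the concern raised here.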

I need to dig more into this. I opened #193 to keep track of my thinking on this.

Token healing (tracked by #161) should ensure that this kind of quirk doesn't affect generation. Users shouldn't have to worry about the effects of tokenization.

You are right that token healing should be able to handle all of these nuances.

arunpatro · 2023-07-14T20:19:03Z

Looks good to me.

rlouf added the text (Linked to text generation) and enhancement labels Jul 13, 2023

rlouf added this to the 0.1 milestone Jul 13, 2023

rlouf force-pushed the continuation-stop-at branch 2 times, most recently from ec281ef to dfb6f32 July 13, 2023 13:27

Stop generation with Continuation when a specific string was generated

17450ff

rlouf force-pushed the continuation-stop-at branch from dfb6f32 to 17450ff July 13, 2023 13:35

rlouf marked this pull request as ready for review July 13, 2023 13:43

rlouf requested review from brandonwillard and arunpatro July 14, 2023 06:21

arunpatro reviewed Jul 14, 2023


arunpatro approved these changes Jul 14, 2023


brandonwillard approved these changes Jul 15, 2023


rlouf merged commit bfa0e94 into outlines-dev:main Jul 15, 2023
4 checks passed

rlouf deleted the continuation-stop-at branch July 17, 2023 14:03


rlouf commented Jul 13, 2023

arunpatro Jul 14, 2023

rlouf Jul 15, 2023

rlouf Jul 17, 2023 (edited)

arunpatro Jul 17, 2023

arunpatro commented Jul 14, 2023
