
Add Integer sequence generator #166

Merged: 1 commit, Jul 7, 2023

Conversation

@rlouf rlouf (Member) commented Jun 29, 2023

In this PR I introduce a sequence generator that only outputs integers. Closes #153.

The generator can currently produce sequences with leading zeros, which are then trimmed during post-processing. I am not sure what we should do about this. On the one hand, we can ignore it (the LLM may have seen leading zeros during training); on the other, we could use the pre-processing approach described in #131.
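The leading-zero distinction can be illustrated with plain regular expressions. This is a sketch for illustration only; NAIVE_INT, STRICT_INT, and is_strict_int are hypothetical names, not part of the library:

```python
import re

# A naive integer pattern that permits leading zeros, and a stricter
# one (0, or a nonzero digit followed by any digits) that forbids them.
NAIVE_INT = re.compile(r"[0-9]+")
STRICT_INT = re.compile(r"0|[1-9][0-9]*")

def is_strict_int(text: str) -> bool:
    """Return True if `text` is an integer without leading zeros."""
    return STRICT_INT.fullmatch(text) is not None

assert NAIVE_INT.fullmatch("007") is not None  # leading zeros accepted
assert not is_strict_int("007")                # leading zeros rejected
assert is_strict_int("0")
assert is_strict_int("42")
```

Constraining generation with the stricter pattern would remove the need to trim zeros in post-processing.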

@rlouf rlouf added the "text" (Linked to text generation) and "enhancement" labels Jun 29, 2023
@rlouf rlouf force-pushed the add-integer-generation branch 2 times, most recently from 88ebc06 to 47b236a Compare June 29, 2023 14:21
@rlouf rlouf force-pushed the add-integer-generation branch 2 times, most recently from 1560beb to 9cdc0a0 Compare July 6, 2023 13:51
@rlouf rlouf (Member, Author) commented Jul 6, 2023

The code now makes use of the tools introduced in #131 and the follow-up #170 to pre-compute the token masks and advance the parsing state. We allow the EOS token to be generated as well, and forbid leading zeros.

TODO

  • Move the torch.nn.functional.softmax call from Transformers.__call__ to Sequence.step, after creating the proposal
  • Make this work for batch generation
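As a rough illustration of the mask-then-normalise step the first TODO alludes to (apply the FSM mask to the logits, then take the softmax), here is a minimal dependency-free sketch. masked_softmax is a hypothetical name, and real code would operate on torch tensors:

```python
import math

def masked_softmax(logits, mask):
    """Toy sketch: set disallowed logits to -inf, then normalise.

    `mask[i]` is True when token i is allowed by the current FSM state.
    """
    masked = [l if allowed else float("-inf") for l, allowed in zip(logits, mask)]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]

probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
assert probs[1] == 0.0                   # masked token gets probability 0
assert abs(sum(probs) - 1.0) < 1e-9      # the rest renormalise to 1
```

Masking before the softmax (rather than zeroing probabilities afterwards) keeps the allowed tokens' probabilities properly normalised.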

@rlouf rlouf force-pushed the add-integer-generation branch 2 times, most recently from 1705594 to 13ad481 Compare July 6, 2023 14:53
@rlouf rlouf marked this pull request as ready for review July 6, 2023 14:56
@rlouf rlouf force-pushed the add-integer-generation branch 2 times, most recently from 7e7b635 to 130ec8b Compare July 6, 2023 15:04
@brandonwillard brandonwillard (Contributor) left a comment

Minor comments/questions; looks good otherwise.

    """
    if generated_token_ids.shape[-1] > 0:
        sampled_sequence = self.model.tokenizer.decode(generated_token_ids)
        partial_matches = find_partial_matches(self.int_regex_fsm, sampled_sequence)

Just a reminder that we can avoid reparsing from the beginning of the sequence, but that's fine as a follow-up.
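The incremental-parsing idea mentioned here can be sketched with a hand-written FSM for 0|[1-9][0-9]*: keep the state reached so far and feed only the newly generated characters, instead of re-matching the full decoded sequence at every step. The state names and functions below are illustrative, not the library's actual API:

```python
# Hand-written FSM for the pattern `0|[1-9][0-9]*`.
START, ZERO, NONZERO, DEAD = 0, 1, 2, 3

def step(state: int, char: str) -> int:
    """Advance the FSM by one character."""
    if state == START:
        if char == "0":
            return ZERO            # "0" is complete; nothing may follow
        if char in "123456789":
            return NONZERO
        return DEAD
    if state == NONZERO and char in "0123456789":
        return NONZERO
    return DEAD                    # ZERO or DEAD: no valid continuation

def advance(state: int, new_text: str) -> int:
    """Feed only the characters generated since the last step."""
    for ch in new_text:
        state = step(state, ch)
    return state

state = advance(START, "4")        # after generating "4"
state = advance(state, "2")        # only the new character is parsed
assert state == NONZERO            # "42" is a valid integer so far
assert advance(START, "07") == DEAD  # leading zero rejected
```

Carrying the FSM state across generation steps makes each step O(new characters) rather than O(total sequence length).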

outlines/text/generate/integer.py (outdated; resolved)
@rlouf rlouf force-pushed the add-integer-generation branch 2 times, most recently from f43cc43 to f1cb953 Compare July 7, 2023 14:22
@rlouf rlouf (Member, Author) commented Jul 7, 2023

The integer generation is working, although we currently cannot handle the end-of-sequence (EOS) token properly. To generate integers, we want the generated text to match the regex (0|[1-9][0-9]*)<EOS>, where <EOS> is the EOS token. The corresponding finite-state machine contains the character transitions "<" -> "E" -> "O" -> "S" -> ">", but we can end up in a situation where, for instance, "<EOS" is in the vocabulary but ">" is not, stalling the generation. Even if ">" were part of the vocabulary, the generation would still be wrong, since we only want to allow a transition to the token id that corresponds to <EOS>.

Since EOS is always a final state, we shouldn't have too much trouble modifying the alphabet/transition map of the FSM to get the desired behavior. This is non-blocking; I will open an issue that will need a follow-up PR.
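One way to picture the proposed fix, treating <EOS> as a single symbol in the FSM alphabet rather than as the characters "<", "E", "O", "S", ">", is this toy sketch (allowed_next_symbols is a hypothetical helper, not the library's API):

```python
# The EOS token is one atomic symbol in the FSM alphabet, so it can
# never be half-generated the way the characters "<EOS" + ">" can.
EOS = "<EOS>"

def allowed_next_symbols(partial: str) -> set:
    """Symbols that may follow `partial` under `(0|[1-9][0-9]*)<EOS>`."""
    digits = set("0123456789")
    if partial == "":
        return digits               # must start with a digit
    if partial == "0":
        return {EOS}                # "0" may only be followed by EOS
    if partial[0] in "123456789" and all(c in digits for c in partial):
        return digits | {EOS}       # more digits, or stop
    return set()                    # invalid prefix: dead state

assert allowed_next_symbols("0") == {EOS}
assert EOS in allowed_next_symbols("42")
```

Because EOS is atomic here, the generator can mask the sampler down to exactly the EOS token id in final states, instead of walking character transitions that may not line up with the vocabulary.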

@rlouf rlouf merged commit 48b91ea into outlines-dev:main Jul 7, 2023
4 checks passed
@rlouf rlouf deleted the add-integer-generation branch July 7, 2023 15:28
@rlouf rlouf mentioned this pull request Jul 7, 2023
4 tasks
Labels: enhancement, text (Linked to text generation)

Successfully merging this pull request may close these issues.

Add an Integer generation method