
Add Integer sequence generator #166

Merged: 1 commit, Jul 7, 2023

Conversation

@rlouf rlouf (Member) commented Jun 29, 2023

In this PR I introduce a sequence generator that only outputs integers. Closes #153.

The generator can currently produce sequences with leading zeros, which are then trimmed during post-processing. I am not sure what we should do about this. On the one hand, we can ignore it (the LLM may have seen leading zeros during training); on the other, we could use the pre-processing approach described in #131.
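The leading-zero distinction can be illustrated with plain regular expressions. This is a sketch for illustration only; NAIVE_INT, STRICT_INT, and is_strict_int are hypothetical names, not part of the library:

```python
import re

# A naive integer pattern that permits leading zeros, and a stricter
# one (0, or a nonzero digit followed by any digits) that forbids them.
NAIVE_INT = re.compile(r"[0-9]+")
STRICT_INT = re.compile(r"0|[1-9][0-9]*")

def is_strict_int(text: str) -> bool:
    """Return True if `text` is an integer without leading zeros."""
    return STRICT_INT.fullmatch(text) is not None

assert NAIVE_INT.fullmatch("007") is not None  # leading zeros accepted
assert not is_strict_int("007")                # leading zeros rejected
assert is_strict_int("0")
assert is_strict_int("42")
```

Constraining generation with the stricter pattern would remove the need to trim zeros in post-processing.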

@rlouf rlouf added the "text" (Linked to text generation) and "enhancement" labels Jun 29, 2023
@rlouf rlouf force-pushed the add-integer-generation branch 2 times, most recently from 88ebc06 to 47b236a Compare June 29, 2023 14:21
@rlouf rlouf force-pushed the add-integer-generation branch 2 times, most recently from 1560beb to 9cdc0a0 Compare July 6, 2023 13:51
@rlouf rlouf (Member, Author) commented Jul 6, 2023

The code now makes use of the tools introduced in #131 and the follow-up #170 to pre-compute the token masks and advance the parsing state. We allow the EOS token to be generated as well, and forbid leading zeros.

TODO

  • Move the torch.nn.functional.softmax call from Transformers.__call__ to Sequence.step, after creating the proposal
  • Make this work for batch generation
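As a rough illustration of the mask-then-normalise step the first TODO alludes to (apply the FSM mask to the logits, then take the softmax), here is a minimal dependency-free sketch. masked_softmax is a hypothetical name, and real code would operate on torch tensors:

```python
import math

def masked_softmax(logits, mask):
    """Toy sketch: set disallowed logits to -inf, then normalise.

    `mask[i]` is True when token i is allowed by the current FSM state.
    """
    masked = [l if allowed else float("-inf") for l, allowed in zip(logits, mask)]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]

probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
assert probs[1] == 0.0                   # masked token gets probability 0
assert abs(sum(probs) - 1.0) < 1e-9      # the rest renormalise to 1
```

Masking before the softmax (rather than zeroing probabilities afterwards) keeps the allowed tokens' probabilities properly normalised.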

@rlouf rlouf force-pushed the add-integer-generation branch 2 times, most recently from 1705594 to 13ad481 Compare July 6, 2023 14:53
@rlouf rlouf marked this pull request as ready for review July 6, 2023 14:56
@rlouf rlouf force-pushed the add-integer-generation branch 2 times, most recently from 7e7b635 to 130ec8b Compare July 6, 2023 15:04
@brandonwillard brandonwillard (Contributor) left a comment

Minor comments/questions; looks good otherwise.

    """
    if generated_token_ids.shape[-1] > 0:
        sampled_sequence = self.model.tokenizer.decode(generated_token_ids)
        partial_matches = find_partial_matches(self.int_regex_fsm, sampled_sequence)

Just a reminder that we can avoid reparsing from the beginning of the sequence, but that's fine as a follow-up.
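The incremental-parsing idea mentioned here can be sketched with a hand-written FSM for 0|[1-9][0-9]*: keep the state reached so far and feed only the newly generated characters, instead of re-matching the full decoded sequence at every step. The state names and functions below are illustrative, not the library's actual API:

```python
# Hand-written FSM for the pattern `0|[1-9][0-9]*`.
START, ZERO, NONZERO, DEAD = 0, 1, 2, 3

def step(state: int, char: str) -> int:
    """Advance the FSM by one character."""
    if state == START:
        if char == "0":
            return ZERO            # "0" is complete; nothing may follow
        if char in "123456789":
            return NONZERO
        return DEAD
    if state == NONZERO and char in "0123456789":
        return NONZERO
    return DEAD                    # ZERO or DEAD: no valid continuation

def advance(state: int, new_text: str) -> int:
    """Feed only the characters generated since the last step."""
    for ch in new_text:
        state = step(state, ch)
    return state

state = advance(START, "4")        # after generating "4"
state = advance(state, "2")        # only the new character is parsed
assert state == NONZERO            # "42" is a valid integer so far
assert advance(START, "07") == DEAD  # leading zero rejected
```

Carrying the FSM state across generation steps makes each step O(new characters) rather than O(total sequence length).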

outlines/text/generate/integer.py (outdated; resolved)
@rlouf rlouf force-pushed the add-integer-generation branch 2 times, most recently from f43cc43 to f1cb953 Compare July 7, 2023 14:22
@rlouf rlouf (Member, Author) commented Jul 7, 2023

The integer generation is working, although we currently cannot handle the end-of-sequence (EOS) token properly. To generate integers, we want the generated text to match the regex (0|[1-9][0-9]*)<EOS>, where <EOS> is the EOS token. The corresponding finite-state machine contains the character transitions "<" -> "E" -> "O" -> "S" -> ">", but we can end up in a situation where, for instance, "<EOS" is in the vocabulary but ">" is not, stalling the generation. Even if ">" were part of the vocabulary, the generation would still be wrong, since we only want to allow a transition to the token id that corresponds to <EOS>.

Since EOS is always a final state, we shouldn't have too much trouble modifying the alphabet/transition map of the FSM to get the desired behavior. This is non-blocking; I will open an issue that will need a follow-up PR.
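One way to picture the proposed fix, treating <EOS> as a single symbol in the FSM alphabet rather than as the characters "<", "E", "O", "S", ">", is this toy sketch (allowed_next_symbols is a hypothetical helper, not the library's API):

```python
# The EOS token is one atomic symbol in the FSM alphabet, so it can
# never be half-generated the way the characters "<EOS" + ">" can.
EOS = "<EOS>"

def allowed_next_symbols(partial: str) -> set:
    """Symbols that may follow `partial` under `(0|[1-9][0-9]*)<EOS>`."""
    digits = set("0123456789")
    if partial == "":
        return digits               # must start with a digit
    if partial == "0":
        return {EOS}                # "0" may only be followed by EOS
    if partial[0] in "123456789" and all(c in digits for c in partial):
        return digits | {EOS}       # more digits, or stop
    return set()                    # invalid prefix: dead state

assert allowed_next_symbols("0") == {EOS}
assert EOS in allowed_next_symbols("42")
```

Because EOS is atomic here, the generator can mask the sampler down to exactly the EOS token id in final states, instead of walking character transitions that may not line up with the vocabulary.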

@rlouf rlouf merged commit 48b91ea into outlines-dev:main Jul 7, 2023
4 checks passed
@rlouf rlouf deleted the add-integer-generation branch July 7, 2023 15:28
@rlouf rlouf mentioned this pull request Jul 7, 2023
4 tasks
Labels: enhancement, text (Linked to text generation)

Successfully merging this pull request may close these issues.

Add an Integer generation method