Add `Regex` generation method #175

rlouf · 2023-07-07T15:44:54Z

This trivially generalizes #166 by defining a Regex class that can be initialized with any regex string. The integer function now instantiates Regex with the appropriate regex string.

General Regex class
Function integer to initialize Regex with a regex that only matches integers
Function float to initialize Regex with a regex that only matches floating-point numbers
Handle EOS tokens for open-ended sequences

brandonwillard

@rlouf, I added the EOS handling and padding.

tests/text/generate/test_integer.py

brandonwillard

FYI: The current error in CI looks like a "determinism" problem involving the numbering/order of the FSM states.

rlouf · 2023-07-10T16:27:52Z

I would like to add a test to define the behavior when there is no possible match in the vocabulary. Then it should be ready to merge.

rlouf · 2023-07-12T13:30:41Z

Now the Regex class raises an exception when the vocabulary does not allow building sequences that match the input regex. Ready for review.

brandonwillard

@rlouf, I just pushed a change that makes the FSM start from the previous state it was in: i.e. it avoids decoding and rerunning the FSM for the entire sequence on each iteration. Tell me if that looks fine; otherwise, everything else looks good to me.
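
For readers following along, here is a minimal sketch of the idea of resuming from the previous FSM state. It assumes a hand-written DFA for `[0-9]+` rather than the FSM objects used in this PR; `TRANSITIONS`, `run`, and `IncrementalMatcher` are hypothetical names made up for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical DFA for the regex [0-9]+ : state 0 is initial, state 1 is final.
TRANSITIONS: Dict[Tuple[int, str], int] = {}
for d in "0123456789":
    TRANSITIONS[(0, d)] = 1
    TRANSITIONS[(1, d)] = 1


def run(state: Optional[int], text: str) -> Optional[int]:
    """Advance the DFA from `state` over `text`; None means no match is possible."""
    for ch in text:
        if state is None:
            return None
        state = TRANSITIONS.get((state, ch))
    return state


class IncrementalMatcher:
    """Remember the last FSM state so only newly generated text is scanned."""

    def __init__(self) -> None:
        self.state: Optional[int] = 0
        self.chars_seen = 0

    def feed(self, new_text: str) -> Optional[int]:
        # Only `new_text` is scanned; earlier text is summarized by `self.state`.
        self.state = run(self.state, new_text)
        self.chars_seen += len(new_text)
        return self.state


matcher = IncrementalMatcher()
pieces = ["12", "3", "45"]
for piece in pieces:
    matcher.feed(piece)

# Same final state as rescanning the whole string from the initial state.
assert matcher.state == run(0, "".join(pieces))
```

Each call to `feed` costs time proportional to the new text only, whereas rescanning costs time proportional to the whole sequence on every iteration.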

brandonwillard · 2023-07-12T16:20:10Z

Agh, looks like the generated token sequences change shape when using transformers!

brandonwillard · 2023-07-13T00:19:16Z

outlines/text/generate/regex.py

+        # TODO: This check might be a little too strict, because I think that
+        # while some states are made unreachable by a vocabulary (and will not
+        # be present in the following set difference), there could still be
+        # paths to terminal states emanating from the states that are reachable.
+        states_with_transition = {x[1] for x in pstate_to_vocab.keys()}
+        if len(self.regex_fsm.states.difference(states_with_transition)) > 0:
+            raise ValueError(
+                "The vocabulary does not allow us to build a sequence that matches the input regex"
+            )


We should look into this. Perhaps as a follow-up issue.

Good point. I opened #184 to track this.
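
One possible direction for the follow-up, sketched under assumptions: instead of requiring every FSM state to have a vocabulary transition, check whether any final state can be reached from the initial state using only the transitions the vocabulary allows. The toy FSM, `vocab_transitions`, and `can_reach_final_state` below are hypothetical and do not reflect the data structures in `regex.py`.

```python
from collections import deque
from typing import Dict, Iterable, Set


def can_reach_final_state(
    initial: int,
    finals: Set[int],
    transitions: Dict[int, Iterable[int]],
) -> bool:
    """Breadth-first search over the transitions the vocabulary allows."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if state in finals:
            return True
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical FSM with states {0, 1, 2, 3}; state 3 is final.
all_states = {0, 1, 2, 3}
# Assumed vocabulary-induced transitions: state 2 never gets one.
vocab_transitions = {0: [1], 1: [3], 3: [3]}

# A strict set-difference check, in the spirit of the one in the diff, flags state 2.
states_with_transition = set(vocab_transitions)
strict_check_fails = len(all_states - states_with_transition) > 0

# The reachability check still finds a path 0 -> 1 -> 3 to the final state.
reachable = can_reach_final_state(0, {3}, vocab_transitions)
```

In this toy case the strict check would raise even though a matching sequence can be built, which is the situation the TODO describes.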

brandonwillard · 2023-07-13T00:20:25Z

outlines/text/masks.py

-    mask = create_mask_from_regex(vocabulary, "^[0-9]+$")
+    mask = create_mask_from_regex(vocabulary, r"(0|[+-]?[1-9][0-9]+?)")

    return mask


 def create_float_mask(vocabulary: Dict[str, int]) -> torch.BoolTensor:
    """Create a mask to generate floating point numbers."""
-    mask = create_mask_from_regex(vocabulary, r"^(([0-9]+)?([.]([0-9]*)?)?|[.][0-9]+)$")
+    mask = create_mask_from_regex(
+        vocabulary, r"([+-]?((0|[1-9]+)([.][0-9]*)?)|([.][0-9]+))"
+    )


The tests were getting a little flaky in CI, so I had to add these updates.

brandonwillard

OK, I think it's good to go now.

rlouf · 2023-07-13T10:31:32Z

outlines/text/generate/regex.py

+                # Get the tokens we haven't already processed
+                readable_tokens = token_seq[last_token_idx:]
+                # excluding any EOS tokens
+                not_eos_mask = [


You should never get a sequence with an EOS token here. Those are filtered out in Sequence.__call__. Is it still worth keeping this check?

If you start it out with a sequence like [[10, 2, 0, 0]], you would only want to process the first two. That's what it should be doing.

When would that happen if you cannot get sequences with 0 by design?

I believe I had to add it for the tests at one point, but that might no longer be true.
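
To make the `[[10, 2, 0, 0]]` case concrete, here is a rough sketch of the filtering being discussed. It assumes that `0` is the EOS/padding token id, as in the example above; `EOS_TOKEN_ID` and `unprocessed_non_eos` are hypothetical names, not part of the PR.

```python
from typing import List

EOS_TOKEN_ID = 0  # assumption: 0 is the EOS/padding id, as in the example above


def unprocessed_non_eos(token_seq: List[int], last_token_idx: int) -> List[int]:
    """Return the tokens from `last_token_idx` onward, dropping EOS tokens."""
    # Get the tokens we haven't already processed
    readable_tokens = token_seq[last_token_idx:]
    # excluding any EOS tokens
    return [tok for tok in readable_tokens if tok != EOS_TOKEN_ID]


batch = [[10, 2, 0, 0]]
processed = [unprocessed_non_eos(seq, 0) for seq in batch]
```

With this input only the first two tokens survive. If finished sequences are always filtered out upstream, the check is a no-op and could be dropped.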

rlouf · 2023-07-13T10:37:41Z

I only have one minor comment regarding EOS tokens. Sequence.__call__ filters finished sequences, and Regex inherits from Continuation which marks a sequence as finished when an EOS token is found. We can open a follow-up issue for this.

rlouf added the text (Linked to text generation) and enhancement labels Jul 7, 2023

brandonwillard force-pushed the regex-generation branch from 434f9aa to 2c72f46 Compare July 7, 2023 20:56

brandonwillard linked an issue Jul 7, 2023 that may be closed by this pull request

Manage EOS in regex-based generation #174

Closed

brandonwillard reviewed Jul 7, 2023

brandonwillard force-pushed the regex-generation branch from 2c72f46 to e1f8c55 Compare July 7, 2023 21:10

brandonwillard reviewed Jul 7, 2023

brandonwillard force-pushed the regex-generation branch from e1f8c55 to a4f1108 Compare July 8, 2023 17:28

brandonwillard mentioned this pull request Jul 10, 2023

Use FSMs for scanning during grammar-guided generation #178

Merged

4 tasks

rlouf force-pushed the regex-generation branch 3 times, most recently from 0864d34 to e8323b2 Compare July 10, 2023 14:03

rlouf linked an issue Jul 11, 2023 that may be closed by this pull request

Add a Float generation method #154

Closed

rlouf force-pushed the regex-generation branch from e8323b2 to 171705d Compare July 12, 2023 13:17

rlouf requested a review from brandonwillard July 12, 2023 13:29

brandonwillard force-pushed the regex-generation branch from 171705d to 85e3fbc Compare July 12, 2023 16:07

brandonwillard previously approved these changes Jul 12, 2023

brandonwillard force-pushed the regex-generation branch 5 times, most recently from 7d0a35c to ec10981 Compare July 13, 2023 00:17

brandonwillard reviewed Jul 13, 2023

brandonwillard added 3 commits July 12, 2023 19:44

Fix float and int masks

968cd8a

Add start_state option to find_partial_matches

7b47189

Add final_state_string option to map_partial_states_to_vocab

d4f5b62

This allows one to add EOS transitions to the partial-parse-state-to-vocabulary maps produced by map_partial_states_to_vocab.

brandonwillard and others added 2 commits July 12, 2023 19:44

Refactor find_partial_matches so that it returns full sequences

39e7302

This refactoring also removed the need for the antecedent mapping option in `map_partial_states_to_vocab`.

Add Regex generation method

ad9d062

brandonwillard force-pushed the regex-generation branch from ec10981 to ad9d062 Compare July 13, 2023 00:59

brandonwillard self-requested a review July 13, 2023 01:00

brandonwillard approved these changes Jul 13, 2023

rlouf commented Jul 13, 2023

rlouf merged commit 34bc2fb into outlines-dev:main Jul 13, 2023
4 checks passed

rlouf deleted the regex-generation branch July 13, 2023 10:38

rlouf mentioned this pull request Jul 13, 2023

Generate a choice between different strings #188

Merged
