Parser-driven masking updates for basic "off-line" vocabulary pre-parsing #171

brandonwillard · 2023-07-05T22:16:44Z

This PR picks up from #131 by extending/refactoring the partial parsing functionality so that it can be used more easily to pre-"parse" vocabularies in cases simpler than general context-free grammars (e.g. regex-only masking).

This is a subset of the functionality implied by https://github.com/normal-computing/outlines/issues/170.

brandonwillard · 2023-07-05T22:52:34Z

tests/text/test_parsing.py

    )
    with pytest.raises(UnexpectedToken):
        parse_to_end(parser_state)
+
+
+def test_map_partial_states_to_vocab_regex():


The test test_map_partial_states_to_vocab_regex is a condensed and complete illustration of the partial matching and vocabulary pre-parsing approach. It shows how a partial state index/map is created from a vocabulary and how it can be used to efficiently determine the support of the next-token distribution—without parsing each element of the vocabulary or sampling full sequences from an LM and accepting/rejecting them.
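To make the idea concrete outside of the test, here is a minimal, self-contained sketch of the same approach. It does not use the helpers from this PR: the two-state DFA is written by hand (it accepts the same strings as the regex in the test, i.e. optional digits followed by an optional dot and more digits), and the names `TRANSITIONS`, `walk`, `build_state_to_vocab`, and `allowed_next_tokens` are hypothetical, chosen only for illustration. The toy vocabulary is also made up.

```python
from collections import defaultdict

DIGITS = "0123456789"

# Hand-written DFA (an assumption for this sketch, not generated by the PR code).
# State 0: reading the integer part; state 1: after the decimal point.
# Both states are accepting; a missing transition means "dead".
TRANSITIONS = {
    0: {**{d: 0 for d in DIGITS}, ".": 1},
    1: {d: 1 for d in DIGITS},
}


def walk(state, text):
    """Return the state reached after consuming `text` from `state`, or None if it dies."""
    for char in text:
        state = TRANSITIONS[state].get(char)
        if state is None:
            return None
    return state


def build_state_to_vocab(vocabulary):
    """Map each DFA state to the set of vocabulary indices that can be consumed from it.

    This is the "off-line" step: it runs once per (grammar, vocabulary) pair.
    """
    index = defaultdict(set)
    for token_id, token in enumerate(vocabulary):
        for state in TRANSITIONS:
            if token and walk(state, token) is not None:
                index[state].add(token_id)
    return dict(index)


# Toy vocabulary; a real tokenizer vocabulary would be much larger.
vocabulary = ["1", ".", ".2", "34", "a", "1.5", ".."]
state_to_vocab = build_state_to_vocab(vocabulary)


def allowed_next_tokens(prefix):
    """Look up the support of the next-token distribution for an already generated prefix."""
    state = walk(0, prefix)
    if state is None:
        return set()
    return state_to_vocab.get(state, set())


print(state_to_vocab)
print(sorted(vocabulary[i] for i in allowed_next_tokens("12.")))
```

At generation time only the current state has to be tracked, so finding the allowed tokens is a dictionary lookup rather than a pass over the whole vocabulary.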

rlouf · 2023-07-06T12:34:42Z

tests/text/test_parsing.py

+            return False
+        return True
+
+    pstate_to_vocab = map_partial_states_to_vocab(


We should add a function that maps the states to masks over the vocabulary, so that the masks are not recomputed at each step (provided the memory footprint is not too large). This can be done in a later PR when needed, #166 for instance.
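A rough sketch of what such a function could look like, assuming the state-to-vocabulary map is a plain dict from state to a set of token ids. The names `build_state_masks` and `apply_mask`, the example map, and the vocabulary size are all hypothetical; plain lists stand in for whatever array type would actually be used.

```python
import math


def build_state_masks(state_to_vocab, vocab_size):
    """Precompute one boolean mask per state: True where the token id is allowed."""
    return {
        state: [token_id in token_ids for token_id in range(vocab_size)]
        for state, token_ids in state_to_vocab.items()
    }


def apply_mask(logits, mask):
    """Set the logits of disallowed tokens to -inf so they get zero probability."""
    return [logit if keep else -math.inf for logit, keep in zip(logits, mask)]


# Hypothetical state -> token ids map for a 7-token vocabulary.
example_state_to_vocab = {0: {0, 1, 2, 3, 5}, 1: {0, 3}}
masks = build_state_masks(example_state_to_vocab, vocab_size=7)

# At each generation step the mask is looked up, not rebuilt.
masked_logits = apply_mask([0.5] * 7, masks[1])
print(masked_logits)
```

Memory grows as the number of states times the vocabulary size, which is the footprint concern mentioned above.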

rlouf · 2023-07-06T12:44:39Z

tests/text/test_parsing.py

+
+
+def test_map_partial_states_to_vocab_regex():
+    regex_string = r"(([0-9]+)?([.]([0-9]*)?)?|[.][0-9]+)"


More of a reminder for myself later: this regex matches floats with leading zeros, e.g. "01.34". The following regex forbids them: ^((0|[1-9][0-9]*)?([.]([0-9]*)?)?|[.][0-9]+)$
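A quick check of both patterns with Python's `re.fullmatch`, as a sketch. The variable names are made up for illustration. The stricter pattern here uses `[1-9][0-9]*` for the integer part so that a single non-zero digit such as "5" still matches; with `[1-9][0-9]+` in that position it would not, which the last variant demonstrates.

```python
import re

# Regex from the test in this PR.
pr_regex = re.compile(r"(([0-9]+)?([.]([0-9]*)?)?|[.][0-9]+)")

# Stricter variant that forbids leading zeros in the integer part.
strict_regex = re.compile(r"((0|[1-9][0-9]*)?([.]([0-9]*)?)?|[.][0-9]+)")

# Same as above but with `+` instead of `*`, shown only for comparison.
plus_variant = re.compile(r"((0|[1-9][0-9]+)?([.]([0-9]*)?)?|[.][0-9]+)")

for text in ["01.34", "1.34", "0.5", ".5", "5", "10", "007"]:
    print(
        f"{text!r}: pr={bool(pr_regex.fullmatch(text))} "
        f"strict={bool(strict_regex.fullmatch(text))} "
        f"plus={bool(plus_variant.fullmatch(text))}"
    )
```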

rlouf · 2023-07-06T12:46:49Z

I only had a couple of comments that don't necessarily need to be addressed here. LGTM.

Minor typing, imports, and name refactoring

5c1b3a6

brandonwillard assigned brandonwillard and rlouf Jul 5, 2023

brandonwillard added the labels text (Linked to text generation), enhancement, and examples (Linked to usage examples) Jul 5, 2023

Make map_partial_states_to_vocab return vocab indices and filter matches

7011735

brandonwillard force-pushed the basic-preparsing-updates branch from 14257c7 to 7011735 July 5, 2023 22:43

brandonwillard commented Jul 5, 2023

rlouf reviewed Jul 6, 2023

rlouf approved these changes Jul 6, 2023

rlouf merged commit c855860 into outlines-dev:main Jul 6, 2023
4 checks passed
