Parser-driven masking updates for basic "off-line" vocabulary pre-parsing #171

Merged (2 commits) on Jul 6, 2023

Conversation

brandonwillard
Contributor

@brandonwillard brandonwillard commented Jul 5, 2023

This PR picks up from #131 by extending/refactoring the partial parsing functionality so that it can be used more easily to pre-"parse" vocabularies in cases simpler than general context-free grammars (e.g. regex-only masking).

This is a subset of the functionality implied by https://github.com/normal-computing/outlines/issues/170.

@brandonwillard added the text (Linked to text generation), enhancement, and examples (Linked to usage examples) labels on Jul 5, 2023
    )
    with pytest.raises(UnexpectedToken):
        parse_to_end(parser_state)


def test_map_partial_states_to_vocab_regex():
Contributor Author

@brandonwillard brandonwillard Jul 5, 2023

The test test_map_partial_states_to_vocab_regex is a condensed but complete illustration of the partial matching and vocabulary pre-parsing approach. It shows how a partial state index/map is created from a vocabulary, and how that map can be used to efficiently determine the support of the next-token distribution, without parsing each element of the vocabulary and without sampling full sequences from an LM and accepting/rejecting them.
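For readers skimming the thread, the idea in the comment above can be sketched roughly as follows. This is not the code from this PR: the hand-written DFA and the names `TRANSITIONS`, `walk`, and `index_vocabulary` are hypothetical stand-ins for the parser-driven machinery, chosen only to show how a state-to-vocabulary map yields the allowed next tokens by lookup.

```python
from collections import defaultdict

# Hand-written DFA (an illustrative assumption, not the PR's implementation)
# for the float regex (([0-9]+)?([.]([0-9]*)?)?|[.][0-9]+):
#   state 0 = start, 1 = inside the integer part, 2 = after the decimal point.
DIGITS = "0123456789"
TRANSITIONS = {}
for d in DIGITS:
    TRANSITIONS[(0, d)] = 1
    TRANSITIONS[(1, d)] = 1
    TRANSITIONS[(2, d)] = 2
TRANSITIONS[(0, ".")] = 2
TRANSITIONS[(1, ".")] = 2


def walk(state, token):
    """Return the state reached after consuming `token`, or None on a dead end."""
    for ch in token:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return None
    return state


def index_vocabulary(vocabulary, states=(0, 1, 2)):
    """Map each DFA state to {token: next_state} for tokens that stay valid."""
    index = defaultdict(dict)
    for state in states:
        for token in vocabulary:
            end = walk(state, token)
            if end is not None:
                index[state][token] = end
    return dict(index)


# Toy vocabulary; a real tokenizer vocabulary would be much larger.
vocabulary = ["1", ".", ".2", "12", "1.5", "a", "..", "3."]
index = index_vocabulary(vocabulary)

# During generation only the current state is tracked; the support of the
# next-token distribution is a dictionary lookup, not a fresh parse.
state = 0
state = index[state]["3."]          # the model emitted "3."
allowed_next = sorted(index.get(state, {}))
```

The vocabulary is scanned once per state up front, so the per-step cost during sampling no longer depends on re-matching every token.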

return False
return True

pstate_to_vocab = map_partial_states_to_vocab(
Member

@rlouf rlouf Jul 6, 2023

We should add a function that maps the states to masks over the vocabulary, so that these are not re-computed at each step (provided the memory footprint is not too large). But it can be done in a later PR when needed, #166 for instance.
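A rough sketch of what such a helper might look like, assuming the state-to-vocabulary map has already been reduced to sets of token ids. The names `build_state_masks` and `apply_mask` are hypothetical and not part of this PR; plain lists are used instead of tensors to keep the example dependency-free.

```python
import math


def build_state_masks(state_to_token_ids, vocab_size):
    """Precompute one additive logit mask per state (0.0 allowed, -inf forbidden)."""
    masks = {}
    for state, token_ids in state_to_token_ids.items():
        mask = [-math.inf] * vocab_size
        for token_id in token_ids:
            mask[token_id] = 0.0
        masks[state] = mask
    return masks


def apply_mask(logits, mask):
    """Add a precomputed mask to the logits; forbidden tokens become -inf."""
    return [logit + m for logit, m in zip(logits, mask)]


# Hypothetical map from parser states to allowed token ids.
state_to_token_ids = {0: {0, 1, 3}, 1: {0, 1}, 2: {0}}
masks = build_state_masks(state_to_token_ids, vocab_size=4)

masked = apply_mask([0.5, 1.0, 2.0, -1.0], masks[2])
```

The memory footprint mentioned above is roughly (number of states) x (vocabulary size) entries, which is why building masks lazily or storing them sparsely may be preferable for large vocabularies.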



def test_map_partial_states_to_vocab_regex():
    regex_string = r"(([0-9]+)?([.]([0-9]*)?)?|[.][0-9]+)"
Member

More of a reminder for myself later: this regex matches floats with leading zeros, e.g. "01.34". The following regex forbids them while still accepting single nonzero digits: ^((0|[1-9][0-9]*)?([.]([0-9]*)?)?|[.][0-9]+)$
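The difference is easy to check with Python's `re.fullmatch`. In this sketch the strict pattern writes the integer part as `0|[1-9][0-9]*`, which is an assumption about the intended behavior (a single nonzero digit such as "5" should still match); the sample strings are illustrative.

```python
import re

# Pattern used in the test: allows leading zeros in the integer part.
permissive = re.compile(r"(([0-9]+)?([.]([0-9]*)?)?|[.][0-9]+)")

# Stricter variant (assumed intent): the integer part is "0" or starts with 1-9.
strict = re.compile(r"((0|[1-9][0-9]*)?([.]([0-9]*)?)?|[.][0-9]+)")

samples = ["01.34", "1.34", "0.5", "5", ".5", "10"]

# Map each sample to (matches permissive, matches strict).
results = {
    s: (bool(permissive.fullmatch(s)), bool(strict.fullmatch(s)))
    for s in samples
}

for sample, (loose_ok, strict_ok) in results.items():
    print(f"{sample!r}: permissive={loose_ok} strict={strict_ok}")
```

`fullmatch` stands in for the explicit `^...$` anchors, so both patterns are compared on whole strings only.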

Member

rlouf commented Jul 6, 2023

I only had a couple comments that don't necessarily need to be addressed here. LGTM.

@rlouf rlouf merged commit c855860 into outlines-dev:main Jul 6, 2023
4 checks passed

2 participants