Add basic parser-driven masking utilities #131

Merged
merged 2 commits into from
Jul 5, 2023

Conversation

brandonwillard
Contributor

@brandonwillard brandonwillard commented Jun 6, 2023

This PR introduces utilities for parser-driven masking: i.e. a parser is run alongside a language model's (LM) token sampling step and used to filter for grammar-valid next tokens. lark was chosen as the underlying parser library because I have previous experience using it, its design is fairly straightforward, it's pure Python, and it supports arbitrary EBNF-like grammars.
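As a rough illustration of the masking idea, the sketch below filters a tiny vocabulary against a toy "grammar" of balanced parentheses. The names `is_valid_prefix` and `mask_vocabulary` are hypothetical and are not part of this PR or of lark; the hand-written prefix check only stands in for a real parser.

```python
def is_valid_prefix(text):
    """Return True if `text` can still be extended into a balanced-parentheses string."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis with nothing open can never be repaired.
                return False
        else:
            # Any other character is outside the toy grammar.
            return False
    return True


def mask_vocabulary(prefix, vocabulary):
    """Return one boolean per vocabulary entry: True if appending it keeps the prefix valid."""
    return [is_valid_prefix(prefix + token) for token in vocabulary]


vocabulary = ["(", ")", "()", "))", "a"]
mask = mask_vocabulary("(", vocabulary)
allowed = [tok for tok, ok in zip(vocabulary, mask) if ok]
```

In an actual sampling loop, a mask like this would typically be applied to the LM's logits before sampling, so that only grammar-valid tokens keep non-zero probability.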

The current approach is very experimental and is essentially driven by some patches to lark's LALR parser. Those patches allow for incremental additions to a lexed/parsed string, and, more importantly, they allow for partial token matches during scanning. Partial tokens arise when the string being scanned cuts off without completely constructing/finalizing a token.

For example, the source string `"def fo"` would normally be scanned into two tokens, `def` and `fo`; however, if the next LM-sampled token is `"o("`, the complete sampled source string would be `"def foo("`, and the parsed result would be the tokens `def`, `foo`, `(`. These two scan (or parse) results conflict, because the first one ends by assuming that `fo` is a completed name token, but the second result, when scanning starts from the beginning, contains no such token. In other words, we would need to backtrack in order to find the "correct" name token (i.e. `foo`).

The approach currently used in this PR doesn't immediately accept scan results for tokens with ambiguities like that; i.e. the parse state is not advanced past a trailing token that could still be extended by the next sampled text.
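The `"def fo"` example can be sketched with a toy scanner that holds back a trailing token whenever it could still grow. This is a minimal stand-in written with the standard library's `re` module, not the patched lark LALR code from this PR; the terminal names, the `EXTENSIBLE` set, and the `scan`/`feed` helpers are all assumptions made for illustration.

```python
import re

# Toy terminals (hypothetical); a real grammar would define many more.
TOKEN_RE = re.compile(r"(?P<NAME>[A-Za-z_][A-Za-z0-9_]*)|(?P<LPAR>\()|(?P<WS>\s+)")
KEYWORDS = {"def"}
# Terminals whose match could still grow if more text arrived.
EXTENSIBLE = {"NAME", "WS"}


def scan(text):
    """Scan `text` and return (complete_tokens, partial_text)."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        kind = m.lastgroup
        if m.end() == len(text) and kind in EXTENSIBLE:
            # The match touches the end of the input and may be incomplete,
            # so it is held back instead of being committed as a token.
            return tokens, text[pos:]
        if kind == "NAME":
            value = m.group()
            tokens.append(("DEF" if value in KEYWORDS else "NAME", value))
        elif kind != "WS":
            tokens.append((kind, m.group()))
        pos = m.end()
    return tokens, ""


def feed(tokens, partial, chunk):
    """Extend a previous scan result with newly sampled text."""
    new_tokens, new_partial = scan(partial + chunk)
    return tokens + new_tokens, new_partial


tokens, partial = scan("def fo")             # "fo" is held back as a partial token
tokens, partial = feed(tokens, partial, "o(")  # now "foo" and "(" are complete
```

Because only the uncommitted partial text is rescanned, the incremental result matches a scan of `"def foo("` from the beginning without any backtracking over committed tokens.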

@brandonwillard brandonwillard marked this pull request as draft June 6, 2023 05:14
@brandonwillard brandonwillard requested a review from rlouf June 6, 2023 05:14
@brandonwillard brandonwillard added text Linked to text generation enhancement question examples Linked to usage examples labels Jun 6, 2023
@brandonwillard brandonwillard force-pushed the grammar-filtering branch 2 times, most recently from 44c77df to e5e8b22 Compare June 12, 2023 16:44
@brandonwillard brandonwillard force-pushed the grammar-filtering branch 6 times, most recently from b518af8 to 6a4063a Compare June 30, 2023 22:14
@brandonwillard brandonwillard marked this pull request as ready for review July 5, 2023 20:46
@brandonwillard
Contributor Author

The iterative parsing has been usable for a while (e.g. see the full Python demo) and the tools for partial parsing, pre-parsing, and indexing/caching vocabulary tokens are in place. I'll merge this as-is so that we can split the remaining work (e.g. bullet points in the description) into separate issues/PRs and focus on them independently.

2 participants