Add basic parser-driven masking utilities #131

brandonwillard · 2023-06-06T05:13:52Z

This PR introduces utilities for parser-driven masking: a parser runs alongside a language model's (LM) token sampling step and is used to filter the vocabulary down to grammar-valid next tokens. lark was chosen as the underlying parser library because I have previous experience using it, its design is fairly straightforward, it's pure Python, and it supports arbitrary EBNF-like grammars.
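To make the masking idea concrete, here is a minimal sketch that does not use lark or any code from this PR. It assumes a toy language of integers joined by "+" and a hypothetical helper, is_valid_prefix, that stands in for a real incremental parser; the vocabulary is also made up for illustration.

```python
import re

# Hypothetical stand-in for a parser: matches every string that can still be
# extended into INT ("+" INT)*, i.e. every valid prefix of that toy language.
_VALID_PREFIX = re.compile(r"(\d+\+)*\d*")


def is_valid_prefix(text):
    """Return True if `text` could be the start of a grammar-valid string."""
    return _VALID_PREFIX.fullmatch(text) is not None


def mask_vocabulary(prefix, vocabulary):
    """Return one boolean per vocabulary token: True if sampling it keeps
    the generated text grammar-valid."""
    return [is_valid_prefix(prefix + token) for token in vocabulary]


# A made-up vocabulary of multi-character LM tokens.
vocabulary = ["1", "23", "+", "+4", "++", "a", ")"]

mask = mask_vocabulary("12", vocabulary)
allowed = [token for token, ok in zip(vocabulary, mask) if ok]
```

In a real sampler the boolean mask would be applied to the LM's logits before sampling, so that only the allowed tokens keep non-zero probability.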

The current approach is very experimental and is essentially driven by some patches to lark's LALR parser. Those patches allow for incremental additions to a lexed/parsed string, and, more importantly, they allow for partial token matches during scanning. Partial tokens arise when the string being scanned ends before a token has been completely constructed/finalized.

For example, the source string "def fo" would normally be scanned into two tokens, def and fo; however, if the next LM-sampled token is "o(", the complete sampled source string would be "def foo(", and the scan result would be the tokens def, foo, and (. These two scan (or parse) results conflict, because the first one ends by assuming that fo is a completed name token, but the second result, when scanned from the beginning, contains no such token. In other words, we would need to backtrack in order to find the "correct" name token (i.e. foo).
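The conflict can be reproduced with a deliberately naive scanner. This is only an illustrative sketch with made-up terminals (names, parentheses, and a colon), not the scanner used by lark or by this PR.

```python
import re

# Hypothetical terminals: identifiers/keywords, parentheses, and a colon.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\(|\)|:")


def naive_scan(text):
    """Scan `text` from the beginning and treat every match as final."""
    return TOKEN_RE.findall(text)


first = naive_scan("def fo")
second = naive_scan("def foo(")

# If scanning were safely incremental, the earlier result would be a prefix
# of the later one. It is not, because "fo" was finalized too early.
is_prefix = second[: len(first)] == first
```

Here first is ['def', 'fo'] and second is ['def', 'foo', '('], so is_prefix is False: the only way to recover from the first result is to discard fo and rescan.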

The approach currently used in this PR doesn't immediately accept scan results for tokens with that kind of ambiguity; i.e. it doesn't advance the parse state on them right away.
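One way to picture deferred acceptance is a scanner that holds back any token touching the end of the input. The sketch below is an assumption-laden illustration, not the patched lark scanner: the terminals, the KEYWORDS table, and the scan function are all hypothetical, and the rule is deliberately conservative (it also holds back tokens such as "(" that could not actually grow).

```python
import re

# Hypothetical toy terminals; a real grammar would define many more.
TOKEN_RE = re.compile(r"(?P<NAME>[A-Za-z_]\w*)|(?P<LPAR>\()|(?P<WS>\s+)")
KEYWORDS = {"def": "DEF"}


def scan(text):
    """Return (accepted_tokens, pending_text).

    A token that touches the end of `text` is not accepted, because more
    characters might still extend it; its text is returned as pending.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"no terminal matches at position {pos}")
        if match.end() == len(text):
            # Possibly partial token: hold it back instead of accepting it.
            return tokens, text[pos:]
        if match.lastgroup != "WS":
            kind = KEYWORDS.get(match.group(), match.lastgroup)
            tokens.append((kind, match.group()))
        pos = match.end()
    return tokens, ""


# First LM step: "fo" is held back rather than accepted as a name token.
accepted, pending = scan("def fo")

# Next LM step samples "o(": rescan only the pending text plus the new text.
more, pending2 = scan(pending + "o(")
accepted += more
```

After both steps, accepted holds the def and foo tokens and nothing had to be undone, since fo was never accepted in the first place.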

brandonwillard · 2023-07-05T20:50:46Z

The iterative parsing has been usable for a while (e.g. see the full Python demo) and the tools for partial parsing, pre-parsing, and indexing/caching vocabulary tokens are in place. I'll merge this as-is so that we can split the remaining work (e.g. bullet points in the description) into separate issues/PRs and focus on them independently.

brandonwillard marked this pull request as draft June 6, 2023 05:14

brandonwillard requested a review from rlouf June 6, 2023 05:14

brandonwillard added the text (Linked to text generation), enhancement, question, and examples (Linked to usage examples) labels Jun 6, 2023

brandonwillard force-pushed the grammar-filtering branch 2 times, most recently from 44c77df to e5e8b22 Compare June 12, 2023 16:44

brandonwillard mentioned this pull request Jun 22, 2023

Add a Float generation method #154

Closed

brandonwillard force-pushed the grammar-filtering branch from e5e8b22 to 84d46ee Compare June 22, 2023 20:00

rlouf mentioned this pull request Jun 29, 2023

Add Integer sequence generator #166

Merged

brandonwillard force-pushed the grammar-filtering branch 6 times, most recently from b518af8 to 6a4063a Compare June 30, 2023 22:14

brandonwillard added 2 commits July 3, 2023 13:45

Add basic parser-driven masking utilities

acbe240

Add vocabulary pre-parsing tools

44c2996

brandonwillard force-pushed the grammar-filtering branch from 6a4063a to 44c2996 Compare July 3, 2023 18:45

rlouf approved these changes Jul 5, 2023

brandonwillard marked this pull request as ready for review July 5, 2023 20:46

brandonwillard merged commit a034c78 into outlines-dev:main Jul 5, 2023
4 checks passed

brandonwillard deleted the grammar-filtering branch July 5, 2023 20:51

This was referenced Jul 5, 2023

Create a parser demo that uses the new Sequence functionality. #169

Closed

Parser-driven masking updates for basic "off-line" vocabulary pre-parsing #171

Merged

brandonwillard mentioned this pull request Jul 22, 2023

llama : add grammar-based sampling ggerganov/llama.cpp#1773

Merged
