Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle regexps with lookaround #456

Open
benlipkin opened this issue Dec 20, 2023 · 7 comments
Open

Handle regexps with lookaround #456

benlipkin opened this issue Dec 20, 2023 · 7 comments
Labels
enhancement structured generation Linked to structured generation

Comments

@benlipkin
Copy link
Contributor

What behavior of the library made you think about the improvement?

Outlines currently uses interegular to compile regexps to FSMs, but not all features of python regex are supported, in particular lookarounds.

I ran into this issue when trying to test the following grammar:

json_grammar = r"""
    ?start: value

    ?value: object
          | array
          | string
          | SIGNED_NUMBER      -> number
          | "true"             -> true
          | "false"            -> false
          | "null"             -> null

    array  : "[" [value ("," value)*] "]"
    object : "{" [pair ("," pair)*] "}"
    pair   : string ":" value

    string : ESCAPED_STRING

    %import common.ESCAPED_STRING
    %import common.SIGNED_NUMBER
    %import common.WS

    %ignore WS
"""

In the CFGFSM constructor, terminal patterns are compiled to regexps so that they can later be used to initialize RegexFSM proposals, e.g., RegexFSM(terminal.pattern.to_regexp()). When the supplied regex includes lookarounds, we get an error such as the following:

interegular.patterns.Unsupported: lookbacks are not implemented

How would you like it to behave?

It would seem there are at least a couple options here:

  1. Work with interegular to expand support to lookarounds. This is on the TODOs outlined in their README, and perhaps could use a push.
  2. Compile regex with lookaround to regex without lookaround? In theory, lookaround does not change the expressive power and this is all closed within regular languages. However, I do not know if any (efficient) algorithm exists to make this conversion (and in some cases the resulting regexp might be quite complex). Maybe the authors know more about this.
@brandonwillard brandonwillard added the structured generation Linked to structured generation label Dec 20, 2023
@amir-in-a-cynch
Copy link

+1, getting similar issues with even a subset of json that's flat key/value pairs only.

"""
start: dict

dict : "{" [pair ("," pair)*] "}"
pair : ESCAPED_STRING ":" ESCAPED_STRING

%import common.ESCAPED_STRING
%import common.SIGNED_NUMBER
%import common.WS
%ignore WS

"""

This should be high priority in my opinion, for supporting a realistic set of grammars.

@benlipkin
Copy link
Contributor Author

As a quick workaround for those interested particularly in JSON, here's what I've been doing.

The unsupported rule from the above grammar is the Lark implementation of ESCAPED_STRING. I just rewrote this for JSON specifically, which only requires double quotes to be escaped so was a pretty quick rewrite. My new implementation is as follows:

json_grammar = r"""
    ?start: value

    ?value: object
          | array
          | string
          | SIGNED_NUMBER      -> number
          | "true"             -> true
          | "false"            -> false
          | "null"             -> null

    array  : "[" [value ("," value)*] "]"
    object : "{" [pair ("," pair)*] "}"
    pair   : string ":" value

    inner: /([^"]|\\\")+/ | 
    string : "\"" inner "\""

    %import common.SIGNED_NUMBER
    %import common.WS

    %ignore WS
"""

As a summary, string : "\"" inner "\"" ensures any string is captured by quotes, and inner: /([^"]|\\\")+/ | supports empty strings, non-empty strings without double quotes, or double quotes only if preceded by a backslash.

This does not globally solve the issue, and this broader problem related to lookarounds remains open, but hopefully this can help folks who are running into this in the meantime. For other languages requiring other characters to be escaped for their string implementations, this approach can be quickly adjusted as well.

@brandonwillard
Copy link
Contributor

Yeah, we could at the very least offer replacements for the common.* definitions that don't unnecessarily use look-arounds and other non-regular features.

@lapp0
Copy link
Collaborator

lapp0 commented Jan 7, 2024

I don't see any non-regular examples in common.lark other than ESCAPED_STRING and C_COMMENT. https://github.com/lark-parser/lark/blob/ab11ead8316978d1d36f35adb6fc88bdae43c2c4/lark/grammars/common.lark

@lapp0 lapp0 mentioned this issue Jan 19, 2024
5 tasks
@lapp0
Copy link
Collaborator

lapp0 commented Jan 21, 2024

interegular uses a finite state machine which cannot handle non-regular features such as look-arounds.

We could convert non-regular lark terminals to lark rules with multiple regular lark terminals programmatically at runtime, but this seems like overkill. I think we should just document the fact that "Outlines-lark" is a subset of lark which disallows non-regular terminals.

Another problem I've come across experimenting with the python.lark grammar is that Lark currently compiles some definitions into regexp with lookarounds https://github.com/lark-parser/lark/blob/master/lark/load_grammar.py#L631-L655

We either need to monkeypatch TerminalTreeToPattern into a no-op, or submit a patch to Lark.

Concretely, the following lark definition

HEX_NUMBER.2: "0" ("x" | "X") ("_"? ("0".."9" | "a".."f" | "A".."F"))+

is compiled into

0(?:x|X)(?:(?:_)?(?:[0-9]|[a-f]|[A-F]))+

@benlipkin
Copy link
Contributor Author

@lapp0 , I don't think that compiling to DFA precludes the handling of lookaround (from a theoretical perspective). Other features of python regex like backreference are indeed not regular, but lookaround is a pragmatic add-on that doesn't change expressive power. That being said, it doesn't appear trivial to "desugar" in practice, and many regex engines actually do implement lookaround with backtracking (even though it's not theoretically necessary).

Some community discussions I've found on this topic:

It also looks like there's some contemporary work on actually implementing non-backtracking regex engines that can support lookaround, but what I've found so far uses a different symbolic derivative-based formalism, rather than the automata-based formalism of interegular, so not sure how much is trivially translatable.

All this to say, this isn't a theoretical limitation. I just don't know enough to approach this problem in practice though.

I think your suggestion of avoiding lookaround, providing alternatives for common grammars, and attempting to minimize their construction during lark compilation, seems like probably the best value strategy that should get things most of the way there for real use cases.

Thanks to the maintainers for continuing to work on this issue!

@lapp0
Copy link
Collaborator

lapp0 commented Jan 21, 2024

Thanks for the info @benlipkin, glad to see I was mistaken and look-arounds are allowed within a regular language.

I've tried to integrate some common lark grammars into Outlines for #562 however most of them explicitly or implicitly result in terminals with look-arounds. I'll look into how I might integrate lookarounds into interegular.

rlouf pushed a commit that referenced this issue May 17, 2024
Fixes #823

This comment details the issues error:
#823 (comment)

The reproduction code provided results in a json schema with
`OneOf[pets]`:

```
class Model(BaseModel):
    pet: Union[Cat, Dog] = Field(..., discriminator='pet_type')
```

Before this PR: `OneOf` uses negative lookaheads to assert that only one
schema member is included. This is illegal in `interegular`, more
details available here:
#456

After `OneOf` uses or-joined non-capturing groups which don't have the
same issues with `interegular`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement structured generation Linked to structured generation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants