Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent Illegal Look-Around for OneOf in JSONSchema #897

Merged
merged 2 commits into from
May 17, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 1 addition & 7 deletions outlines/fsm/json_schema.py
Original file line number Diff line number Diff line change
Expand Up @@ -195,13 +195,7 @@ def to_regex(
to_regex(resolver, t, whitespace_pattern) for t in instance["oneOf"]
]

xor_patterns = []
# json schema validation ensured there is no overlapping schemas in oneOf
for subregex in subregexes:
other_subregexes = filter(lambda r: r != subregex, subregexes)
other_subregexes_str = "|".join([f"{s}" for s in other_subregexes])
negative_lookahead = f"(?!.*({other_subregexes_str}))"
xor_patterns.append(f"({subregex}){negative_lookahead}")
xor_patterns = [f"(?:{subregex})" for subregex in subregexes]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lapp0 what is the difference bw oneOf and anyOf in this script? Both concat the subregex with |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you also think additional conditions like just checking for ^ and just checking for $ needs to be added to this conditon?

if pattern[0] == "^" and pattern[-1] == "$":
return rf'(^"{pattern[1:-1]}"$)'

Copy link

@ekagra-ranjan ekagra-ranjan May 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And how about adding array along with object to the types to ignore here?

regexes = [
to_regex(resolver, {"type": t}, whitespace_pattern)
for t in instance_type
if t != "object"
Array like object requires many more parameters like items which is not provided in the list of types which is the reason why object is ignored I believe since properties cannot be mentioned when type is given as a list

Copy link
Collaborator Author

@lapp0 lapp0 May 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ekagra-ranjan actually reviewing the test cases, I think anyOf is wrong. I'd appreciate a second set of eyes on this. Would you agree that

oneOf is correct (uses exclusive or), guarantees only one match

>>> print(re.fullmatch(r"((?:foo)|(?:bar))", "foo"))
<re.Match object; span=(0, 3), match='foo'>
>>> print(re.fullmatch(r"((?:foo)|(?:bar))", "foobar"))
None

anyOf is incorrect, allows only one match (I think this regression was introduced in https://github.com/outlines-dev/outlines/pull/552/files)

>>> print(re.fullmatch(r"((foo)|(bar))", "foo"))
<re.Match object; span=(0, 3), match='foo'>
>>> print(re.fullmatch(r"((foo)|(bar))", "foobar"))
None

allOf is somewhat correct, requires generation of all schemas, but enforces an order which it shouldn't:

    elif "allOf" in instance:
        subregexes = [
            to_regex(resolver, t, whitespace_pattern) for t in instance["allOf"]
        ]
        subregexes_str = [f"{subregex}" for subregex in subregexes]
        return rf"({''.join(subregexes_str)})"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For your other two comments, I appreciate your insights into json schema -> regex. It's definitely a work in progress and it's great that you're looking deep into it's behavior and identifying problems. I think it's out of scope with this PR. Could you create a separate issue so others can provide input as well?


return rf"({'|'.join(xor_patterns)})"

Expand Down
31 changes: 28 additions & 3 deletions tests/fsm/test_json_schema.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,10 @@
import json
import re
from typing import List
from typing import List, Literal, Union

import interegular
import pytest
from pydantic import BaseModel, constr
from pydantic import BaseModel, Field, constr

from outlines.fsm.json_schema import (
BOOLEAN,
Expand Down Expand Up @@ -321,7 +322,7 @@ def test_match_number(pattern, does_match):
"title": "Foo",
"oneOf": [{"type": "string"}, {"type": "number"}, {"type": "boolean"}],
},
rf"(({STRING})(?!.*({NUMBER}|{BOOLEAN}))|({NUMBER})(?!.*({STRING}|{BOOLEAN}))|({BOOLEAN})(?!.*({STRING}|{NUMBER})))",
rf'((?:"{STRING_INNER}*")|(?:{NUMBER})|(?:{BOOLEAN}))',
[
("12.3", True),
("true", True),
Expand Down Expand Up @@ -750,3 +751,27 @@ class MockModel(BaseModel):
assert match_default_ws is None

assert re.fullmatch(pattern, mock_result_maybe_ws)


def test_one_of_doesnt_produce_illegal_lookaround():
"""Reproduces failure in https://github.com/outlines-dev/outlines/issues/823"""

class Cat(BaseModel):
pet_type: Literal["cat"]
meows: int

class Dog(BaseModel):
pet_type: Literal["dog"]
barks: float

class Model(BaseModel):
pet: Union[Cat, Dog] = Field(..., discriminator="pet_type")
n: int

json_schema = Model.schema_json()

json_schema = Model.schema_json()
pattern = build_regex_from_schema(json_schema, whitespace_pattern=None)

# check if the pattern uses lookarounds incompatible with interegular.Pattern.to_fsm()
interegular.parse_pattern(pattern).to_fsm()
Loading