Fix Infinite Repetition in JSON Schemas Using Integer and String #1154

Draft
wants to merge 1 commit into
base: main

lapp0
Contributor

@lapp0 lapp0 commented Sep 16, 2024

Overview

Language models are prone to repetition. When a pattern allows infinite-length fields, that repetition can result in broken JSON Schema outputs.

This was previously addressed for the infinite-whitespace issue by making a safe whitespace pattern the default. This PR extends the same safety to Integer and String patterns.

Behavior

json_schema.to_regex now accepts a keyword argument, safe_subset, which defaults to True.

safe_subset=False

  • Whitespace: r"[\n\t ]*"
  • Integer: any number
  • String: any string

safe_subset=True (default)

  • Whitespace: r"[ ]?"
  • Integer: (-1e19, 1e19)
  • String: any string of 0 to 256 characters
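
To make the safe_subset=True bounds concrete, here is a minimal sketch of what bounded patterns of this kind could look like. It is not the PR's implementation: STRING_INNER below is a simplified stand-in for the library's constant, and the helper names bounded_string_pattern and bounded_integer_pattern are hypothetical. The integer bound is expressed as "at most 19 digits", which is one way to stay strictly inside (-1e19, 1e19).

```python
import re

# Assumption: a simplified stand-in for the library's string-character pattern.
# It matches one ordinary character (no quote, backslash, or control character)
# or an escaped quote/backslash.
STRING_INNER = r'(?:[^"\\\x00-\x1f]|\\["\\])'

# The safe whitespace pattern quoted in the PR description.
SAFE_WHITESPACE = r"[ ]?"


def bounded_string_pattern(min_length=0, max_length=256):
    """Hypothetical helper: a JSON string with a bounded number of characters."""
    return f'"{STRING_INNER}{{{min_length},{max_length}}}"'


def bounded_integer_pattern(max_digits=19):
    """Hypothetical helper: an integer with at most `max_digits` digits.

    With max_digits=19 every match has magnitude below 1e19.
    """
    return rf"-?(?:0|[1-9][0-9]{{0,{max_digits - 1}}})"


string_re = re.compile(bounded_string_pattern())
integer_re = re.compile(bounded_integer_pattern())

# A 256-character string is accepted, a 257-character one is rejected.
ok_string = string_re.fullmatch('"' + "a" * 256 + '"') is not None
too_long = string_re.fullmatch('"' + "a" * 257 + '"') is not None

# A 19-digit integer is accepted, a 20-digit one (1e19) is rejected.
ok_integer = integer_re.fullmatch("9" * 19) is not None
too_big = integer_re.fullmatch("1" + "0" * 19) is not None
```

Because both patterns have a finite maximum length, a model stuck in a repetition loop is forced to close the field once the bound is reached.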

Fixes

Safe Integer

Safe String

Further Work

@lapp0 lapp0 force-pushed the fix-json-schema branch 3 times, most recently from 25cb1c1 to db309ef Compare September 16, 2024 04:28
@lapp0 lapp0 marked this pull request as ready for review September 16, 2024 04:31
return f'"{STRING_INNER}{{{min_length},{max_length}}}"'


def build_regex_from_schema(
Contributor Author

This should probably be at the top. We might also consider splitting the file into multiple modules so it's easier to read.

Member

Yes, I generally like to move helper functions to the end of the file.

@lapp0 lapp0 force-pushed the fix-json-schema branch 2 times, most recently from cb970c3 to 985dc9c Compare September 16, 2024 20:37
@rlouf
Member

rlouf commented Sep 17, 2024

@cpfiffer fyi

@lapp0
Contributor Author

lapp0 commented Sep 17, 2024

We might want to hold off on this one, actually. I did some profiling on get_str_pattern. Length-constrained strings produce an FSM with many states and take a long time to compile.

>>> len(interegular.parse_pattern(STRING_INNER + "{,256}").to_fsm().reduce().states)
513
>>> len(interegular.parse_pattern(STRING_INNER + "*").to_fsm().reduce().states)
2

The better alternative is to

Alternatively we could reduce the size of safe_subset str to something like 20 instead of 256.
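
One plausible accounting for the state counts above, offered as a sketch rather than a description of what interegular does internally: a bounded repetition needs one state per number of characters consumed, plus an intermediate "saw a backslash" state for each count that can still accept an escaped character. The function count_states and the (consumed, in_escape) state encoding are hypothetical; the assumption is that the string-character pattern is either a single character or a two-character escape.

```python
def count_states(max_length=None):
    """Count states of a hand-built automaton for a repeated string character.

    Each state is a pair (chars_consumed, in_escape). This is an illustrative
    model, not interegular's internal representation.
    """
    if max_length is None:
        # Unbounded repetition: no counter is needed, only the escape flag.
        return len({(0, False), (0, True)})

    states = set()
    for consumed in range(max_length + 1):
        states.add((consumed, False))
        if consumed < max_length:
            # A backslash was read and the escaped character is still pending.
            states.add((consumed, True))
    return len(states)


unbounded = count_states()      # 2
bounded_256 = count_states(256)  # 513
bounded_20 = count_states(20)    # 41
```

Under this model the state count grows as 2 * max_length + 1, which agrees with the 513 and 2 reported above and suggests a limit of 20 would need about 41 states.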

Let me know if this makes sense.

@lapp0 lapp0 marked this pull request as draft September 17, 2024 17:49