Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix invalid regex in unconstrained arrays for json_schema.py #919

Merged
merged 1 commit into from
May 27, 2024

Conversation

lapp0
Copy link
Collaborator

@lapp0 lapp0 commented May 25, 2024

Fixes #913

Enables complete version of PR #914 which allows arrays

Problems

  • Unconstrained arrays (arrays without any items types specified) would always result in an illegal regex
  • Unconstrained arrays couldn't contain objects or arrays

Change

  • Fix Unconstrained array pattern so it doesn't produce an illegal pattern
  • Allow objects and arrays within arrays, but use "mock" depth key to prevent infinite recursion

@rlouf rlouf merged commit 538f77a into outlines-dev:main May 27, 2024
5 checks passed
regexes = [
to_regex(resolver, t, whitespace_pattern) for t in legal_types
]
return rf"\[{whitespace_pattern}({'|'.join(regexes)})(,{whitespace_pattern}({'|'.join(regexes)})){num_repeats}{allow_empty}{whitespace_pattern}\]"
Copy link

@ekagra-ranjan ekagra-ranjan Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lapp0 - Right now the array will have at least one element. I think there is missing a parentheses to denote that we want the entire array to be optional and not just the elements which start with , i.e.

rf"\[({whitespace_pattern}({'|'.join(regexes)})(,{whitespace_pattern}({'|'.join(regexes)})){num_repeats}){allow_empty}{whitespace_pattern}\]"

Note the starting bracket after rf"\[ and closing bracket before {allow_empty}.
Wdyt?

Copy link
Collaborator Author

@lapp0 lapp0 Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored a lot of this code so we can do constrained yaml. It's a lot more liberal with parenthesis use. Do you think this makes sense to fix the problem?

    def format_object_with_additional_properties(self, value_pattern, min_properties=None, max_properties=None):
        if max_properties == 0:
            return rf"\{{{self.ws}\}}"

        inner = self._regex_repeat_elem(
            elem_pattern=f"{STRING}{self.ws}:{self.ws}{value_pattern}",
            separator_pattern=f"{self.ws},{self.ws}",
            min_elem=min_properties,
            max_elem=max_properties,
            pad=self.ws
        )
        return rf'{{{inner}}}'
    def _regex_repeat_elem(self, elem_pattern, separator_pattern, min_elem=None, max_elem=None, pad=""):
        """Creates a pattern allowing between min_elem and max_elem occurrences of elem_pattern"""
        if max_elem == 0:
            return ""

        base_pattern = f"{elem_pattern}"
        suffix_pattern = f"(?:{separator_pattern}{elem_pattern})"

        min_suffix_repeats = "" if min_elem is None else max(0, int(min_elem) - 1)
        max_suffix_repeats = "" if max_elem is None else max_elem - 1

        if max_suffix_repeats == 0:
            pattern = base_pattern
        else:
            pattern = f"{base_pattern}({suffix_pattern}){{{min_suffix_repeats},{max_suffix_repeats}}}"

        padded_pattern = f"({pad}{pattern}{pad})"

        if not min_elem:
            return f"({padded_pattern}|{pad})"
        else:
            return padded_pattern

Copy link

@ekagra-ranjan ekagra-ranjan Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didnt get you. I just meant that right now the regex is this:

rf"\[{whitespace_pattern}({'|'.join(regexes)})(,{whitespace_pattern}({'|'.join(regexes)})){num_repeats}{allow_empty}{whitespace_pattern}\]"

and I think it should be this:

rf"\[({whitespace_pattern}({'|'.join(regexes)})(,{whitespace_pattern}({'|'.join(regexes)})){num_repeats}){allow_empty}{whitespace_pattern}\]"

For e.g its this [(true|false)(,(true|false)){0,}?] vs [((true|false)(,(true|false)){0,})?]. The former wont allow [] whereas the latter will allow it

Does this make sense?

('{"a": 1, "b": null}', True),
('{"a": {"z": {"g": 4}}, "b": null}', True),
("1234", False), # not an object
('["a", "a"]', False), # not an array

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lapp0 are we missing a test case for having arrays inside object?

Copy link
Collaborator Author

@lapp0 lapp0 Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments! It's important that we create patterns that follow json-schema spec expectations. I'll ensure the two conditions you've referenced are tested in the refactor here lapp0#36

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

json_schema.py: If schema has an array with no items key, the pattern is invalid
3 participants