
Enhance error reporting and handling for large JSON validations #443

Open
NiklasA opened this issue Sep 26, 2024 · 2 comments

Comments


NiklasA commented Sep 26, 2024

First of all, thank you very much for the great work and for providing the datacontract-cli!


While using it, we noticed three topics:

  1. We use the datacontract-cli to check the structure and content of large JSON files. Currently, the validator compiled by fastjsonschema.compile stops at the first error, so if a JSON file contains multiple errors, the datacontract-cli has to be run repeatedly until all of them are fixed. We also considered executing data_contract.test() for each object in the JSON file, but the overhead is too large, resulting in poor performance. Would it be possible to adjust the following call:

    validate = fastjsonschema.compile(
        schema,
        formats={"uuid": r"^[0-9a-fA-F]{8}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{12}$"},
    )

    so that it works more like the following approach:

    validate = fastjsonschema.compile(
        schema,
        formats={"uuid": r"^[0-9a-fA-F]{8}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{12}$"},
    )
    
    def validate_array(data):
        errors = []
        for index, item in enumerate(data):
            try:
                validate(item)
            except fastjsonschema.JsonSchemaException as e:
                # collect the error and continue instead of aborting on the first failure
                errors.append({"index": index, "error": str(e)})
        return errors
  2. Additionally, we noticed that in data_contract.py on line 202, the model name is not included in the list of executed checks.

    run.checks.append(
        Check(type=e.type, name=e.name, result=e.result, engine=e.engine, reason=e.reason, model=e.model)
    )
  3. Lastly, it would be really helpful for debugging purposes if we could identify which JSON object in an array caused the error. Could the validate_json_stream method be extended so that it identifies the primary key of the related model in the data contract and returns the value from the found attribute along with the error if needed?
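To illustrate items 1 and 3 together, here is a minimal sketch of per-item error reporting that also returns the primary key value of the failing record. The `validate_items` helper, the `primary_key` parameter, and the toy validator are hypothetical and not part of datacontract-cli; a compiled fastjsonschema validator would take the place of `toy_validate`:

```python
def validate_items(data, validate, primary_key=None, exc_type=Exception):
    """Validate each item, collecting errors instead of stopping at the first."""
    errors = []
    for index, item in enumerate(data):
        try:
            validate(item)
        except exc_type as e:
            entry = {"index": index, "error": str(e)}
            if primary_key is not None:
                # include the primary key value so the offending record is easy to find
                entry["key"] = item.get(primary_key)
            errors.append(entry)
    return errors

# Toy validator standing in for a compiled fastjsonschema validator:
def toy_validate(item):
    if not isinstance(item.get("id"), str):
        raise ValueError("id must be a string")

data = [{"id": "a"}, {"id": 1}, {"id": "c"}]
errors = validate_items(data, toy_validate, primary_key="id", exc_type=ValueError)
# errors -> [{"index": 1, "error": "id must be a string", "key": 1}]
```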


Let me know if I can help to create a pull request for those topics!

Best regards
Niklas

@jochenchrist
Contributor

Hi @NiklasA,

Thanks for your feedback and these improvement ideas.

  1. I understand the idea, and it would certainly be doable. I hesitate a little: what happens when you have millions of JSON records that all have the same schema issue? Would you really want to see millions of errors in the response? We would need to implement a maximum number of errors (100, say) here.

  2. Fixed and committed to main.

  3. Agree. Not sure about the primary key, but we can think of an index here in the error message.

You are welcome to contribute a PR for 1 and 3.
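The cap suggested in point 1 could look roughly like the following sketch. The `validate_array` signature, `max_errors` parameter, and toy validator are hypothetical, not existing datacontract-cli code:

```python
def validate_array(data, validate, max_errors=100, exc_type=Exception):
    """Collect validation errors, stopping once max_errors have been recorded."""
    errors = []
    for index, item in enumerate(data):
        try:
            validate(item)
        except exc_type as e:
            errors.append({"index": index, "error": str(e)})
            if len(errors) >= max_errors:
                # avoid flooding the report when millions of records share one issue
                break
    return errors

# Toy validator standing in for a compiled fastjsonschema validator:
def toy_validate(item):
    if "name" not in item:
        raise ValueError("missing required field: name")

data = [{}] * 5  # five records, all with the same schema issue
errors = validate_array(data, toy_validate, max_errors=2, exc_type=ValueError)
# errors has length 2, not 5
```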


NiklasA commented Oct 12, 2024

@jochenchrist okay, thank you! I will work on the PR for 1 and 3. It will probably take 3 - 4 weeks because I'm on vacation next week.
