Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validator script for Python based on JSON Schema #58

Merged
merged 10 commits into from
Apr 13, 2022

Conversation

Jesus89
Copy link
Collaborator

@Jesus89 Jesus89 commented Apr 5, 2022

Implements #23.

This PR adds a draft for a basic validator using Python. It checks the metadata of a geoparquet file using JSON Schema, but it can be extended to include specific custom validations.

Example

Try to validate this wrong metadata

metadata = {
    "version": "0.x.0",
    "primary_column": "geom",
    "columns": {
        "geometry": {
            "encoding": "WKT",
            "edges": "",
            "bbox": [180, 90, 200, -90],
        },
    },
}
$ python3 validator/validate.py examples/example-wrong.parquet
Validating file...
- [ERROR] $.version: '0.x.0' does not match '^0\\.1\\.[0-9]+$'
          INFO: The version of the geoparquet metadata standard used when writing.
- [ERROR] $.columns.geometry.encoding: 'WKT' is not one of ['WKB']
          INFO: Name of the geometry encoding format. Currently only 'WKB' is supported.
- [ERROR] $.columns.geometry.edges: '' is not one of ['planar', 'spherical']
          INFO: Name of the coordinate system for the edges. Must be one of 'planar' or 'spherical'. The default value is 'planar'.
- [ERROR] $.columns.geometry.bbox[2]: 200 is greater than the maximum of 180
          INFO: The eastmost constant longitude line that bounds the rectangle (xmax).
- [ERROR] $.columns.geometry: 'crs' is a required property
- [ERROR] $.primary_column: must be in $.columns
This is an invalid GeoParquet file.

The output for a correct GeoParquet file

$ python3 validator/validate.py examples/example.parquet
Validating file...
This is a valid GeoParquet file.

validator/validate.py Outdated Show resolved Hide resolved
validator/validate.py Outdated Show resolved Hide resolved
@Jesus89 Jesus89 marked this pull request as ready for review April 11, 2022 15:48
@cholmes
Copy link
Member

cholmes commented Apr 11, 2022

So I'm a very naive python user, but just following the readme doesn't work for me:

cholmes@x9t validator % python3 validate.py ../examples/example.parquet 
Traceback (most recent call last):
  File "/Users/cholmes/Repos/geoparquet/validator/validate.py", line 5, in <module>
    import pyarrow.parquet as pq
ModuleNotFoundError: No module named 'pyarrow'

I assume I'm supposed to do some command that will look at the environment.yml but I don't know what it is. Would be good to include that in the readme.

@kylebarron
Copy link
Collaborator

If you use Conda, you can install from the environment.yml file. Either conda install -f environment.yml (installs into your global env iirc) or conda create -n env_name -f environment.yml.

Otherwise with pip you can just run pip install pyarrow jsonschema.

@cholmes
Copy link
Member

cholmes commented Apr 11, 2022

Ok, haven't managed the conda install yet, though sometime I would like to. I think it needs to be --file, looks like -f is 'force'. But I have some other problem.

Did find success with pip install though. The example in the repo validated. But https://storage.googleapis.com/open-geodata/linz-examples/nz-buildings-outlines.parquet didn't.

I had to download it locally - any chance the validator could work with online locations? Maybe it just needs some extra command / format that I don't know, but if so would be good to put that in the readme.

Once it was local I got an error:

(base) cholmes@x9t37t validator % python3 validate.py nz-buildings-outlines.parquet            
Validating file...
Traceback (most recent call last):
  File "/Users/cholmes/Repos/geoparquet/validator/validate.py", line 42, in <module>
    validate_geoparquet(sys.argv[1])
  File "/Users/cholmes/Repos/geoparquet/validator/validate.py", line 26, in validate_geoparquet
    print(f"- [ERROR] {error.json_path}: {error.message}")
AttributeError: 'ValidationError' object has no attribute 'json_path'

Not sure if that file needs an update for post 0.1.0? It looks like the example had an update right after 0.1.0 release.

@kylebarron
Copy link
Collaborator

kylebarron commented Apr 11, 2022

I don't use Conda myself, so maybe my instructions were incorrect 😄 .

any chance the validator could work with online locations

Yeah this is possible if the script is updated to use fsspec. See https://arrow.apache.org/docs/python/filesystems.html#using-fsspec-compatible-filesystems-with-arrow. (Then extending this you could also pass s3://, gcs:// etc locations into the validator).

For example this snippet loads only the schema from your file via range requests:

import pyarrow.parquet as pq
from pyarrow.fs import PyFileSystem, FSSpecHandler
from fsspec.implementations.http import HTTPFileSystem 

# create pyarrow filesystem that wraps the fsspec HTTPFileSystem
pa_fs = PyFileSystem(FSSpecHandler(HTTPFileSystem()))

url = 'https://storage.googleapis.com/open-geodata/linz-examples/nz-buildings-outlines.parquet'
%time schema = pq.read_schema(pa_fs.open_input_file(url))
# CPU times: user 22.8 ms, sys: 6.39 ms, total: 29.2 ms
# Wall time: 343 ms

(From that time it's clear only the schema was downloaded, not the entire 400MB file)

AttributeError: 'ValidationError' object has no attribute 'json_path'

Probably an issue with the jsonschema package version but I don't know myself.

@cayetanobv
Copy link
Collaborator

@Jesus89 take into account this PR #60 related to crs property as optional

@kylebarron kylebarron mentioned this pull request Apr 12, 2022
@kylebarron
Copy link
Collaborator

I put up a PR with some Python updates (#62). Setting aside the validator itself, I think the schema.json file should be merged ASAP so that we can start making updates to the schema definition when PRs to the spec are landed.

@cholmes
Copy link
Member

cholmes commented Apr 12, 2022

It seems like it could be good to put the schema.json in a main directory, like not in 'validator'? Like we can consider it a defining piece of the specification, that various validation tools can use.

I think eventually validation tools should move out of the core spec repo. I'm a fan of having it in the repo for now as we get started, but I'd eventually see a place for one or more dedicated validation projects in their own repos (or have validation as a part of other tool sets). But it seems like it'd make sense for them all to refer to the same schema.

@Jesus89
Copy link
Collaborator Author

Jesus89 commented Apr 12, 2022

Hi all. I've made some quick changes to make the script more accessible:

  • Convert into cli geoparquet_validator (it can be installed and used anywhere)
  • Support for remote files
  • Update the documentation (installation and usage)

image

The dependencies are attached in the setup.py with fixed values (for now). The goal now is to test it in different operating systems (I have tested it in Ubuntu) and then apply the changes for 0.2.0.

I have seen also this contribution from @kylebarron 👍 (#62). I'll merge it first and then adapt my changes

@cholmes
Copy link
Member

cholmes commented Apr 12, 2022

Awesome! I'll try it in mac os when I get up tomorrow.

@Jesus89
Copy link
Collaborator Author

Jesus89 commented Apr 12, 2022

This is how it looks now, showing the geo metadata:

image

image

I've moved the validator/ directory so it can be used with other implementations too.

@cholmes
Copy link
Member

cholmes commented Apr 12, 2022

Just tried it out on Mac, it works great!

Though we should fix the buildings file, it still has the schema_version instead of version. If someone can fix that one and link to it for me I can replace it on google cloud.

Copy link
Member

@cholmes cholmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

As I said in the comments, I do think we should eventually move validation code out of this repository and into dedicated projects. But I do really like providing a very obvious validator here, that will stay up to date with spec changes, so we should merge this and do a few releases with the validator.

Perhaps we should set up CI so that we're sure the example file stays up to date with the example(s). Or at least write up that a 'release process' should validate that all examples are up to date. Right now it's just 2 examples, but could make sense to put up a few more.

It's probably a larger discussion, but it may make sense to move the json schema to be 'part' of the 'spec', like alongside the text description. I'll try to raise a discussion topic on it.

Copy link
Collaborator

@kylebarron kylebarron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Just a few more suggestions. Also fine if we merge and then iterate further

validator/python/.gitignore Outdated Show resolved Hide resolved
validator/python/README.md Outdated Show resolved Hide resolved
validator/python/geoparquet_validator/__init__.py Outdated Show resolved Hide resolved
validator/schema.json Outdated Show resolved Hide resolved
validator/python/setup.py Outdated Show resolved Hide resolved
validator/python/setup.py Show resolved Hide resolved
@Jesus89
Copy link
Collaborator Author

Jesus89 commented Apr 13, 2022

Thanks for your comments and feedback! 👍

I have created a PR to update version to schema_version, feel free to take a look: #63. I like the idea of CI, could be interesting to make the spec more stable. And sure, the JSON Schema should be part of the spec.

I've implemented some of the suggestions in this PR. If you are OK with it, @cholmes @kylebarron, I think we can move forward and merge into master to continue iterating.

@kylebarron
Copy link
Collaborator

kylebarron commented Apr 13, 2022

I'm fine with merging but I don't have write access 🙂 . Not sure if there's a minimum number of reviewers with write access that we want before merging 🤷‍♂️

@pomadchin
Copy link
Collaborator

pomadchin commented Apr 13, 2022

So sad that the conda env file is gone.

Tested it, a very nice addition 🚀
I decided to merge it rightaway, can be improved later in a separate PR if needed; there is a lot of outstanding PRs already.

@pomadchin pomadchin merged commit 8a996a3 into main Apr 13, 2022
@kylebarron
Copy link
Collaborator

So sad that the conda env file is gone.

I don't use conda myself but not opposed to having the env file there, as long as it's not too hard to keep in sync with the main dependency list.

@pomadchin
Copy link
Collaborator

pomadchin commented Apr 13, 2022

@kylebarron I appreciate it, but mb let's keep it simple? If there will be a need in conda we can reconsider it, could be indeed an extra complexity for now; and I'm the only conda user here...

@Jesus89 Jesus89 deleted the jarroyo/validator branch April 13, 2022 16:53
@Jesus89
Copy link
Collaborator Author

Jesus89 commented Apr 13, 2022

Thanks, @pomadchin for merging 😄

Regarding conda, I decided to make it simple using only pip. I don't use conda either but feel free to create a pull request, or an issue with the request.

@cholmes cholmes added this to the 0.2 milestone Apr 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants