-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add validator script for Python based on JSON Schema #58
Conversation
So I'm a very naive python user, but just following the readme doesn't work for me:
I assume I'm supposed to do some command that will look at the environment.yml but I don't know what it is. Would be good to include that in the readme. |
If you use Conda, you can install from the Otherwise with pip you can just run |
Ok, haven't managed the conda install yet, though sometime I would like to. I think it needs to be --file, looks like -f is 'force'. But I have some other problem. Did find success with pip install though. The example in the repo validated. But https://storage.googleapis.com/open-geodata/linz-examples/nz-buildings-outlines.parquet didn't. I had to download it locally - any chance the validator could work with online locations? Maybe it just needs some extra command / format that I don't know, but if so would be good to put that in the readme. Once it was local I got an error:
Not sure if that file needs an update for post 0.1.0? It looks like the example had an update right after 0.1.0 release. |
I don't use Conda myself, so maybe my instructions were incorrect 😄 .
Yeah this is possible if the script is updated to use fsspec. See https://arrow.apache.org/docs/python/filesystems.html#using-fsspec-compatible-filesystems-with-arrow. (Then extending this you could also pass For example this snippet loads only the schema from your file via range requests: import pyarrow.parquet as pq
from pyarrow.fs import PyFileSystem, FSSpecHandler
from fsspec.implementations.http import HTTPFileSystem
# create pyarrow filesystem that wraps the fsspec HTTPFileSystem
pa_fs = PyFileSystem(FSSpecHandler(HTTPFileSystem()))
url = 'https://storage.googleapis.com/open-geodata/linz-examples/nz-buildings-outlines.parquet'
%time schema = pq.read_schema(pa_fs.open_input_file(url))
# CPU times: user 22.8 ms, sys: 6.39 ms, total: 29.2 ms
# Wall time: 343 ms (From that time it's clear only the schema was downloaded, not the entire 400MB file)
Probably an issue with the |
I put up a PR with some Python updates (#62). Setting aside the validator itself, I think the |
It seems like it could be good to put the schema.json in a main directory, like not in 'validator'? Like we can consider it a defining piece of the specification, that various validation tools can use. I think eventually validation tools should move out of the core spec repo. I'm a fan of having it in the repo for now as we get started, but I'd eventually see a place for one or more dedicated validation projects in their own repos (or have validation as a part of other tool sets). But it seems like it'd make sense for them all to refer to the same schema. |
Hi all. I've made some quick changes to make the script more accessible:
The dependencies are attached in the setup.py with fixed values (for now). The goal now is to test it in different operating systems (I have tested it in Ubuntu) and then apply the changes for 0.2.0. I have seen also this contribution from @kylebarron 👍 (#62). I'll merge it first and then adapt my changes |
Awesome! I'll try it in mac os when I get up tomorrow. |
Just tried it out on Mac, it works great! Though we should fix the buildings file, it still has the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
As I said in the comments, I do think we should eventually move validation code out of this repository and into dedicated projects. But I do really like providing a very obvious validator here, that will stay up to date with spec changes, so we should merge this and do a few releases with the validator.
Perhaps we should set up CI so that we're sure the example file stays up to date with the example(s). Or at least write up that a 'release process' should validate that all examples are up to date. Right now it's just 2 examples, but could make sense to put up a few more.
It's probably a larger discussion, but it may make sense to move the json schema to be 'part' of the 'spec', like alongside the text description. I'll try to raise a discussion topic on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Just a few more suggestions. Also fine if we merge and then iterate further
Thanks for your comments and feedback! 👍 I have created a PR to update I've implemented some of the suggestions in this PR. If you are OK with it, @cholmes @kylebarron, I think we can move forward and merge into master to continue iterating. |
I'm fine with merging but I don't have write access 🙂 . Not sure if there's a minimum number of reviewers with write access that we want before merging 🤷♂️ |
So sad that the conda env file is gone. Tested it, a very nice addition 🚀 |
I don't use conda myself but not opposed to having the env file there, as long as it's not too hard to keep in sync with the main dependency list. |
@kylebarron I appreciate it, but mb let's keep it simple? If there will be a need in conda we can reconsider it, could be indeed an extra complexity for now; and I'm the only conda user here... |
Thanks, @pomadchin for merging 😄 Regarding conda, I decided to make it simple using only pip. I don't use conda either but feel free to create a pull request, or an issue with the request. |
Implements #23.
This PR adds a draft for a basic validator using Python. It checks the metadata of a geoparquet file using JSON Schema, but it can be extended to include specific custom validations.
Example
Try to validate this wrong metadata
The output for a correct GeoParquet file