Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Script to store JSON copy of metadata next to example Parquet files #68

Merged
merged 1 commit into from
Apr 19, 2022

Conversation

kylebarron
Copy link
Collaborator

@kylebarron kylebarron commented Apr 15, 2022

Changes

  • New scripts/ folder to hold scripts that we don't expect to be useful outside of the repo.
  • Python script to read all the Parquet files in the examples/ directory and store a JSON copy of the schema metadata (note, not the entire schema, just the metadata). Using indent=2 and sort_keys=True to ensure JSON files are the same each time.
  • Poetry dependency specification and lockfile so that scripts are reproducible. I'm a big fan of lockfiles to ensure reproducibility, but if people are opposed to this, we could instruct people to use a pip command.
  • CI config to run the script on each commit and PR and verify that the checked-in schema is up to date.

To run the script:

cd scripts && poetry install && poetry run python update_example_schemas.py 

Closes #65

Comment on lines +34 to +45
def decode_dict(d: Dict[Union[bytes, str], Union[bytes, Any]]) -> Dict[str, Any]:
"""Decode binary keys and values to string and dict objects"""
new_dict = {}
for key, val in d.items():
new_key = key.decode() if isinstance(key, bytes) else key
new_val = val.decode() if isinstance(val, bytes) else val
if new_val.lstrip().startswith("{"):
new_val = json.loads(new_val)

new_dict[new_key] = new_val

return new_dict
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that if we only care about the metadata in the geo key, this can be simplified to

return json.loads(d[b'geo'])

as in:

>>> pprint.pprint(json.loads(pq.read_schema("example.parquet").metadata[b"geo"]))

But I wasn't sure if there would be a use case where we wanted to have other file-level metadata that also gets copied into the JSON file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems useful to me to try to print out other metadata that might be in there. Would help people catch if they accidentally didn't use the geo key, or use it right.

Copy link
Member

@cholmes cholmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This worked well enough for me that I'm happy to approve - I tested it with the nz-buildings example, seemed to get the new JSON in the examples directory.

But I do think it'd be good to have a simple readme, telling users what they need to do in order to use the script. As mr. naive python user I didn't know what to do with poetry.lock or pyproject.toml. I just ran python update_example_schemas.py and it worked. Maybe that is all I need to do, but a readme telling me to do so would make me feel more confident.

Comment on lines +34 to +45
def decode_dict(d: Dict[Union[bytes, str], Union[bytes, Any]]) -> Dict[str, Any]:
"""Decode binary keys and values to string and dict objects"""
new_dict = {}
for key, val in d.items():
new_key = key.decode() if isinstance(key, bytes) else key
new_val = val.decode() if isinstance(val, bytes) else val
if new_val.lstrip().startswith("{"):
new_val = json.loads(new_val)

new_dict[new_key] = new_val

return new_dict
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems useful to me to try to print out other metadata that might be in there. Would help people catch if they accidentally didn't use the geo key, or use it right.

"""
folder = Path(__file__).resolve()

while folder.name != "geoparquet":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this fails if the user checked out the repo with a name other than geoparquet. Probably not all that likely to happen, but I happened to have it named differently so it didn't find it. Was easy to fix.

Copy link
Collaborator

@Jesus89 Jesus89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

I've added some comments. The most relevant one is the one about the content of the metadata.json file.

@@ -0,0 +1,79 @@
[[package]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't include a lock file for an internal tool. But no strong opinion on that.

Copy link
Collaborator Author

@kylebarron kylebarron Apr 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the contrary, I have a strong preference for a lockfile. Because this is a distributed project, making sure that every contributor has the exact same set of dependencies ensures that we're on the same page and not missing any small bugs.

For example I recently learned that Pyarrow v7 updated their LZ4 compression format when writing files, and so LZ4-compressed parquet files written with pyarrow v6 and pyarrow v7 are not the same. When everyone has the exact same environment, we no longer have misunderstandings over that class of bugs.

@@ -0,0 +1,16 @@
[tool.poetry]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine to use poetry although I've not used it a lot, we should standardize the tools across the repo

Copy link
Collaborator Author

@kylebarron kylebarron Apr 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also in favor of standardization, though I think it's less important between the scripts and the validator if we plan for the validator to get spun off into its own repo at some point. Mainly, I'm in favor of Poetry because I think lockfiles are important and it seems to be the most ergonomic tool right now for achieving this. We could always do a locked requirements.txt if we needed a different solution.

@@ -0,0 +1,19 @@
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would encode just the "geo" metadata because this could confuse the reader about the internal structure of metadata.

The real structure is b"geo": "{<geometadata>}" instead of "geo": {<geometadata>}. One option is to store this:

# example_geometadata.json
{
  "columns": {},
  "primary_column": "",
  ...
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking is it would be better to include both, because we could have an example where we show that the user can include both geo metadata and their own custom metadata. That answers a potential question: "I'm already using some custom metadata in my Parquet files; does that conflict with adding GeoParquet metadata?"

It's true that the metadata is stored as bytes and not as strings, but for the informative purposes of this metadata, I don't think omitting the fact that 'geo' is of type bytes rather than str is an issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think we can add later documentation clarifying the format of the JSON file, and refer to the spec for the metadata encoding.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the b"geo" vs "geo" is only a pyarrow interface issue (which exposes metadata as bytes), the actual Parquet metadata key/values are stores as UTF8 strings, which is mentioned in the spec:

These are both stored under a "geo" key in the parquet metadata (the [`FileMetaData::key_value_metadata`](https://github.com/apache/parquet-format#metadata)) as a JSON-encoded UTF-8 string.

(this was clarified in #6)

scripts/update_example_schemas.py Show resolved Hide resolved
@cholmes cholmes added this to the 0.2 milestone Apr 18, 2022
@Jesus89 Jesus89 merged commit 6b5da4b into opengeospatial:main Apr 19, 2022
@kylebarron kylebarron deleted the kyle/schema-meta-as-json branch April 19, 2022 13:17
@jorisvandenbossche
Copy link
Collaborator

jorisvandenbossche commented Apr 19, 2022

Thanks for setting this up @kylebarron !

Another follow-up (that could be tracked in a new issue) is add some README content showing some commands how to set up the env + how to run the script?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Store JSON copy of metadata next to example Parquet files
4 participants