Script to store JSON copy of metadata next to example Parquet files #68

kylebarron · 2022-04-15T02:13:59Z

Changes

New scripts/ folder to hold scripts that we don't expect to be useful outside of the repo.
Python script to read all the Parquet files in the examples/ directory and store a JSON copy of the schema metadata (note, not the entire schema, just the metadata). Using indent=2 and sort_keys=True to ensure JSON files are the same each time.
Poetry dependency specification and lockfile so that scripts are reproducible. I'm a big fan of lockfiles to ensure reproducibility, but if people are opposed to this, we could instruct people to use a pip command.
CI config to run the script on each commit and PR and verify that the checked-in schema is up to date.

To run the script:

cd scripts && poetry install && poetry run python update_example_schemas.py

Closes #65

kylebarron · 2022-04-15T16:56:38Z

scripts/update_example_schemas.py

+def decode_dict(d: Dict[Union[bytes, str], Union[bytes, Any]]) -> Dict[str, Any]:
+    """Decode binary keys and values to string and dict objects"""
+    new_dict = {}
+    for key, val in d.items():
+        new_key = key.decode() if isinstance(key, bytes) else key
+        new_val = val.decode() if isinstance(val, bytes) else val
+        if new_val.lstrip().startswith("{"):
+            new_val = json.loads(new_val)
+
+        new_dict[new_key] = new_val
+
+    return new_dict


Note that if we only care about the metadata in the geo key, this can be simplified to

return json.loads(d[b'geo'])

as in:

geoparquet/examples/example.py

Line 9 in 8a996a3

>>> pprint.pprint(json.loads(pq.read_schema("example.parquet").metadata[b"geo"]))

But I wasn't sure if there would be a use case where we wanted to have other file-level metadata that also gets copied into the JSON file.

Seems useful to me to try to print out other metadata that might be in there. Would help people catch if they accidentally didn't use the geo key, or use it right.

cholmes

This worked well enough for me that I'm happy to approve - I tested it with the nz-buildings example, seemed to get the new JSON in the examples directory.

But I do think it'd be good to have a simple readme, telling users what they need to do in order to use the script. As mr. naive python user I didn't know what to do with poetry.lock or pyproject.toml. I just ran python update_example_schemas.py and it worked. Maybe that is all I need to do, but a readme telling me to do so would make me feel more confident.

cholmes · 2022-04-15T20:46:14Z

scripts/update_example_schemas.py

+def decode_dict(d: Dict[Union[bytes, str], Union[bytes, Any]]) -> Dict[str, Any]:
+    """Decode binary keys and values to string and dict objects"""
+    new_dict = {}
+    for key, val in d.items():
+        new_key = key.decode() if isinstance(key, bytes) else key
+        new_val = val.decode() if isinstance(val, bytes) else val
+        if new_val.lstrip().startswith("{"):
+            new_val = json.loads(new_val)
+
+        new_dict[new_key] = new_val
+
+    return new_dict


Seems useful to me to try to print out other metadata that might be in there. Would help people catch if they accidentally didn't use the geo key, or use it right.

cholmes · 2022-04-15T20:51:16Z

scripts/update_example_schemas.py

+    """
+    folder = Path(__file__).resolve()
+
+    while folder.name != "geoparquet":


Note this fails if the user checked out the repo with a name other than geoparquet. Probably not all that likely to happen, but I happened to have it named differently so it didn't find it. Was easy to fix.

Jesus89

Thanks for the contribution!

I've added some comments. The most relevant one is the one about the content of the metadata.json file.

Jesus89 · 2022-04-18T07:26:33Z

scripts/poetry.lock

@@ -0,0 +1,79 @@
+[[package]]


I wouldn't include a lock file for an internal tool. But no strong opinion on that.

On the contrary, I have a strong preference for a lockfile. Because this is a distributed project, making sure that every contributor has the exact same set of dependencies ensures that we're on the same page and not missing any small bugs.

For example I recently learned that Pyarrow v7 updated their LZ4 compression format when writing files, and so LZ4-compressed parquet files written with pyarrow v6 and pyarrow v7 are not the same. When everyone has the exact same environment, we no longer have misunderstandings over that class of bugs.

Jesus89 · 2022-04-18T07:28:27Z

scripts/pyproject.toml

@@ -0,0 +1,16 @@
+[tool.poetry]


It's fine to use poetry although I've not used it a lot, we should standardize the tools across the repo

I'm also in favor of standardization, though I think it's less important between the scripts and the validator if we plan for the validator to get spun off into its own repo at some point. Mainly, I'm in favor of Poetry because I think lockfiles are important and it seems to be the most ergonomic tool right now for achieving this. We could always do a locked requirements.txt if we needed a different solution.

Jesus89 · 2022-04-18T07:53:24Z

examples/example_metadata.json

@@ -0,0 +1,19 @@
+{


I would encode just the "geo" metadata because this could confuse the reader about the internal structure of metadata.

The real structure is b"geo": "{<geometadata>}" instead of "geo": {<geometadata>}. One option is to store this:

# example_geometadata.json { "columns": {}, "primary_column": "", ... }

My thinking is it would be better to include both, because we could have an example where we show that the user can include both geo metadata and their own custom metadata. That answers a potential question: "I'm already using some custom metadata in my Parquet files; does that conflict with adding GeoParquet metadata?"

It's true that the metadata is stored as bytes and not as strings, but for the informative purposes of this metadata, I don't think omitting the fact that 'geo' is of type bytes rather than str is an issue.

OK, I think we can add later documentation clarifying the format of the JSON file, and refer to the spec for the metadata encoding.

Note that the b"geo" vs "geo" is only a pyarrow interface issue (which exposes metadata as bytes), the actual Parquet metadata key/values are stores as UTF8 strings, which is mentioned in the spec:

geoparquet/format-specs/geoparquet.md

Line 24 in d8592c1

These are both stored under a "geo" key in the parquet metadata (the [`FileMetaData::key_value_metadata`](https://github.com/apache/parquet-format#metadata)) as a JSON-encoded UTF-8 string.

(this was clarified in #6)

scripts/update_example_schemas.py

jorisvandenbossche · 2022-04-19T21:11:41Z

Thanks for setting this up @kylebarron !

Another follow-up (that could be tracked in a new issue) is add some README content showing some commands how to set up the env + how to run the script?

Create copies of schema metadata as json

46d709d

kylebarron commented Apr 15, 2022

View reviewed changes

cholmes approved these changes Apr 15, 2022

View reviewed changes

Jesus89 reviewed Apr 18, 2022

View reviewed changes

cholmes added this to the 0.2 milestone Apr 18, 2022

cholmes mentioned this pull request Apr 18, 2022

Add script functionality to update parquet binaries from JSON examples #69

Closed

Jesus89 merged commit 6b5da4b into opengeospatial:main Apr 19, 2022

kylebarron deleted the kyle/schema-meta-as-json branch April 19, 2022 13:17

kylebarron mentioned this pull request Apr 19, 2022

Update validator with the latest spec changes #70

Merged

kylebarron mentioned this pull request Apr 21, 2022

Add README for scripts folder #76

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Script to store JSON copy of metadata next to example Parquet files #68

Script to store JSON copy of metadata next to example Parquet files #68

kylebarron commented Apr 15, 2022 •

edited

Loading

kylebarron Apr 15, 2022

cholmes Apr 15, 2022

cholmes left a comment

cholmes Apr 15, 2022

cholmes Apr 15, 2022

Jesus89 left a comment

Jesus89 Apr 18, 2022

kylebarron Apr 18, 2022 •

edited

Loading

Jesus89 Apr 18, 2022

kylebarron Apr 18, 2022 •

edited

Loading

Jesus89 Apr 18, 2022

kylebarron Apr 18, 2022

Jesus89 Apr 19, 2022

jorisvandenbossche Apr 19, 2022

jorisvandenbossche commented Apr 19, 2022 •

edited

Loading

Script to store JSON copy of metadata next to example Parquet files #68

Script to store JSON copy of metadata next to example Parquet files #68

Conversation

kylebarron commented Apr 15, 2022 • edited Loading

Changes

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cholmes left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Jesus89 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kylebarron Apr 18, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kylebarron Apr 18, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Apr 19, 2022 • edited Loading

kylebarron commented Apr 15, 2022 •

edited

Loading

kylebarron Apr 18, 2022 •

edited

Loading

kylebarron Apr 18, 2022 •

edited

Loading

jorisvandenbossche commented Apr 19, 2022 •

edited

Loading