-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Script to store JSON copy of metadata next to example Parquet files #68
Script to store JSON copy of metadata next to example Parquet files #68
Conversation
def decode_dict(d: Dict[Union[bytes, str], Union[bytes, Any]]) -> Dict[str, Any]: | ||
"""Decode binary keys and values to string and dict objects""" | ||
new_dict = {} | ||
for key, val in d.items(): | ||
new_key = key.decode() if isinstance(key, bytes) else key | ||
new_val = val.decode() if isinstance(val, bytes) else val | ||
if new_val.lstrip().startswith("{"): | ||
new_val = json.loads(new_val) | ||
|
||
new_dict[new_key] = new_val | ||
|
||
return new_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that if we only care about the metadata in the geo
key, this can be simplified to
return json.loads(d[b'geo'])
as in:
geoparquet/examples/example.py
Line 9 in 8a996a3
>>> pprint.pprint(json.loads(pq.read_schema("example.parquet").metadata[b"geo"])) |
But I wasn't sure if there would be a use case where we wanted to have other file-level metadata that also gets copied into the JSON file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems useful to me to try to print out other metadata that might be in there. Would help people catch if they accidentally didn't use the geo key, or use it right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This worked well enough for me that I'm happy to approve - I tested it with the nz-buildings example, seemed to get the new JSON in the examples directory.
But I do think it'd be good to have a simple readme, telling users what they need to do in order to use the script. As mr. naive python user I didn't know what to do with poetry.lock or pyproject.toml. I just ran python update_example_schemas.py
and it worked. Maybe that is all I need to do, but a readme telling me to do so would make me feel more confident.
def decode_dict(d: Dict[Union[bytes, str], Union[bytes, Any]]) -> Dict[str, Any]: | ||
"""Decode binary keys and values to string and dict objects""" | ||
new_dict = {} | ||
for key, val in d.items(): | ||
new_key = key.decode() if isinstance(key, bytes) else key | ||
new_val = val.decode() if isinstance(val, bytes) else val | ||
if new_val.lstrip().startswith("{"): | ||
new_val = json.loads(new_val) | ||
|
||
new_dict[new_key] = new_val | ||
|
||
return new_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems useful to me to try to print out other metadata that might be in there. Would help people catch if they accidentally didn't use the geo key, or use it right.
""" | ||
folder = Path(__file__).resolve() | ||
|
||
while folder.name != "geoparquet": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note this fails if the user checked out the repo with a name other than geoparquet. Probably not all that likely to happen, but I happened to have it named differently so it didn't find it. Was easy to fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution!
I've added some comments. The most relevant one is the one about the content of the metadata.json file.
@@ -0,0 +1,79 @@ | |||
[[package]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't include a lock file for an internal tool. But no strong opinion on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the contrary, I have a strong preference for a lockfile. Because this is a distributed project, making sure that every contributor has the exact same set of dependencies ensures that we're on the same page and not missing any small bugs.
For example I recently learned that Pyarrow v7 updated their LZ4 compression format when writing files, and so LZ4-compressed parquet files written with pyarrow v6 and pyarrow v7 are not the same. When everyone has the exact same environment, we no longer have misunderstandings over that class of bugs.
@@ -0,0 +1,16 @@ | |||
[tool.poetry] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine to use poetry although I've not used it a lot, we should standardize the tools across the repo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also in favor of standardization, though I think it's less important between the scripts and the validator if we plan for the validator to get spun off into its own repo at some point. Mainly, I'm in favor of Poetry because I think lockfiles are important and it seems to be the most ergonomic tool right now for achieving this. We could always do a locked requirements.txt
if we needed a different solution.
@@ -0,0 +1,19 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would encode just the "geo" metadata because this could confuse the reader about the internal structure of metadata.
The real structure is b"geo": "{<geometadata>}"
instead of "geo": {<geometadata>}
. One option is to store this:
# example_geometadata.json
{
"columns": {},
"primary_column": "",
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking is it would be better to include both, because we could have an example where we show that the user can include both geo
metadata and their own custom metadata. That answers a potential question: "I'm already using some custom metadata in my Parquet files; does that conflict with adding GeoParquet metadata?"
It's true that the metadata is stored as bytes and not as strings, but for the informative purposes of this metadata, I don't think omitting the fact that 'geo'
is of type bytes
rather than str
is an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think we can add later documentation clarifying the format of the JSON file, and refer to the spec for the metadata encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the b"geo"
vs "geo"
is only a pyarrow interface issue (which exposes metadata as bytes), the actual Parquet metadata key/values are stores as UTF8 strings, which is mentioned in the spec:
geoparquet/format-specs/geoparquet.md
Line 24 in d8592c1
These are both stored under a "geo" key in the parquet metadata (the [`FileMetaData::key_value_metadata`](https://github.com/apache/parquet-format#metadata)) as a JSON-encoded UTF-8 string. |
(this was clarified in #6)
Thanks for setting this up @kylebarron ! Another follow-up (that could be tracked in a new issue) is add some README content showing some commands how to set up the env + how to run the script? |
Changes
scripts/
folder to hold scripts that we don't expect to be useful outside of the repo.examples/
directory and store a JSON copy of the schema metadata (note, not the entire schema, just the metadata). Usingindent=2
andsort_keys=True
to ensure JSON files are the same each time.To run the script:
Closes #65