Clarify the Parquet FileMetadata value formatting (UTF8 string, JSON-encoded) #6

Merged

@jorisvandenbossche jorisvandenbossche commented Feb 22, 2022

Closes #7

The keys and values of Parquet key-value metadata are required to be strings. See the definition in the Thrift spec at https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L680-L683. In Thrift, a string is defined as "A text string encoded using UTF-8 encoding".

Our current specification of the geo metadata should be fully compliant with this: it only includes string fields and values, and the whole is JSON-encoded, so the result is still a valid UTF-8 string.

While removing the TODO item about this, I realized that we don't currently mention how the key-values are stored (as a JSON-encoded string, which is at least what we currently do). So I added a sentence to clarify this. This documents the current status, although it could use a broader discussion on whether we are happy with the JSON encoding (-> opened #7).
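
To illustrate the encoding described above, here is a minimal sketch using only the Python standard library. It assumes the metadata is stored under a single `geo` key; the field names and values inside `geo_metadata` are hypothetical placeholders rather than a statement of what the spec requires.

```python
import json

# Hypothetical geo metadata; field names and values are illustrative only.
geo_metadata = {
    "version": "0.1.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "crs": "EPSG:4326"}},
}

# Parquet key-value metadata holds one string per key, so the nested
# structure is serialized into a single JSON string.
key = "geo"  # assumed key name
value = json.dumps(geo_metadata)

# Thrift strings are UTF-8 encoded text; these are the bytes that would be stored.
key_bytes = key.encode("utf-8")
value_bytes = value.encode("utf-8")

# A reader reverses both steps: decode UTF-8, then parse the JSON.
decoded = json.loads(value_bytes.decode("utf-8"))
assert decoded == geo_metadata
```

A reader that only understands plain Parquet would still see a valid string value for this key; only a geo-aware reader needs to parse the JSON.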

@cholmes cholmes merged commit 3f4099a into opengeospatial:main Mar 1, 2022
@jorisvandenbossche jorisvandenbossche deleted the parquet-metadata-format branch March 1, 2022 17:28
Successfully merging this pull request may close these issues.

Encoding of the key-value metadata when stored in the Parquet FileMetadata