Clarify the Parquet FileMetadata value formatting (UTF8 string, JSON-encoded) #6
Closes #7
The keys and values of Parquet key-value metadata are required to be strings. See the definition in the thrift spec at https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L680-L683. In thrift, a string is defined as "A text string encoded using UTF-8 encoding".
Our current specification of the geo metadata should be fully compliant with this: it only includes string fields and values, and the whole is JSON-encoded, so it is still a valid UTF-8 string.
While removing the TODO item about this, I realized that we do not currently state how the key-values are stored (as a JSON-encoded string, at least that's what we currently do). So I added a sentence to clarify this. This documents the current status, although whether we are happy with the JSON encoding could use a broader discussion (-> opened #7).
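
For illustration, here is a minimal sketch of what "JSON-encoded, UTF-8 string" means in practice, using only the Python standard library. The key name `geo`, the field names in the dictionary, and the helper functions are assumptions made for this example and are not taken from the specification text.

```python
import json

# Hypothetical geo metadata; the field names are illustrative only.
geo_metadata = {
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB"}},
}


def encode_key_value(key, value):
    """Return the key and the JSON-encoded value as UTF-8 bytes."""
    value_str = json.dumps(value)  # the whole metadata becomes one string
    return key.encode("utf-8"), value_str.encode("utf-8")


def decode_value(raw):
    """Decode UTF-8 bytes and parse the JSON string back into a dict."""
    return json.loads(raw.decode("utf-8"))


# "geo" is an assumed key name for this sketch.
key_bytes, value_bytes = encode_key_value("geo", geo_metadata)
roundtrip = decode_value(value_bytes)
```

A reader of the file would do the reverse: look up the key in the file's key-value metadata, decode the value as UTF-8, and parse it as JSON.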