-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify nesting and repetition of geometry columns #138
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
format-specs/geoparquet.md
Outdated
|
||
### Repetition | ||
|
||
The repetition for all geometry columns must be "required" (exactly one) or "optional" (zero or one). A geometry column must not be repeated. A GeoParquet file may have multiple geometry columns (with different names), but those geometry columns cannot be repeated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wording for the details look good. I am only wondering if we should clarify a bit more what this means (i.e. "cannot be a list/array type column"), for those less familiar with those parquet terminology. Or add a link to the geoparquet documentation that explains those terms (although I am not sure such a link exists .. https://parquet.apache.org/docs/file-format/nestedencoding/ is not really helping)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another subtle source of confusion is that a conceptual list can be represented with "repeated" repetition OR with a LIST
annotation.
Yeah, the documentation is lacking. I find this a bit more useful: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we want to say is that geometries cannot appear in nested structures - this rules out LIST
and MAP
. In addition, the repetition for a geometry column must be "optional" or "required".
I tried to capture this by explicitly adding sections for nesting and repetition. But am open to suggested improvements in the wording.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also include examples and counter-examples.
Given "primary_column": "geometry"
, here is a valid schema with an optional geometry:
message {
optional binary name (STRING);
optional binary geometry;
}
Given "primary_column": "geometry"
, here is a valid schema with a required geometry:
message {
optional binary name (STRING);
required binary geometry;
}
However, given "primary_column": "geometry"
, it is not valid to have a schema with a repeated geometry:
message {
optional binary name (STRING);
repeated binary geometry;
}
In addition, it is not valid to have a geometry in a nested field. For example, you cannot have "primary_column": "place.geometry"
and a schema like this:
message {
optional group place {
optional binary name (STRING);
optional binary geometry;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Examples sound good on this, since it is pretty confusing if you're not deep in parquet (and I imagine many implementors will not be). I'm not sure if we want a bunch of examples in the spec itself, but maybe some 'side' document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also include examples and counter-examples.
I am not sure I would include those (or at least not in the main text but in some appendix, as @cholmes suggests). I have the feeling that either you are familiar with Parquet details and you will understand what it means that a column can't be repeated (and not be in a LIST or MAP logical type) or in a group, or either you are not familiar with those terms, but then also those examples are too technical?
Another subtle source of confusion is that a conceptual list can be represented with "repeated" repetition OR with a
LIST
annotation.
The logical type LIST annotation is optional, but I think there will always be some "repeated" level if you have a list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logical type LIST annotation is optional, but I think there will always be some "repeated" level if you have a list?
@jorisvandenbossche - That's right, but the top level field is not repeated in the case of a logical list (it can only be required
or optional
).
Here is an Example
message schema that uses the logical LIST
for a ListOfString
and just a plain old repeated field for RepeatedString
:
message Example {
required group ListOfString (LIST) {
repeated group list {
required binary element (STRING);
}
}
repeated binary RepeatedString (STRING);
}
The nested group list
has repeated
for the repetition (as the spec requires). But the top level ListOfString
field is not repeated
(and cannot be for a logical list).
So a BadExample
schema for GeoParquet might have top-level ListOfGeometry
and RepeatedGeometry
fields that looked like this:
message BadExample {
required group ListOfGeometry (LIST) {
repeated group list {
required binary element;
}
}
repeated binary RepeatedGeometry;
}
The ListOfGeometry
is a logical LIST
where the element type is binary
. We wouldn't want to allow this type of geometry. And maybe the language in this branch makes that clear by saying geometries in group fields are not allowed.
The RepeatedGeometry
is just a repeated binary
field. We don't allow that either, and the text in this branch tries to make that clear by saying the repetition must be optional
or required
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that sounds fully correct.
Coming back to my initial comment, I think it could still be useful for people less familiar with those parquet details to add something like:
In practice, this means that geometry columns can not be contained in a complex or nested type, such as struct, list/array or map types.
(I don't know if we then should be explicit those terms are meant as general "as could be known like this in other data systems", since for example "struct" or "array" is not a term from Parquet, but it are the terms how readers might know those types)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche - I agree that mention of types familiar to users of other systems makes sense. Let me know if you think the latest commit captures it.
f62a3d8
to
55831de
Compare
This adds text to clarify that geometry columns must be at the root of the schema (not nested) and that the repetition must be "optional" or "required" (not "repeated").
Fixes #47.