Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify nesting and repetition of geometry columns #138

Merged
merged 1 commit into from
Nov 17, 2022

Conversation

tschaub
Copy link
Collaborator

@tschaub tschaub commented Nov 7, 2022

This adds text to clarify that geometry columns must be at the root of the schema (not nested) and that the repetition must be "optional" or "required" (not "repeated").

Fixes #47.

Copy link
Collaborator

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!


### Repetition

The repetition for all geometry columns must be "required" (exactly one) or "optional" (zero or one). A geometry column must not be repeated. A GeoParquet file may have multiple geometry columns (with different names), but those geometry columns cannot be repeated.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wording for the details look good. I am only wondering if we should clarify a bit more what this means (i.e. "cannot be a list/array type column"), for those less familiar with those parquet terminology. Or add a link to the geoparquet documentation that explains those terms (although I am not sure such a link exists .. https://parquet.apache.org/docs/file-format/nestedencoding/ is not really helping)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another subtle source of confusion is that a conceptual list can be represented with "repeated" repetition OR with a LIST annotation.

Yeah, the documentation is lacking. I find this a bit more useful: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we want to say is that geometries cannot appear in nested structures - this rules out LIST and MAP. In addition, the repetition for a geometry column must be "optional" or "required".

I tried to capture this by explicitly adding sections for nesting and repetition. But am open to suggested improvements in the wording.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also include examples and counter-examples.

Given "primary_column": "geometry", here is a valid schema with an optional geometry:

message {
  optional binary name (STRING);
  optional binary geometry;
}

Given "primary_column": "geometry", here is a valid schema with a required geometry:

message {
  optional binary name (STRING);
  required binary geometry;
}

However, given "primary_column": "geometry", it is not valid to have a schema with a repeated geometry:

message {
  optional binary name (STRING);
  repeated binary geometry;
}

In addition, it is not valid to have a geometry in a nested field. For example, you cannot have "primary_column": "place.geometry" and a schema like this:

message {
  optional group place {
    optional binary name (STRING);
    optional binary geometry;
  }
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Examples sound good on this, since it is pretty confusing if you're not deep in parquet (and I imagine many implementors will not be). I'm not sure if we want a bunch of examples in the spec itself, but maybe some 'side' document.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also include examples and counter-examples.

I am not sure I would include those (or at least not in the main text but in some appendix, as @cholmes suggests). I have the feeling that either you are familiar with Parquet details and you will understand what it means that a column can't be repeated (and not be in a LIST or MAP logical type) or in a group, or either you are not familiar with those terms, but then also those examples are too technical?

Another subtle source of confusion is that a conceptual list can be represented with "repeated" repetition OR with a LIST annotation.

The logical type LIST annotation is optional, but I think there will always be some "repeated" level if you have a list?

Copy link
Collaborator Author

@tschaub tschaub Nov 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logical type LIST annotation is optional, but I think there will always be some "repeated" level if you have a list?

@jorisvandenbossche - That's right, but the top level field is not repeated in the case of a logical list (it can only be required or optional).

Here is an Example message schema that uses the logical LIST for a ListOfString and just a plain old repeated field for RepeatedString:

message Example {
  required group ListOfString (LIST) {
    repeated group list {
      required binary element (STRING);
    }
  }
  repeated binary RepeatedString (STRING);
}

The nested group list has repeated for the repetition (as the spec requires). But the top level ListOfString field is not repeated (and cannot be for a logical list).

So a BadExample schema for GeoParquet might have top-level ListOfGeometry and RepeatedGeometry fields that looked like this:

message BadExample {
  required group ListOfGeometry (LIST) {
    repeated group list {
      required binary element;
    }
  }
  repeated binary RepeatedGeometry;
}

The ListOfGeometry is a logical LIST where the element type is binary. We wouldn't want to allow this type of geometry. And maybe the language in this branch makes that clear by saying geometries in group fields are not allowed.

The RepeatedGeometry is just a repeated binary field. We don't allow that either, and the text in this branch tries to make that clear by saying the repetition must be optional or required.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that sounds fully correct.

Coming back to my initial comment, I think it could still be useful for people less familiar with those parquet details to add something like:

In practice, this means that geometry columns can not be contained in a complex or nested type, such as struct, list/array or map types.

(I don't know if we then should be explicit those terms are meant as general "as could be known like this in other data systems", since for example "struct" or "array" is not a term from Parquet, but it are the terms how readers might know those types)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche - I agree that mention of types familiar to users of other systems makes sense. Let me know if you think the latest commit captures it.

@tschaub tschaub force-pushed the clarify-geometry-columns branch 3 times, most recently from f62a3d8 to 55831de Compare November 14, 2022 15:24
@tschaub tschaub merged commit 3a38802 into opengeospatial:main Nov 17, 2022
@tschaub tschaub deleted the clarify-geometry-columns branch November 17, 2022 19:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clarify usage with nested and repeated columns
3 participants