Updates for the 1.0.0-beta.1 release #161

Merged
merged 2 commits into from
Dec 15, 2022
20 changes: 20 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,20 @@
name: Release

on:
  push:
    tags:
      - 'v*.*.*'

jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Draft Release
        uses: softprops/action-gh-release@v1
        with:
          draft: true
          generate_release_notes: true
          files: |
            format-specs/geoparquet.md
            format-specs/schema.json
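
To get a feel for which tag names the `v*.*.*` filter in this workflow is meant to catch, the following Python sketch approximates it with `fnmatch`. This is only an approximation: GitHub Actions uses its own filter pattern syntax, and the helper name `triggers_release` and the sample tags are hypothetical.

```python
from fnmatch import fnmatchcase

# Tag filter copied from the workflow's `on.push.tags` entry.
TAG_PATTERN = "v*.*.*"


def triggers_release(tag: str) -> bool:
    """Approximate whether pushing `tag` would start the Release workflow."""
    return fnmatchcase(tag, TAG_PATTERN)


for tag in ("v1.0.0-beta.1", "v1.0.0", "1.0.0", "v1.0"):
    print(tag, triggers_release(tag))
```

Under this approximation a pre-release tag such as `v1.0.0-beta.1` matches, while a tag without the leading `v` or with fewer than two dots does not.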
Binary file modified examples/example.parquet
Binary file not shown.
2 changes: 1 addition & 1 deletion examples/example_metadata.json
@@ -115,6 +115,6 @@
}
},
"primary_column": "geometry",
"version": "0.5.0-dev"
"version": "1.0.0-dev"
}
}
80 changes: 19 additions & 61 deletions format-specs/geoparquet.md
@@ -1,9 +1,8 @@
# Geospatial Parquet format
# GeoParquet Specification

## Overview

The [Apache Parquet](https://parquet.apache.org/) provides a standardized open-source columnar storage format. This specification defines how geospatial data
should be stored in parquet format, including the representation of geometries and the required additional metadata.
The [Apache Parquet](https://parquet.apache.org/) provides a standardized open-source columnar storage format. The GeoParquet specification defines how geospatial data should be stored in parquet format, including the representation of geometries and the required additional metadata.

**Additional resources:**
* [Examples](../examples/)
@@ -13,7 +12,7 @@ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "S

## Version

This is version 0.5.0-dev of the GeoParquet specification.
This is version 1.0.0-dev of the GeoParquet specification.
Member

Will this mean that every beta and rc will be 1.0.0-dev? Is there a way to have it be 1.0.0-beta.2-dev?

Collaborator Author

We can adopt a different convention, but I think of the -dev modifier as being applied to the next release that we "want" - so typically a minor release instead of a patch release. We could have this be any placeholder really. It could be unreleased even.


## Geometry columns

@@ -65,58 +64,37 @@ Each geometry column in the dataset MUST be included in the `columns` field abov

The Coordinate Reference System (CRS) is an optional parameter for each geometry column defined in GeoParquet format.

The CRS MUST be provided in
[PROJJSON](https://proj.org/specifications/projjson.html) format, which is a JSON encoding of
[WKT2:2019 / ISO-19162:2019](https://docs.opengeospatial.org/is/18-010r7/18-010r7.html),
which itself implements the model of
[OGC Topic 2: Referencing by coordinates abstract specification / ISO-19111:2019](http://docs.opengeospatial.org/as/18-005r4/18-005r4.html).
Apart from the difference of encodings, the semantics are intended to match
WKT2:2019, and a CRS in one encoding can generally be represented in the other.
The CRS MUST be provided in [PROJJSON](https://proj.org/specifications/projjson.html) format, which is a JSON encoding of [WKT2:2019 / ISO-19162:2019](https://docs.opengeospatial.org/is/18-010r7/18-010r7.html), which itself implements the model of [OGC Topic 2: Referencing by coordinates abstract specification / ISO-19111:2019](http://docs.opengeospatial.org/as/18-005r4/18-005r4.html). Apart from the difference of encodings, the semantics are intended to match WKT2:2019, and a CRS in one encoding can generally be represented in the other.
Collaborator Author

This file had a lot of inconsistent hard wrapped lines. I took the liberty of removing the line breaks assuming that we all can enable soft wrapping in our editors. In prose that will be getting multiple updates, I think hard wrapping makes a mess of the diffs.

Member

How does it make a mess of diffs? I tend to like line breaks because it makes the diff clearer to me - I don't have to look through a full paragraph of stuff for a little change, or to try to understand 10 different changes at once.

I don't feel super strongly if everyone else prefers no line breaks. But I do think if we have line breaks we should make them consistent, and use like a markdown linter to enforce it.

Collaborator Author

If you are only making very small changes, then diffs with hard wrapping can be nice to read. But if you are using hard wrapping, you typically have some line length limit, and when a line exceeds that limit, you rewrap all your lines. In this case, making changes that aren't very small results in diffs that are full of noise (due to the rewrapping).

Here are some examples.

A somewhat minor change in text that is initially unwrapped (GitHub highlights the relevant change, I think this is nice): tschaub/line-breaks@unwrapped...soft-change

A somewhat minor change in text that is hard wrapped at 80 columns (lots of noise, who knows what actually changed? I think this is not nice): tschaub/line-breaks@wrapped...hard-change

Both of those changes above are the same change.

Member

Gotcha. I think maybe GitHub is now better at the soft changes than it used to be. Don't feel strongly, so new way sounds good.

Collaborator Author

@cholmes - I discovered that the soft wrapping only applies to "prose" docs (markdown among them). So for non-prose docs (HTML even), it looks like old fashioned hard wrapping is still the way to go.

Collaborator

Another option is semantic line breaks ;) (https://sembr.org/)


If CRS is not provided, all coordinates in the geometries MUST use longitude, latitude based on the WGS84 datum,
and the default value is [OGC:CRS84](https://www.opengis.net/def/crs/OGC/1.3/CRS84) for CRS-aware implementations.
If CRS is not provided, all coordinates in the geometries MUST use longitude, latitude based on the WGS84 datum, and the default value is [OGC:CRS84](https://www.opengis.net/def/crs/OGC/1.3/CRS84) for CRS-aware implementations.

[OGC:CRS84](https://www.opengis.net/def/crs/OGC/1.3/CRS84) is equivalent to the well-known [EPSG:4326](https://epsg.org/crs_4326/WGS-84.html) but changes the axis from latitude-longitude to longitude-latitude.

Due to the large number of CRSes available and the difficulty of implementing all of them, we expect that a number of implementations will start without support for the optional `crs` field.
Users are recommended to store their data in longitude, latitude (OGC:CRS84 or not including the `crs` field) for it to work with the widest number of tools. Data that are more appropriately represented in particular projections may use an alternate coordinate reference system. We expect many tools will support alternate CRSes, but encourage users to check to ensure their chosen tool supports their chosen CRS.
Due to the large number of CRSes available and the difficulty of implementing all of them, we expect that a number of implementations will start without support for the optional `crs` field. Users are recommended to store their data in longitude, latitude (OGC:CRS84 or not including the `crs` field) for it to work with the widest number of tools. Data that are more appropriately represented in particular projections may use an alternate coordinate reference system. We expect many tools will support alternate CRSes, but encourage users to check to ensure their chosen tool supports their chosen CRS.

See below for additional details about representing or identifying OGC:CRS84.

The value of this key may be explicitly set to `null` to indicate that there is no CRS assigned
to this column (CRS is undefined or unknown).
The value of this key may be explicitly set to `null` to indicate that there is no CRS assigned to this column (CRS is undefined or unknown).
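
The three cases described above (key absent, key set to `null`, key set to a PROJJSON object) could be handled by a reader roughly as follows. This is a minimal sketch, not part of the specification: the function name `effective_crs` and the use of the string `"OGC:CRS84"` as a stand-in for the default CRS are assumptions made for illustration.

```python
# Stand-in identifier for the default CRS; a real reader would more likely
# substitute the full PROJJSON object for OGC:CRS84.
DEFAULT_CRS = "OGC:CRS84"


def effective_crs(column_meta: dict):
    """Return the CRS a reader should assume for one geometry column.

    - `crs` key absent  -> the OGC:CRS84 default (longitude, latitude, WGS84)
    - `crs` key is None -> CRS explicitly undefined or unknown
    - otherwise         -> the PROJJSON object stored in the metadata
    """
    if "crs" not in column_meta:
        return DEFAULT_CRS
    return column_meta["crs"]
```

The point of the sketch is that a missing key and an explicit `null` are different states and should not be collapsed into one.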

#### epoch

In a dynamic CRS, coordinates of a point on the surface of the Earth may
change with time. To be unambiguous, the coordinates must always be qualified
with the epoch at which they are valid.
In a dynamic CRS, coordinates of a point on the surface of the Earth may change with time. To be unambiguous, the coordinates must always be qualified with the epoch at which they are valid.

The optional `epoch` field allows to specify this in case the `crs` field
defines a a dynamic CRS. The coordinate epoch is expressed as a decimal year
(e.g. `2021.47`). Currently, this specification only supports an epoch per
column (and not per geometry).
The optional `epoch` field allows to specify this in case the `crs` field defines a a dynamic CRS. The coordinate epoch is expressed as a decimal year (e.g. `2021.47`). Currently, this specification only supports an epoch per column (and not per geometry).
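
As an illustration of the decimal-year form, the sketch below converts a calendar date into a fraction of the year elapsed. The specification text here does not prescribe how the decimal year is derived, so the day-based convention and the function name `decimal_year` are assumptions.

```python
from datetime import date


def decimal_year(d: date) -> float:
    """Convert a calendar date to a decimal year (fraction of the year elapsed)."""
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)
    return d.year + (d - start).days / (end - start).days


# A date in late June 2021 lands close to the `2021.47` example above.
print(round(decimal_year(date(2021, 6, 21)), 2))
```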

#### encoding

This is the binary format that the geometry is encoded in.
The string `"WKB"`, signifying Well Known Binary is the only current option, but future versions
of the spec may support alternative encodings. This SHOULD be the ["OpenGIS® Implementation Specification for Geographic information - Simple feature access - Part 1: Common architecture"](https://portal.ogc.org/files/?artifact_id=18241) WKB representation (using codes for 3D geometry types in the \[1001,1007\] range). This encoding is also consistent with the one defined in the ["ISO/IEC 13249-3:2016 (Information technology - Database languages - SQL multimedia and application packages - Part 3: Spatial)"](https://www.iso.org/standard/60343.html) standard.
This is the binary format that the geometry is encoded in. The string `"WKB"`, signifying Well Known Binary is the only current option, but future versions of the spec may support alternative encodings. This SHOULD be the ["OpenGIS® Implementation Specification for Geographic information - Simple feature access - Part 1: Common architecture"](https://portal.ogc.org/files/?artifact_id=18241) WKB representation (using codes for 3D geometry types in the \[1001,1007\] range). This encoding is also consistent with the one defined in the ["ISO/IEC 13249-3:2016 (Information technology - Database languages - SQL multimedia and application packages - Part 3: Spatial)"](https://www.iso.org/standard/60343.html) standard.

Note that the current version of the spec only allows for a subset of WKB: 2D or 3D geometries of the standard geometry types (the Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection geometry types). This means that M values or non-linear geometry types are not yet supported.
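
To make the type-code ranges concrete, here is a small sketch that reads a WKB header and rejects anything outside the 2D and 3D codes described above. It assumes ISO-style codes (base code plus 1000 for Z); the function name `wkb_geometry_type` and the output naming with a ` Z` suffix are illustrative choices.

```python
import struct

# Base WKB codes for the geometry types this version of the spec allows.
BASE_TYPES = {
    1: "Point", 2: "LineString", 3: "Polygon", 4: "MultiPoint",
    5: "MultiLineString", 6: "MultiPolygon", 7: "GeometryCollection",
}


def wkb_geometry_type(wkb: bytes) -> str:
    """Read the byte-order flag and type code from a WKB header."""
    byte_order = "<" if wkb[0] == 1 else ">"  # 1 = little endian, 0 = big endian
    (code,) = struct.unpack(byte_order + "I", wkb[1:5])
    dims, base = divmod(code, 1000)  # 0 -> 2D, 1 -> Z; 2 and 3 would be M and ZM
    if base not in BASE_TYPES or dims not in (0, 1):
        raise ValueError(f"unsupported WKB type code {code}")
    return BASE_TYPES[base] + (" Z" if dims == 1 else "")
```

Codes in the 2000 and 3000 ranges (M and ZM) raise an error, which matches the note that M values are not yet supported.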

#### Coordinate axis order

The axis order of the coordinates in WKB stored in a GeoParquet follows the de facto standard for axis order in WKB and is therefore always
(x, y) where x is easting or longitude and y is northing or latitude. This ordering explicitly overrides the axis order as specified in the CRS.
This follows the precedent of [GeoPackage](https://geopackage.org), see the [note in their spec](https://www.geopackage.org/spec130/#gpb_spec).
The axis order of the coordinates in WKB stored in a GeoParquet follows the de facto standard for axis order in WKB and is therefore always (x, y) where x is easting or longitude and y is northing or latitude. This ordering explicitly overrides the axis order as specified in the CRS. This follows the precedent of [GeoPackage](https://geopackage.org), see the [note in their spec](https://www.geopackage.org/spec130/#gpb_spec).

#### geometry_types

This field captures the geometry types of the geometries in the
column, when known. Accepted geometry types are: `"Point"`, `"LineString"`,
`"Polygon"`, `"MultiPoint"`, `"MultiLineString"`, `"MultiPolygon"`,
`"GeometryCollection"`.
This field captures the geometry types of the geometries in the column, when known. Accepted geometry types are: `"Point"`, `"LineString"`, `"Polygon"`, `"MultiPoint"`, `"MultiLineString"`, `"MultiPolygon"`, `"GeometryCollection"`.

In addition, the following rules are used:

@@ -125,11 +103,7 @@ In addition, the following rules are used:
- An empty array explicitly signals that the geometry types are not known.
- The geometry types in the list must be unique (e.g. `["Point", "Point"]` is not valid).

It is expected that this field is strictly correct. For
example, if having both polygons and multipolygons, it is not sufficient to
specify `["MultiPolygon"]`, but it is expected to specify
`["Polygon", "MultiPolygon"]`. Or if having 3D points, it is not sufficient to
specify `["Point"]`, but it is expected to list `["Point Z"]`.
It is expected that this field is strictly correct. For example, if having both polygons and multipolygons, it is not sufficient to specify `["MultiPolygon"]`, but it is expected to specify `["Polygon", "MultiPolygon"]`. Or if having 3D points, it is not sufficient to specify `["Point"]`, but it is expected to list `["Point Z"]`.
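
A writer could derive a strictly correct, duplicate-free `geometry_types` value from the per-row types along these lines. This is a sketch under assumptions: the helper name `collect_geometry_types` is hypothetical, and it assumes the ` Z` suffix shown for `"Point Z"` applies to every base type.

```python
BASE_TYPES = ("Point", "LineString", "Polygon", "MultiPoint",
              "MultiLineString", "MultiPolygon", "GeometryCollection")
# Assumption: every base type may carry the " Z" suffix shown for "Point Z".
ALLOWED = {t + suffix for t in BASE_TYPES for suffix in ("", " Z")}


def collect_geometry_types(row_types):
    """Build a `geometry_types` list from per-row type names such as "Point Z"."""
    found = set(row_types)  # the list must not contain duplicates
    unknown = found - ALLOWED
    if unknown:
        raise ValueError(f"unsupported geometry types: {sorted(unknown)}")
    return sorted(found)
```

For a column holding both polygons and multipolygons this returns both names, and an empty input yields the empty array that signals unknown types.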

#### orientation

@@ -149,25 +123,15 @@ This attribute indicates how to interpret the edges of the geometries: whether t

If no value is set, the default value to assume is `"planar"`.

Note if `edges` is `"spherical"` then it is RECOMMENDED that `orientation` is always ensured to be `"counterclockwise"`. If it is not set, it is not clear how polygons should be interpreted within spherical coordinate systems, which can lead to major analytical errors if interpreted incorrectly.
In this case, software will typically interpret the rings of a polygon such that it encloses at most half of the sphere (i.e. the smallest polygon of both ways it could be interpreted). But the specification itself does not make any guarantee about this.
Note if `edges` is `"spherical"` then it is RECOMMENDED that `orientation` is always ensured to be `"counterclockwise"`. If it is not set, it is not clear how polygons should be interpreted within spherical coordinate systems, which can lead to major analytical errors if interpreted incorrectly. In this case, software will typically interpret the rings of a polygon such that it encloses at most half of the sphere (i.e. the smallest polygon of both ways it could be interpreted). But the specification itself does not make any guarantee about this.

#### bbox

Bounding boxes are used to help define the spatial extent of each geometry column.
Implementations of this schema may choose to use those bounding boxes to filter
partitions (files) of a partitioned dataset.
Bounding boxes are used to help define the spatial extent of each geometry column. Implementations of this schema may choose to use those bounding boxes to filter partitions (files) of a partitioned dataset.

The bbox, if specified, MUST be encoded with an array representing the range of values for each dimension in the
geometry coordinates. For geometries in a geographic coordinate reference system, longitude and latitude values are
listed for the most southwesterly coordinate followed by values for the most northeasterly coordinate. This follows the
GeoJSON specification ([RFC 7946, section 5](https://tools.ietf.org/html/rfc7946#section-5)), which also describes how
to represent the bbox for a set of geometries that cross the antimeridian.
The bbox, if specified, MUST be encoded with an array representing the range of values for each dimension in the geometry coordinates. For geometries in a geographic coordinate reference system, longitude and latitude values are listed for the most southwesterly coordinate followed by values for the most northeasterly coordinate. This follows the GeoJSON specification ([RFC 7946, section 5](https://tools.ietf.org/html/rfc7946#section-5)), which also describes how to represent the bbox for a set of geometries that cross the antimeridian.

For non-geographic coordinate reference systems, the items in the bbox are minimum values for each dimension followed by
maximum values for each dimension. For example, given geometries that have coordinates with two dimensions, the bbox
would have the form `[<xmin>, <ymin>, <xmax>, <ymax>]`. For three dimensions, the bbox would have the form
`[<xmin>, <ymin>, <zmin>, <xmax>, <ymax>, <zmax>]`.
For non-geographic coordinate reference systems, the items in the bbox are minimum values for each dimension followed by maximum values for each dimension. For example, given geometries that have coordinates with two dimensions, the bbox would have the form `[<xmin>, <ymin>, <xmax>, <ymax>]`. For three dimensions, the bbox would have the form `[<xmin>, <ymin>, <zmin>, <xmax>, <ymax>, <zmax>]`.

The bbox values are in the same coordinate reference system as the geometry.
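
The minimum-then-maximum layout for two and three dimensions can be sketched as below. The function name `compute_bbox` is hypothetical, and the sketch deliberately ignores the antimeridian case for geographic coordinates, which RFC 7946 treats separately.

```python
def compute_bbox(coords):
    """Return [mins..., maxs...] for an iterable of equal-length coordinate tuples."""
    points = list(coords)
    if not points:
        raise ValueError("cannot compute a bbox for zero coordinates")
    columns = list(zip(*points))  # one tuple of values per dimension
    return [min(c) for c in columns] + [max(c) for c in columns]


print(compute_bbox([(0, 1), (2, -1)]))        # 2D: xmin, ymin, xmax, ymax
print(compute_bbox([(0, 1, 5), (2, -1, 3)]))  # 3D: xmin, ymin, zmin, xmax, ymax, zmax
```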

@@ -219,19 +183,13 @@ The PROJJSON object for OGC:CRS84 is:
}
```

For implementations that operate entirely with longitude, latitude coordinates
and are not CRS-aware or do not have easy access to CRS-aware libraries that can
fully parse PROJJSON, it may be possible to infer that coordinates conform to
the OGC:CRS84 CRS based on elements of the `crs` field. For simplicity, Javascript
object dot notation is used to refer to nested elements.
For implementations that operate entirely with longitude, latitude coordinates and are not CRS-aware or do not have easy access to CRS-aware libraries that can fully parse PROJJSON, it may be possible to infer that coordinates conform to the OGC:CRS84 CRS based on elements of the `crs` field. For simplicity, Javascript object dot notation is used to refer to nested elements.

The CRS is likely equivalent to OGC:CRS84 for a GeoParquet file if the `id` element is present:

* `id.authority` = `"OGC"` and `id.code` = `"CRS84"`
* `id.authority` = `"EPSG"` and `id.code` = `4326` (due to longitude, latitude ordering in this specification)

It is reasonable for implementations to require that one of the above `id`
elements are present and skip further tests to determine if the CRS is
functionally equivalent with OGC:CRS84.
It is reasonable for implementations to require that one of the above `id` elements are present and skip further tests to determine if the CRS is functionally equivalent with OGC:CRS84.

Note: EPSG:4326 and OGC:CRS84 are equivalent with respect to this specification because this specification specifically overrides the coordinate axis order in the `crs` to be longitude-latitude.
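
The `id`-based check described above might look like this for an implementation that is not CRS-aware. It is a minimal sketch: the function name `is_probably_crs84` is hypothetical, and it compares the EPSG code as the number `4326`, as written above, so a string code would not match.

```python
def is_probably_crs84(crs) -> bool:
    """Heuristic from the text above: inspect only the PROJJSON `id` element."""
    if not isinstance(crs, dict):
        return False
    ident = crs.get("id")
    if not isinstance(ident, dict):
        return False
    authority, code = ident.get("authority"), ident.get("code")
    return (authority == "OGC" and code == "CRS84") or (
        authority == "EPSG" and code == 4326
    )
```

A reader that only supports longitude, latitude data could accept a column when this returns `True` (or when the `crs` key is absent) and reject or warn otherwise.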
2 changes: 1 addition & 1 deletion format-specs/schema.json
@@ -7,7 +7,7 @@
"properties": {
"version": {
"type": "string",
"const": "0.5.0-dev"
"const": "1.0.0-dev"
},
"primary_column": {
"type": "string",