Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to align with geoarrow spec #4

Merged
merged 7 commits into from
Oct 21, 2021
Merged

Conversation

cholmes
Copy link
Member

@cholmes cholmes commented Oct 19, 2021

This PR aims to align the geoparquet with the GeoArrow proposal. We hope to fully align, with this spec as the parquet instantiation of all the same geoarrow principles.

For now I tried to align all the names / structure. @TomAugspurger - would be great if you could check if I got it right. They use a JSON description, which is nice, but wouldn't actually be used in parquet right? So I think we should describe it without that, though might be nice to have some sort of 'example' that is just the tables as they'd be in parquet?

I did slim a couple things they have:

  • Removed library and version of the 'creator'. Mostly because I've not seen any precedence for this in geospatial software, so I think many implementors might be a bit confused, and may not keep it up to date, etc. But I do think it's a cool idea, so would be open to making it optional? Either way should give people guidance on what they put there.
  • Made CRS just wkt2 - I think the spec should be narrow on this, as not everybody will use the proj library. And then strongly recommended that it's always 4326.

I also changed geoparquet_version to be just 'version'. They had a 'schema_version' in the spec, but then just 'version' in the example. I liked just 'version'. Though maybe we should have a specific geoparquet version?

I also left out the bounding box stuff, though it does seem interesting. If we want to explain how it'd work for parquet partitioning use cases then I'd be for having it in.

Copy link
Collaborator

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

92be42b updated the example to match the new spec (with my other two comments).

I need to verify two things:

  1. The WKT CRS is different
GEOGCRS["WGS 84",ENSEMBLE["World Geodetic System 1984 ensemble",MEMBER["World Geodetic System 1984 (Transit)"],MEMBER["World Geodetic System 1984 (G730)"],MEMBER["World Geodetic System 1984 (G873)"],MEMBER["World Geodetic System 1984 (G1150)"],MEMBER["World Geodetic System 1984 (G1674)"],MEMBER["World Geodetic System 1984 (G1762)"],ELLIPSOID["WGS 84",6378137,298.257223563,LENGTHUNIT["metre",1]],ENSEMBLEACCURACY[2.0]],PRIMEM["Greenwich",0,ANGLEUNIT["degree",0.0174532925199433]],CS[ellipsoidal,2],AXIS["geodetic latitude (Lat)",north,ORDER[1],ANGLEUNIT["degree",0.0174532925199433]],AXIS["geodetic longitude (Lon)",east,ORDER[2],ANGLEUNIT["degree",0.0174532925199433]],USAGE[SCOPE["Horizontal component of 3D system."],AREA["World."],BBOX[-90,-180,90,180]],ID["EPSG",4326]]
  1. I need to confirm that this is actually written to the parquet metadata, and not just the pyarrow metadata (doesn't have to hold up this PR).

format-specs/geoparquet.md Outdated Show resolved Hide resolved
format-specs/geoparquet.md Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Collaborator

On the bounding boxes, there's a bit of motivation at https://github.com/geopandas/geo-arrow-spec/pull/2/files#diff-6cf20cd11635a4f1fc656313a8d9e2a14f62dc7b68d778deac688b4dd051ffe2R52.

@jorisvandenbossche, do you know if dask-geopandas is actually using bbox from the draft geo-arrow-spec? It seems like it's not. IIUC it's using a separate top-level key: https://github.com/geopandas/dask-geopandas/blob/393dcb558dfaa8bfbfc0cd98a8a52891b8046186/dask_geopandas/io/parquet.py#L142 (I don't really know how file-level metadata like this interacts with the parquet dataset and separate partitions of that dataset).

@TomAugspurger
Copy link
Collaborator

I made a mess of the git history, so we might want to squash this before merging, but the CRS example should be fixed now by using "simplified" WKT2.

@cholmes
Copy link
Member Author

cholmes commented Oct 21, 2021

I made a mess of the git history, so we might want to squash this before merging, but the CRS example should be fixed now by using "simplified" WKT2.

Squashing is still beyond my git skills, and I don't really care about the history, so I'm fine to just merge. If you want to squash then go for it, just let me know, as I'd like to merge in the next couple hours.

On the bounding boxes, there's a bit of motivation at https://github.com/geopandas/geo-arrow-spec/pull/2/files#diff-6cf20cd11635a4f1fc656313a8d9e2a14f62dc7b68d778deac688b4dd051ffe2R52.

Oh yeah, I read that, and think it's a cool feature. It just seems like instead of saying ' For example, Parquet supports partitions as individual files' we should actually explain how Parquet can make use of this feature. I'm way out of my depth here, so not sure how it works, but I'd see us putting up a partition example, and explaining how to actually set it up? Or is it all automatic? I assume there needs to be some guidance for how to use bbox's for the file partitions. So I just left it out since I can't explain it. If you want to take a crack at it that'd be awesome.

@TomAugspurger
Copy link
Collaborator

GitHub offers a "Squash and merge" option in the dropdown.

image

+1 to leaving out bbox stuff for now. I know that dask-geopandas is successfully doing something today to enable spatial queries on spatially-partitioned Parquet datasets, but I'll need to look into the details.

@TomAugspurger TomAugspurger merged commit 9699d88 into main Oct 21, 2021
@cholmes
Copy link
Member Author

cholmes commented Oct 21, 2021

GitHub offers a "Squash and merge" option in the dropdown.

Nice! I suppose I'll be able to do that in the future ;)

@jorisvandenbossche
Copy link
Collaborator

@jorisvandenbossche, do you know if dask-geopandas is actually using bbox from the draft geo-arrow-spec? It seems like it's not. IIUC it's using a separate top-level key: https://github.com/geopandas/dask-geopandas/blob/393dcb558dfaa8bfbfc0cd98a8a52891b8046186/dask_geopandas/io/parquet.py#L142 (I don't really know how file-level metadata like this interacts with the parquet dataset and separate partitions of that dataset).

Yes, we are using that (it's quite essential for having the efficient spatial partitioning with a partitioned parquet dataset). See here
https://github.com/geopandas/dask-geopandas/blob/393dcb558dfaa8bfbfc0cd98a8a52891b8046186/dask_geopandas/io/parquet.py#L50-L54 (the code line that you linked to is the aggregated bboxes from all files, that is passed through with the meta object to the dask_geopandas.GeoDataFrame)

With dask-geopandas, we can sort the data spatially, so each partition of the dask.dataframe is a part of the whole dataset that is spatially somewhat close to each other. When writing that to a partitioned parquet dataset, each partition of the dask.dataframe becomes a file in the parquet dataset, and we set the bbox information of the partition in that file. On reading this dataset again, this allows to only read those files needed when doing a spatial query.

@@ -5,6 +5,10 @@
The [Apache Parquet][parquet] provides a standardized open-source columnar storage format. This specification defines how geospatial data
should be stored in parquet format, including the representation of geometries and the required additional metadata.

## Version

This is version 0.1.0 of the geoparquest specification.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

geoparquest -> geoparquet

schema = (
table.schema.remove(idx)
.insert(idx, field)
table.schema
.with_metadata({"geoparquet": json.dumps(metadata)})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"geoparquet" -> "geo"? (the text below seems to been updated to change this)


#### crs

It is strongly recommended to use [EPSG:4326 (lat, long)](https://spatialreference.org/ref/epsg/4326/) for all data, so in most cases the value of the crs should be:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's explicitly mentioned here that EPSG:4326 is "lat, long", which is correct. But in practice most GIS systems use "lon, lat" to store geometries, and the same for common file formats like GeoJSON and GeoPackage (AFAIK, see eg opengeospatial/geopackage#540).

So I am assuming that the WKB representation will actually use "lon, lat" as well in this spec. In which case the explicit call out of "lat, long" is a bit confusing. But maybe my assumption is wrong, and this is something we need to discuss?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not 100% sure of the default CRS to use. If WKB is long, lat then I agree it makes sense to use crs84. @ghobona - what's the wkt2 for crs84? And is there an entry on spatialreference.org? It'd be nice to link to a web page and not just some big xml document.

@ghobona
Copy link
Contributor

ghobona commented Oct 21, 2021

EPSG:4326 is WGS84 with coordinates arranged as lat,lon

CRS84 is WGS84 with coordinates arranged as lon,lat

@ghobona
Copy link
Contributor

ghobona commented Oct 25, 2021

@cholmes If you are looking for an HTML rendering of the definition of the EPSG:4326 coordinate reference system, then please use this page website from the official IOGP website.

For any questions about the IOGP EPSG registry, please ask @RogerLott

@cholmes
Copy link
Member Author

cholmes commented Oct 25, 2021

@ghobona - well what I'm really looking for (I think) is a html page for CRS84. Is there an epsg code / web page for that?

@ghobona
Copy link
Contributor

ghobona commented Oct 25, 2021

@cholmes I have not come across a registration of CRS84 on the IOGP website.

The OGC Definitions Server does not currently offer HTML rendering of CRS definitions.

@RogerLott
Copy link

@cholmes
@ghobona is correct. EPSG does not support CRS84. There is no EPSG code for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants