Readme enhancements #19
Conversation
Some of these sections can break out to their own pages in the future, but we don't have tons of content so just putting it all in here.
Thanks! Added a bunch of questions/comments
> There are a few core goals driving the initial development.
>
> * **Enable interoperability among cloud data warehouses** - BigQuery, Snowflake, Redshift and others all support spatial operations, but importing and exporting data with existing formats can be problematic. All support and often recommend Parquet, so defining a solid GeoParquet can help enable interoperability.
I certainly understand the importance of those cloud data warehouses, but personally I think interoperability goes beyond that. For example, we are already using GeoParquet as a fast and interoperable format for Python and R users
Or, put differently, we are maybe missing a goal here. One of the main reasons we started using this in GeoPandas, is to have fast/efficient, columnar file format to store geospatial vector data (in addition to the traditional shapefile / geopackage / geojson). Or maybe that doesn't need to be listed explicitly? (something like "Enable the Parquet file format to store geospatial data" is already covered by the very first paragraph of this README?)
I think it'd be a good goal to lay that out explicitly, like as the first goal, and make the cloud data warehouse point build on it.
Thinking something like:
- Build a fast/efficient, columnar file format to store geospatial vector data to enable XXXX
I can use help on what it enables...
Ok, made an attempt at this. Ended up splitting out 'goals' from 'features', and made a stab at explaining it. I've not been deep in the new columnar workflows, so any tweaks to better explain the potential are more than welcome.
> Build a fast/efficient, columnar file format to store geospatial vector data to enable XXXX
>
> I can use help on what it enables...
I think in the first place, it enables "fast/efficient data access", I would say
(and then also other features from Parquet, like very good compression (so small file sizes), cheap reading of a subset of columns (the columnar nature), the type system (e.g. nested types), filtering chunks based on column statistics, ...)
But taking a look at your updates now!
Ah, that's cool. Some geospatial people love really complex data structures, GML went nuts with that stuff. I think we should definitely focus on just the simple features use case and do that really well, but good to know that more complex data structures can be supported.
https://databricks.com/glossary/what-is-parquet is a great overview - I think we should include a link to it somewhere in the readme. It includes a decent overview of columnar formats. I'll add it in.
I think the nested types are mostly relevant when talking about an arrow-native geometry encoding. For example a nested list of values would be able to describe a column of Points, while still having a favorable in-memory encoding (i.e. flat arrays underneath).
Very cool. So we'll leave them off here for now, but maybe mention in the future.
We can indeed use those nested data types to store geometries (the arrow-native encoding proposal), but just a note that there are ecosystems where they make heavy use of the nested schemas that Parquet enables (eg I think Spark supports this quite well; I am not super familiar with it, since nested columns are not really supported in python/pandas). For example, it can map nicely to structured data you might encounter in json files, logs, etc.
(this is just a clarification, agreed it is not that important to have it included in this PR)
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
* Added some 'features' from @jorisvandenbossche suggestions about parquet
* Made clear that it's not so good at dealing with lots of transactions on the data
* Added vis.gl to the list of where people are coming from
Thanks for all the updates, this is looking great!
About parquet and columnar data format advantages
I am going to merge this, so we have some more content on the repo landing page. More comments here are of course still welcome, we can always do follow-up PRs.
Fleshed out more in the readme.
Closes #10, closes #12, closes #15