Readme enhancements #19

Merged
jorisvandenbossche merged 12 commits into main from better-readme
Mar 4, 2022

Conversation

@cholmes (Member) commented Mar 2, 2022

Fleshed out more in the readme.

Closes #10, closes #12, closes #15

Some of these sections can break out to their own pages in the future, but we don't have tons of content so just putting it all in here.

@jorisvandenbossche (Collaborator) left a comment

Thanks! Added a bunch of questions/comments

README.md (outdated)
There are a few core goals driving the initial development.

* **Enable interoperability among cloud data warehouses** - BigQuery, Snowflake, Redshift and others all support spatial operations but importing and exporting data
with existing formats can be problematic. All support and often recommend Parquet, so defining a solid GeoParquet can help enable interoperability.

Collaborator

I certainly understand the importance of those cloud data warehouses, but personally I think interoperability goes beyond that. For example, we are already using GeoParquet as a fast and interoperable format for Python and R users.

Collaborator

Or, put differently, we are maybe missing a goal here. One of the main reasons we started using this in GeoPandas is to have a fast/efficient, columnar file format to store geospatial vector data (in addition to the traditional shapefile / geopackage / geojson). Or maybe that doesn't need to be listed explicitly? (Something like "Enable the Parquet file format to store geospatial data" is already covered by the very first paragraph of this README?)

Member Author

I think it'd be a good goal to lay that out explicitly, like as the first goal, and make the cloud data warehouse point build on it.

Thinking something like:

  • Build a fast/efficient, columnar file format to store geospatial vector data to enable XXXX

I can use help on what it enables...

Member Author

Ok, made an attempt at this. Ended up splitting out 'goals' from 'features', and made a stab at explaining it. I've not been deep in the new columnar workflows, so any tweaks to better explain the potential are more than welcome.

Collaborator

> Build a fast/efficient, columnar file format to store geospatial vector data to enable XXXX
>
> I can use help on what it enables...

I think in the first place it enables "fast/efficient data access", I would say
(and then also other features from Parquet, such as very good compression (so small file sizes), cheap reading of a subset of columns (the columnar nature), the type system (e.g. nested types), and filtering chunks based on column statistics).

But taking a look at your updates now!
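
To make two of the Parquet features above concrete (cheap reads of a subset of columns, and skipping chunks based on column statistics), here is a minimal toy sketch in plain Python. It is not the Parquet format or any real library API: the `write_row_groups` and `read` helpers, the `group_size` parameter, and the sample data are all hypothetical, and are only meant to show the idea.

```python
# Toy model of a columnar file: NOT the Parquet format or any real library API.
# Each "row group" stores every column as its own list, plus min/max statistics.

def write_row_groups(rows, columns, group_size):
    """Split rows (a list of dicts) into row groups stored column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        data = {name: [row[name] for row in chunk] for name in columns}
        stats = {name: (min(values), max(values)) for name, values in data.items()}
        groups.append({"data": data, "stats": stats})
    return groups


def read(groups, wanted_columns, filter_column, lower_bound):
    """Return the wanted columns for rows where filter_column >= lower_bound."""
    result = {name: [] for name in wanted_columns}
    groups_scanned = 0
    for group in groups:
        _, group_max = group["stats"][filter_column]
        if group_max < lower_bound:
            continue  # statistics prove no row can match: skip the whole chunk
        groups_scanned += 1
        values = group["data"][filter_column]
        keep = [i for i, value in enumerate(values) if value >= lower_bound]
        for name in wanted_columns:  # only the requested columns are touched
            column = group["data"][name]
            result[name].extend(column[i] for i in keep)
    return result, groups_scanned


# Hypothetical sample data, invented for this illustration.
rows = [{"id": i, "population": i * 100, "name": f"city-{i}"} for i in range(10)]
groups = write_row_groups(rows, ["id", "population", "name"], group_size=4)
selected, groups_scanned = read(groups, ["name"], "population", 800)
print(selected, groups_scanned)
```

In this sketch only one of the three row groups is scanned, and the `id` column is never read at all. A real Parquet reader works on encoded, compressed pages rather than Python lists, but the access pattern is the same in spirit.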

Member Author

Ah, that's cool. Some geospatial people love really complex data structures, GML went nuts with that stuff. I think we should definitely focus on just the simple features use case and do that really well, but good to know that more complex data structures can be supported.

Member Author

https://databricks.com/glossary/what-is-parquet is a great overview - I think we should include a link to it somewhere in the readme. It includes a decent overview of columnar formats. I'll add it in.

Collaborator

I think the nested types are mostly relevant when talking about an arrow-native geometry encoding. For example a nested list of values would be able to describe a column of Points, while still having a favorable in-memory encoding (i.e. flat arrays underneath).
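
As a rough illustration of "nested list on top of flat arrays", the sketch below stores points as one interleaved coordinate array and linestrings as that same kind of array plus an offsets array. This is only a plain-Python model of the general idea; the function names are hypothetical, and the layout is an assumption for illustration, not the arrow-native encoding proposal itself.

```python
# Illustration only: mimics how a list-typed column can sit on flat arrays.
# It is not the arrow-native geometry encoding proposal itself.

def encode_points(points):
    """Store [(x, y), ...] as one flat array of interleaved coordinates."""
    flat = []
    for x, y in points:
        flat.extend((x, y))
    return flat


def decode_point(flat, index):
    """Read point `index` back without touching the other points."""
    return (flat[2 * index], flat[2 * index + 1])


def encode_linestrings(linestrings):
    """Store variable-length lists of points as flat coords plus offsets."""
    coords, offsets = [], [0]
    for line in linestrings:
        coords.extend(encode_points(line))
        offsets.append(offsets[-1] + len(line))  # offsets count points, not floats
    return coords, offsets


def decode_linestring(coords, offsets, index):
    """Rebuild linestring `index` from its slice of the flat coordinate array."""
    start, stop = offsets[index], offsets[index + 1]
    return [decode_point(coords, i) for i in range(start, stop)]


# Hypothetical sample geometries, invented for this illustration.
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
flat_points = encode_points(points)

lines = [[(0.0, 0.0), (1.0, 1.0)], [(2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]]
line_coords, line_offsets = encode_linestrings(lines)
```

The nested structure is fully recoverable from the offsets, while the coordinates themselves stay in a single contiguous array, which is what makes this kind of layout friendly to vectorized processing.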

Member Author

Very cool. So we'll leave them off here for now, but maybe mention in the future.

Collaborator

We can indeed use those nested data types to store geometries (the arrow-native encoding proposal), but just a note that some ecosystems make heavy use of the nested schemas that Parquet enables (e.g. I think Spark supports this quite well; I am not super familiar with it, since nested columns are not really supported in Python/pandas). For example, it can map nicely to structured data you might encounter in JSON files, logs, etc.

(this is just a clarification, agreed it is not that important to have it included in this PR)

cholmes and others added 4 commits March 2, 2022 08:40
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
cholmes and others added 3 commits March 2, 2022 14:20
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
* Added some 'features' from @jorisvandenbossche suggestions about parquet
* Made clear that it's not so good at dealing with lots of transactions on the data
* added vis.gl to the list of where people are coming from
cholmes and others added 3 commits March 3, 2022 07:36
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

@jorisvandenbossche (Collaborator) left a comment

Thanks for all the updates, this is looking great!

About parquet and columnar data format advantages
@jorisvandenbossche
Collaborator

I am going to merge this, so we have some more content on the repo landing page. More comments here are of course still welcome, we can always do follow-up PRs.

@jorisvandenbossche jorisvandenbossche merged commit d8241a4 into main Mar 4, 2022
@jorisvandenbossche jorisvandenbossche deleted the better-readme branch March 4, 2022 10:05
Issues linked to this pull request:

* Make explicit that 0.1 only supports 2d
* Add 'goals'
* Flesh out readme