Update README.md #53

Merged
merged 2 commits into from
Apr 18, 2022

Conversation

@jzb
Contributor

jzb commented Mar 26, 2022

Adding proper branding on first use ("Apache Parquet") and pointer to the project's overview rather than a vendor overview.

@cholmes
Member

cholmes commented Mar 27, 2022

Thanks for the PR!

The proper branding sounds great. For the second, I'd like to retain at least a link to the vendor overview, as I find it to be a much better explanation than the project's overview, and many GeoParquet users will not have heard of Parquet at all. But we can have both links, and explain that the second is an explanation from a vendor.

@cayetanobv
Collaborator

@cholmes @jzb I think it is better to keep both links:

  • Databricks' explanation is short and clear as a first read.
  • The official Parquet documents are the best reference for going deeper.

I would suggest the following change:

- see [what is parquet?](https://databricks.com/glossary/what-is-parquet) and [parquet overview](https://parquet.apache.org/docs/overview/) for more background.

What do you think?

@alasarr
Collaborator

alasarr commented Apr 11, 2022

I'm fine with @cayetanobv's proposal. Could you add it as a commit suggestion in the current PR?

@jzb
Contributor Author

jzb commented Apr 11, 2022

Hi all - sorry, responses were filtered.

I'd strongly prefer that we link to the project as the definitive source. If the Parquet project needs to improve their definition, then let's do that: the authoritative word on "what is parquet" should be from Parquet. The other link helps reinforce a vendor as the SEO authoritative resource on the topic. That's undesirable.

@cayetanobv
Collaborator

Hi @jzb. I think you are right in saying we should link to the Parquet website. But it's true that the official explanation from the Parquet project is not very good as an introduction.
@cholmes @alasarr @jorisvandenbossche I agree with @jzb's proposal, but it would be great if the Parquet project could have a friendlier introduction to this file format. Thanks for your help.

Member

@cholmes cholmes left a comment

Added a suggestion to clarify while also including the nice databricks explanation

@cholmes
Member

cholmes commented Apr 11, 2022

@jzb - see my latest commit suggestion. I couched the databricks one as a 'vendor explanation', so it doesn't SEO it more. But I believe we need to include a good explanation of that for our users who have never heard of parquet, hadoop or columnar data formats. The current 'overview' just says 'Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.' - that doesn't actually help explain much about it at all or its advantages to our users.

Once the official project has something that really explains it to new users we'd be more than happy to only link to that.

@jorisvandenbossche
Collaborator

Agreed that we should link to the official docs, and that we can help improve the explanation there. But as long as the explanation in the official docs is unclear (and actually just wrong, since it's not at all tied to Hadoop), I agree with the others that we need to keep some link to a better explanation. It's unfortunate that we need to link to a vendor, but if someone finds a more neutral post with a good explanation, we could use that instead. (At the time of the original PR that added this text, I searched for one, but didn't directly find anything else that would be fitting to link to.)

@kylebarron
Collaborator

kylebarron commented Apr 15, 2022

Interestingly, the Parquet website was recently updated! It looks like the website had been virtually unchanged from 2015 until last month. Now the homepage doesn't reference hadoop at all: https://parquet.apache.org

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...

The site is now in this repo: https://github.com/apache/parquet-site

@jorisvandenbossche
Collaborator

Ah, indeed, the main website has been updated with a much better explanation!
(It seems that after the launch of the new website last month, someone basically gave the same comment as we were making here.)

The docs link (https://parquet.apache.org/docs/overview/) still has the old explanation, so I would personally still not yet use that for a "what is parquet" link. But maybe the link to the main website is then sufficient?

@cholmes
Member

cholmes commented Apr 15, 2022

The main website is definitely much better now. The main bit that I still find lacking on the main site is 'what is a column-oriented data file format'? I'd be happy to link to another definition of that, but right now the databricks site does the best job I've seen of answering that question.

I do agree the apache parquet docs overview isn't great; it could make sense to just leave it out.

@jzb
Contributor Author

jzb commented Apr 17, 2022

I think it's fair to link to a better definition of "what is a column-oriented..." on the Databricks site & "what is Apache Parquet" linking to the Parquet homepage. Does that work?

@cholmes
Member

cholmes commented Apr 18, 2022

Ok, made an attempt to capture that.

@jzb
Contributor Author

jzb commented Apr 18, 2022

That works for me - thank you very much!

Co-authored-by: Chris Holmes <chomie@gmail.com>
@alasarr alasarr merged commit 034db4c into opengeospatial:main Apr 18, 2022
@cayetanobv
Collaborator

I think it's a good proposal. Thanks all!

@cholmes cholmes added this to the 0.2 milestone Apr 18, 2022
6 participants