Media type #115

Closed
m-mohr opened this issue Jul 1, 2022 · 19 comments · Fixed by #213
@m-mohr
Collaborator

m-mohr commented Jul 1, 2022

Hi there,
is there already an agreed media type (e.g. for usage in STAC)?

Related issue for parquet: https://issues.apache.org/jira/browse/PARQUET-1889

Maybe something like: application/vnd.geo+apache.parquet or application/geo+vnd.apache.parquet?

@jorisvandenbossche
Collaborator

I suppose having a registered MIME type for Parquet itself (the issue you linked) would be a required first step (since a geoparquet file is technically just a parquet file).

@m-mohr
Collaborator Author

m-mohr commented Jul 12, 2022

Yes, that's true. But on the other hand if it takes too long to wait for it in Apache, the community will just come up with their own definition.

Microsoft Planetary Computer is currently using application/x-parquet, which I'll also adopt for now.

@TomAugspurger
Collaborator

It would be good to establish a convention that differentiates between parquet and geoparquet. I'm happy to update the media types in the Planetary Computer as needed.

I'm not familiar with the various components of a media type, but COG uses image/tiff; application=geotiff; profile=cloud-optimized. If Apache Parquet used application/apache.parquet would it be appropriate to use application/apache.parquet; application=geoparquet, since geoparquet files are valid parquet files? Maybe it'd be weird to have application in the media type twice.
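For readers equally unsure about the parts of a media type, here is a minimal sketch of how such a string breaks down into a `type/subtype` pair plus `name=value` parameters. The function name `parse_media_type` is made up for illustration, and the parsing is deliberately naive (it does not handle quoted values that contain semicolons).

```python
from typing import Dict, Tuple


def parse_media_type(value: str) -> Tuple[str, Dict[str, str]]:
    """Split a media type string into (type/subtype, parameters).

    Hypothetical helper for illustration; not a full RFC-compliant parser.
    """
    base, *raw_params = [part.strip() for part in value.split(";")]
    params: Dict[str, str] = {}
    for raw in raw_params:
        if not raw:
            continue  # tolerate a trailing semicolon
        name, _, val = raw.partition("=")
        # Parameter names are case-insensitive, so normalize them.
        params[name.strip().lower()] = val.strip().strip('"')
    return base.lower(), params


cog = "image/tiff; application=geotiff; profile=cloud-optimized"
base, params = parse_media_type(cog)
print(base)    # the type/subtype part
print(params)  # the parameters as a dict
```

In this reading, the leading `application` in `application/apache.parquet` is the top-level type, while `application=geoparquet` would be a parameter name, so the two occurrences sit in different slots of the string.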

@rouault
Contributor

rouault commented Aug 5, 2022

> If Apache Parquet used application/apache.parquet would it be appropriate to use application/apache.parquet; application=geoparquet, since geoparquet files are valid parquet files? Maybe it'd be weird to have application in the media type twice.

Not necessarily a helpful answer, but having followed the COG discussions a bit (and not being sure we managed to reach an endorsed conclusion), I find that understanding IANA rules for MIME types tends to require dedicated expertise. For example, https://www.rfc-editor.org/rfc/rfc6838.html#page-13 mentions "Media types MAY elect to use one or more media type parameters[...] the names, values, and meanings of any parameters MUST be fully specified when a media type is registered in the standards tree". So I guess that a provision for allowing application=geoparquet should be made at the time application/apache.parquet is registered (unless other rules just ban it or already make it possible...)

@m-mohr
Collaborator Author

m-mohr commented Aug 5, 2022

OGC contacted IANA to ask about adding parameters such as profile or application (to GeoTiff in this case) and they said that you can't simply "add" them if the original type is already registered. So you'd need to discuss that with Apache upfront or otherwise register your own, e.g. application/geo+apache.parquet (although I'm not sure they would use a . in the name...)

@cholmes cholmes added this to the 1.0.0-beta.2 milestone Oct 24, 2022
@cholmes
Member

cholmes commented Oct 24, 2022

Moving to beta.2. I think we should also try to get in touch with the core Parquet people and see if we can help them register something, even if it's just application/vnd.apache.parquet.

There was also general consensus on a recent call that we don't want a geo-specific parquet mimetype, we'd just use the general parquet one, and users would rely on the presence of geo metadata in the parquet file (or they could also guess that it'd be likely if there are shapefiles / geopackages of the same data). Happy for arguments on why we should have our own special geo one, and sounds like the time to do so is when apache applies for theirs. But I think we don't really want geospatial systems that distribute a non-geo parquet and a geoparquet. And we also hope that eventually all parquet readers would at least know to identify the standard geo data.

@m-mohr
Collaborator Author

m-mohr commented Oct 24, 2022

The same reasons why we want a COG- and GeoTiff-specific media type over just using "image/tiff" also apply here. If you need to read the file anyway (partially), then you could also just omit media types completely.
Example: Without specific media types, STAC Browser would try to visualize a 100GB non-GeoTiff TIFF on a map, which is not a nice user experience and would likely crash your browser. But by knowing it's a COG I can visualize just those and reject all the others.
So +1 on having at least a geo-specific parameter...

@cholmes
Member

cholmes commented Oct 25, 2022

I don't feel that strongly about this either way, except that an optional parameter (like application/parquet; profile=geo or whatever) seems better to me than the geo+parquet, since then a generic parquet reader would have a better chance of actually opening it.

But I do think the time is 'now' to decide what we want, since as pointed out above the only real chance IANA seems to give for optional parameters is on registration. We can likely help the Apache people with the process, since OGC has experience working with IANA, and we can also just point them to the form to do a vnd registration - you just fill out https://www.iana.org/form/media-types. It does seem like there's precedent for a 'project steering committee' to submit for the official IANA types, with Apache Arrow, Thrift and Node all being submitted by the steering committees or a member on them. But @ogcscotts can likely help navigate the process / talk to the right people.

So seems like we should determine which direction we want to go, and if we want an optional parameter we should determine what we'd like, before engaging with them.

@m-mohr
Collaborator Author

m-mohr commented Oct 25, 2022

I agree that having a parameter is better than the geo+ thing. So a parameter would be the best option, no strong preference on which to use though so profile=geo is fine.

@cholmes cholmes modified the milestones: 1.0.0-rc.1, 1.0.1 Jul 26, 2023
@kylebarron
Collaborator

As @jorisvandenbossche mentioned in the meeting today, there's currently progress on Parquet getting a MIME/Media type: https://issues.apache.org/jira/browse/PARQUET-1889

@m-mohr
Collaborator Author

m-mohr commented Feb 12, 2024

Great, so application/vnd.apache.parquet is the intention, so I'll update my implementations to support both application/vnd.apache.parquet and application/x-parquet for now. Do we still intend to add a geo profile or so for GeoParquet?
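A minimal sketch of what "support both" could look like in a client, assuming the check is a simple membership test on the type/subtype part. The names `PARQUET_MEDIA_TYPES` and `is_parquet` are hypothetical and not taken from any existing implementation.

```python
from typing import Optional

# The intended official type (per this thread) and the older unregistered name.
PARQUET_MEDIA_TYPES = (
    "application/vnd.apache.parquet",
    "application/x-parquet",
)


def is_parquet(media_type: Optional[str]) -> bool:
    """Return True if the media type names Parquet under either spelling.

    Any parameters (for example a possible future geo profile) are ignored,
    so a parameterized type would still match the generic check.
    """
    if not media_type:
        return False
    base = media_type.split(";", 1)[0].strip().lower()
    return base in PARQUET_MEDIA_TYPES


print(is_parquet("application/vnd.apache.parquet"))
print(is_parquet("application/x-parquet"))
print(is_parquet("image/tiff"))
```

Ignoring parameters here reflects the argument made earlier in the thread: a generic Parquet reader should still be able to open a file whose media type carries an extra parameter.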

@kylebarron
Collaborator

If @cholmes is right in his above comment, we'd have to register a geo profile along with the current parquet registration?

@m-mohr
Collaborator Author

m-mohr commented Feb 13, 2024

Yes we would, if we go the official route. But we never did that for COG either, and everyone just agreed on a de-facto standard of image/tiff; application=geotiff; profile=cloud-optimized

@cholmes
Member

cholmes commented Feb 13, 2024

> If @cholmes is right in his above comment, we'd have to register a geo profile along with the current parquet registration?

Yeah, if we want something listed in the official IANA then we'd need to do it. Like @m-mohr points out we can just add something on. With COG we wanted to get something registered but it basically wasn't an option.

So now is the time to try to advocate for it, if we want to. But I think we were leaning away from that, as mentioned in #115 (comment). I can try to bring it up at the next meeting, but if someone feels like we should push for a 'geo' profile then it'd be good to make the case here. I can't think of a use case where it's really essential, and it seems simpler for the file itself to be the place to figure out if it's geo or not. That also avoids the risk of a file being declared geo without actually being geo. And I don't see a case where it'd be good to have a non-geo parquet and a geoparquet version of the same file.

@m-mohr
Collaborator Author

m-mohr commented Feb 13, 2024

Same reason as for why COGs have a profile: A client can just detect easily what it is and whether it can render it without actually loading (parts of) the file. Think STAC Browser (and ol-stac, stac-layer, ...) for example... @cholmes

@tschaub
Collaborator

tschaub commented Feb 18, 2024

If a Parquet file is served with a media type that indicates that it is GeoParquet, a client cannot blindly try to render it, for example (the same is true for COG, despite what others may believe). Before deciding what to do with the contents of a Parquet file, a client would need to read the footer - this is true for geo and non-geo Parquet. After reading the footer, you can see if it has the geo metadata.

@m-mohr - can you provide more specific examples of what a client like STAC Browser would do if it knew that a Parquet file was GeoParquet? If the answer is that it would display the geo-specific metadata, then this is going to require reading the footer of the file - which you can safely do for a non-geo Parquet file as well (and you might want to do anyway to show the user something about the data).
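As a rough sketch of the "see if it has the geo metadata" step: GeoParquet stores its metadata as a JSON string under the `geo` key of the Parquet file-level key/value metadata. The helper below assumes that mapping has already been read from the footer by some Parquet reader (with pyarrow, for example, `pyarrow.parquet.read_metadata(path).metadata` returns such a bytes-to-bytes dict). The function name `read_geo_metadata` is made up for illustration.

```python
import json
from typing import Optional


def read_geo_metadata(key_value_metadata: dict) -> Optional[dict]:
    """Return the parsed "geo" metadata, or None for a non-geo Parquet file.

    `key_value_metadata` is assumed to be the footer's key/value metadata
    with bytes keys and bytes values, as obtained from a Parquet reader.
    """
    raw = key_value_metadata.get(b"geo")
    if raw is None:
        return None
    try:
        parsed = json.loads(raw)
    except ValueError:
        # Present but not valid JSON: treat as not usable geo metadata.
        return None
    return parsed if isinstance(parsed, dict) else None


# Example footer metadata, hand-written here instead of read from a file.
footer = {
    b"geo": b'{"version": "1.0.0", "primary_column": "geometry", "columns": {}}'
}
geo = read_geo_metadata(footer)
print(geo["primary_column"] if geo else "not GeoParquet")
```

The same call works unchanged on a non-geo Parquet footer and simply returns `None`, which matches the point that reading the footer is safe either way.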

@m-mohr
Collaborator Author

m-mohr commented Feb 22, 2024

It's all about giving users the nicest user experience without loading a whole lot of headers upfront. Thinking more about it, it might be more relevant for COGs than it is for GeoParquet files though.

Example:
Having a STAC file with 12 COGs, I'd like to display a default COG for visualization purposes. If I only had image/tiff or the geotiff equivalent as the media type, I'd need to read the headers of all 12 COG files to know which ones I might be able to render. With a specific media type, I know upfront whether I can or cannot render each of the 12 COG files. I know there are edge cases, but by default STAC Browser indeed tries to blindly render it, which works in many cases. But it could also be just to differentiate the buttons: "Download" or "Show on map". It's a small UX improvement.
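A minimal sketch of that kind of decision, using only the `type` field of each asset and no file reads. This is not how STAC Browser is actually implemented; the helper names and the sample assets are invented for illustration, and the only value taken from this thread is the de-facto COG media type.

```python
from typing import Optional

COG_MEDIA_TYPE = "image/tiff; application=geotiff; profile=cloud-optimized"


def normalize(media_type: str):
    """Lowercase the media type and make the parameter order irrelevant."""
    base, *params = [part.strip().lower() for part in media_type.split(";")]
    return base, frozenset(p for p in params if p)


def button_for(asset: dict) -> str:
    """Pick a button label from the declared media type alone."""
    declared = normalize(asset.get("type") or "")
    return "Show on map" if declared == normalize(COG_MEDIA_TYPE) else "Download"


def default_asset(assets: dict) -> Optional[str]:
    """Return the key of the first asset that is declared to be a COG."""
    for key, asset in assets.items():
        if button_for(asset) == "Show on map":
            return key
    return None


# Hypothetical assets of a STAC item.
assets = {
    "thumbnail": {"type": "image/png"},
    "scan": {"type": "image/tiff"},
    "visual": {"type": "image/tiff; profile=cloud-optimized; application=geotiff"},
}
print(default_asset(assets))
print({key: button_for(asset) for key, asset in assets.items()})
```

As noted elsewhere in the thread, the declared type is only a hint: a client that picks "Show on map" this way still has to handle a file that turns out not to be renderable.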

@cholmes
Member

cholmes commented May 6, 2024

Looks like there's now a parquet media type, see #115

Search 'parquet' on https://www.iana.org/assignments/media-types/media-types.xhtml

application/vnd.apache.parquet

I think they could have gotten application/parquet pretty easily, but this does seem consistent with the other apache ones.

I'm going to go ahead and make a PR without a 'geo' parquet media type - we can revisit and add it later if there is a lot of value.

I do wonder if there's a 'hint' we could give in STAC, for 'show on map'. I also do think it's not crazy to try to blindly render, as most parquet in STAC will likely be geoparquet.

@m-mohr
Collaborator Author

m-mohr commented May 6, 2024

I'd hope that in STAC people use the https://github.com/stac-extensions/table extension.
STAC Browser could also try to simply infer the "Show on map" from the table:primary_geometry.
So I'm not feeling as strongly anymore. So let's start with application/vnd.apache.parquet, indeed.
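A rough sketch of that inference, assuming `table:primary_geometry` may be present either on the asset or in the item properties (where exactly a client should look is an assumption here; check the table extension for the authoritative placement). The function name `can_show_on_map` is hypothetical.

```python
PARQUET_MEDIA_TYPE = "application/vnd.apache.parquet"


def can_show_on_map(asset: dict, item_properties: dict) -> bool:
    """Guess whether a Parquet asset is mappable without opening the file.

    Requires the generic Parquet media type plus a non-empty
    `table:primary_geometry` on the asset or, as a fallback, on the item.
    """
    base = (asset.get("type") or "").split(";", 1)[0].strip().lower()
    if base != PARQUET_MEDIA_TYPE:
        return False
    primary = asset.get("table:primary_geometry") or item_properties.get(
        "table:primary_geometry"
    )
    return bool(primary)


# Hypothetical item: the geometry column is declared in the item properties.
properties = {"table:primary_geometry": "geometry"}
asset = {"href": "data.parquet", "type": "application/vnd.apache.parquet"}
print(can_show_on_map(asset, properties))
print(can_show_on_map(asset, {}))
```

This keeps the media type generic and moves the "is it geo?" hint into STAC metadata, which is the direction the comment above settles on.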

cholmes added a commit that referenced this issue May 6, 2024
As discussed in #115
@m-mohr m-mohr linked a pull request May 6, 2024 that will close this issue
7 participants