
Recommendation on the Arrow specific type for the WKB geometry column ? #187

Closed
rouault opened this issue Oct 27, 2023 · 5 comments · Fixed by #190

@rouault
Contributor

rouault commented Oct 27, 2023

The GeoParquet spec rightly specifies the type of geometry columns in terms of the Parquet type: "Geometry columns MUST be stored using the BYTE_ARRAY parquet type".
Implementations using the Arrow library (typically from C++, but possibly from Python or other languages) might use the Arrow API for Parquet reading & writing and thus not be directly exposed to Parquet types.
I came to realize recently that the GDAL implementation would only work for Arrow::Type::BINARY, but not for Arrow::Type::LARGE_BINARY. This has been addressed, on the reading side of the driver, for GDAL 3.8.0 per OSGeo/gdal#8618.
I'm not entirely clear how the Arrow library maps a Parquet file containing a row group whose WKB column holds more than 2 GB of content. I assume that would be Arrow::Type::LARGE_BINARY?
So the question is whether the spec should give some hints to implementations using the Arrow library on:

  • which Arrow types they should expect on the reading side
  • which Arrow type(s) they should use on the writing side
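
A minimal pyarrow sketch of where this shows up in practice (the file name "example.parquet" and the column name "geometry" are assumptions for illustration, not from the spec):

```python
import pyarrow.parquet as pq

# Inspect which Arrow type the WKB column maps to on the reading side.
table = pq.read_table("example.parquet")
print(table.schema.field("geometry").type)  # binary or large_binary, depending on the writer
```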
@paleolimbot
Collaborator

which Arrow type(s) they should use on the writing side

I would say it is good practice to write row groups such that the WKB column has chunks that fit into a (non-large) binary array, although I think that is difficult to guarantee (@jorisvandenbossche would know better than I would). pyarrow prefers to chunk arrays that would contain more than 2 GB of content rather than return large binary arrays, but the R bindings don't.
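
A pyarrow sketch of that writing-side practice (the values and the 64 Ki rows-per-row-group figure are illustrative assumptions, not from the spec):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Keep row groups small enough that each chunk of the WKB column stays
# well under the 2 GB offset limit of the (non-large) binary type.
wkb_point = b"\x01\x01\x00\x00\x00" + b"\x00" * 16  # a tiny WKB point, for illustration
table = pa.table({"geometry": pa.array([wkb_point] * 1000, type=pa.binary())})
pq.write_table(table, "points.parquet", row_group_size=64 * 1024)
```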

which Arrow types they should expect on the reading side

I think it's possible to end up with binary, large binary, or fixed-size binary, all of which use a byte array in Parquet land. (I don't know if it is desirable to allow all three of those, but they are all ways to represent binary.) GeoArrow doesn't mention the fixed-size option in its spec and I think we're planning to keep it that way (or perhaps I'm forgetting a previous thread).

@hobu

hobu commented Oct 30, 2023

As another data point, I also only implemented arrow::Type::BINARY for PDAL, but I was unsure whether I should support all three possibilities.

@paleolimbot
Collaborator

In the absence of more informed guidance on how binary-like things get encoded in Parquet files (I'm somewhat new to this), I would wager that it is probably a good idea to support all three. It's a bit of a learning curve, but in theory arrow::type_traits ( https://github.com/apache/arrow/blob/main/cpp/src/arrow/type_traits.h#L334-L372 ) is there to help support all of them with mostly the same code.
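
The type_traits helpers are C++-specific; as a rough Python analogue of "support all three with mostly the same code", a reader could normalize whatever binary-like type it receives to plain binary before parsing. A sketch, assuming a cast is acceptable for the data sizes involved:

```python
import pyarrow as pa

def normalize_wkb_column(col: pa.ChunkedArray) -> pa.ChunkedArray:
    """Accept binary, large_binary, or fixed_size_binary; return binary."""
    if pa.types.is_binary(col.type):
        return col
    if pa.types.is_large_binary(col.type) or pa.types.is_fixed_size_binary(col.type):
        # The cast can fail if a single chunk exceeds the 2 GB binary limit.
        return col.cast(pa.binary())
    raise TypeError(f"unexpected type for WKB column: {col.type}")
```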

@jorisvandenbossche
Collaborator

On the reading side (when using the Arrow C++ library or its bindings), it depends on whether the file was written by Arrow or not (i.e. whether it includes a serialized Arrow schema in the Parquet FileMetadata, which you can also turn off with store_schema=False when writing):

  • If you have a plain Parquet file without Arrow schema information: a BYTE_ARRAY parquet column is always mapped to Arrow's binary type
  • If you have a Parquet file with Arrow schema information, then Arrow will cast the binary data to that type while reading, so in theory you can get any of the variable-size binary types (binary, large_binary, binary_view), depending on how the file was written (see the sketch below).
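
A pyarrow sketch of that difference (store_schema defaults to True in pyarrow; the single-byte value is a placeholder, not valid WKB):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# The same large_binary column round-trips differently depending on whether
# the Arrow schema is stored in the Parquet file metadata.
table = pa.table({"geometry": pa.array([b"\x00"], type=pa.large_binary())})

pq.write_table(table, "with_schema.parquet")  # store_schema=True by default
pq.write_table(table, "without_schema.parquet", store_schema=False)

print(pq.read_table("with_schema.parquet").column("geometry").type)     # large_binary
print(pq.read_table("without_schema.parquet").column("geometry").type)  # binary
```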

The fixed_size_binary Arrow type gets written as FIXED_LEN_BYTE_ARRAY on the Parquet side, so I think that should never come back when reading BYTE_ARRAY. Thus, I think we can clearly exclude the fixed size binary type for handling in applications like GDAL (given we clearly say that GeoParquet should use BYTE_ARRAY).

I'm not entirely clear how the Arrow library maps a Parquet file containing a row group whose WKB column holds more than 2 GB of content. I assume that would be Arrow::Type::LARGE_BINARY?

In the first bullet point above I said that this would always map to binary, not large_binary. That's based on looking at the code (and a code comment says this, though it might be outdated). I also tried to reproduce those cases, but I didn't get it to return large binary.
When reading a Parquet file with a big row group containing many large string values that would require large_binary, Arrow C++ seems to always read it in chunks, even when I set the batch size to a value greater than the number of rows (at that point it seems to have some internal maximum batch size or data size to process at once). So in practice, yes, it seems the Arrow C++ Parquet reader will always give you chunked data (see the sketch below).
When trying to create a Parquet file with one huge string that would require large_binary, I get the error "Parquet cannot store strings with size 2GB or more".
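
A sketch of how that chunking shows up through pyarrow (file name assumed; chunk boundaries are an implementation detail, not a guarantee):

```python
import pyarrow.parquet as pq

# Columns come back as ChunkedArray, so even one large row group may be
# split across several (non-large) binary chunks on read.
col = pq.read_table("example.parquet").column("geometry")
print(col.type, col.num_chunks, [len(c) for c in col.chunks])
```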


Now, if the Parquet file was written with the Arrow schema stored (e.g. using pyarrow, or using GDAL, which as I see it has that enabled), you will get back whatever Arrow schema the data was written with, and so in that case you can get other types than just binary. I think it is good for GDAL to handle large_binary (as you already do), and we can probably also recommend that in general for reading.

For writing, I agree with what Dewey wrote above: generally you should use a row group size such that binary will suffice in most cases (you would already need huge WKB values to reach the 2 GB limit with a few thousand rows).

@jorisvandenbossche changed the title from "Recommandation on the Arrow specific type for the WKB geometry column ?" to "Recommendation on the Arrow specific type for the WKB geometry column ?" on Nov 14, 2023
@rouault
Contributor Author

rouault commented Nov 17, 2023

I've tried to sum up the outcome of that discussion in #190
