Add fast_float header library for fast string->double, and optimize Parquet WKT->WKB conversion #7690

Merged (9 commits) on May 23, 2023

@rouault commented May 4, 2023

Background for this work: opengeospatial/geoparquet#170

Several related items:

  • add a vendored copy of https://github.com/fastfloat/fast_float in third_party/ subdirectory.
  • add a port/include_fast_float.h that uses the system fast_float/fast_float.h if found (determined by __has_include("fast_float/fast_float.h") or by building explicitly with -DUSE_SYSTEM_FAST_FLOAT=1), or falls back to the vendored one
  • update the CPLStrtod() family of functions to use fast_float
  • Parquet: add a COORDINATE_PRECISION layer creation option to set the number of decimal digits written in WKT geometries (when using -lco GEOMETRY_ENCODING=WKT)
  • Parquet: when using the Arrow Array reader with the GEOMETRY_ENCODING=WKB option that forces geometries to be encoded in WKB, add an optimized WKT->WKB translator that avoids going through an intermediate OGRFeature, and add a very specific optimization for the translation of single-part single-ring multipolygons, as used by the "reference" https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet dataset, which contains 3.2 million features with 13 attributes each and polygons that generally have a small number of vertices each (< 10).
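
To illustrate the header-detection idea from the list above, here is a minimal sketch, not the actual content of port/include_fast_float.h. It assumes a hypothetical helper `SketchParseDouble` and a hypothetical macro `SKETCH_HAVE_FAST_FLOAT`. Unlike the PR, which falls back to the vendored copy, this sketch falls back to `strtod()` so that it compiles without fast_float installed.

```cpp
#include <cstdlib>
#include <string>
#include <system_error>

// Detect a visible fast_float header, similar in spirit to the PR's
// __has_include("fast_float/fast_float.h") check.
#if defined(__has_include)
#if __has_include("fast_float/fast_float.h")
#include "fast_float/fast_float.h"
#define SKETCH_HAVE_FAST_FLOAT 1  // hypothetical macro, not a GDAL name
#endif
#endif

// Hypothetical helper (not a GDAL function): parses a number at the start
// of `text` into `out`. Returns true if a number was parsed.
inline bool SketchParseDouble(const std::string &text, double &out)
{
    const char *begin = text.c_str();
#ifdef SKETCH_HAVE_FAST_FLOAT
    // fast_float::from_chars is locale-independent and does not skip
    // leading whitespace.
    const auto res = fast_float::from_chars(begin, begin + text.size(), out);
    return res.ec == std::errc() && res.ptr != begin;
#else
    // Assumption for this sketch only: fall back to strtod(), which is
    // locale-dependent and accepts leading whitespace.
    char *stop = nullptr;
    out = std::strtod(begin, &stop);
    return stop != begin;
#endif
}
```

Calling `SketchParseDouble("174.7633", v)` stores the parsed value in `v` whichever branch is compiled in; only the parsing speed and the locale behaviour differ.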

Benchmarks:

  • Creation of GeoParquet files (using Snappy compression):
| Variant of geoparquet | Timing | File size |
| --- | --- | --- |
| WKB | 12.2 s | 436 MB |
| WKT, 15 significant digits | 1m 36s | 666 MB |
| WKT, 4 decimals | 1m 27s | 474 MB |

This shows that generating WKT is very slow compared to WKB (I presume this could be optimized to be at least twice as fast)
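
The file-size gap between the two WKT variants comes from how many characters each coordinate takes. A rough sketch of the two formatting modes, assuming plain `snprintf()` with `%g` and `%f` (the helper names are hypothetical, and this is not necessarily how the Parquet driver formats coordinates):

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: format with a fixed number of significant digits
// (trailing zeros are dropped by %g).
inline std::string SketchFormatSignificant(double v, int digits)
{
    char buf[512];
    std::snprintf(buf, sizeof(buf), "%.*g", digits, v);
    return buf;
}

// Hypothetical helper: format with a fixed number of digits after the
// decimal point, as a COORDINATE_PRECISION-like setting would request.
inline std::string SketchFormatDecimals(double v, int decimals)
{
    char buf[512];
    std::snprintf(buf, sizeof(buf), "%.*f", decimals, v);
    return buf;
}
```

For a longitude such as 174.76331234567, the 15-significant-digit form keeps every digit, while the 4-decimal form shrinks to `174.7633`, which is where the smaller WKT file size comes from.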

  • Running bench_ogr_c_api, that is iterating over OGRFeature returned by OGRLayer::GetNextFeature()
| Variant of geoparquet | Timing |
| --- | --- |
| WKB | 5.9 s |
| WKT, 15 significant digits | 14.1 s |
| WKT, 4 decimals | 12.3 s |

The timings of the WKT parsing have been improved through the use of the fast_float library (e.g. 16.8 s before for "WKT, 4 decimals" dataset when using strtod())

  • Running bench_ogr_batch --stream-opt GEOMETRY_ENCODING=WKB, that is using the Arrow Array streaming mode, but forcing geometries to be WKB even if it is not their native encoding
| Variant of geoparquet | Timing |
| --- | --- |
| WKB | 0.96 s |
| WKT, 15 significant digits | 5.3 s |
| WKT, 4 decimals | 4.6 s |

The timings with the WKT files have been considerably improved by avoiding the round-trip through OGRFeature and by using fast_float (e.g. 25 s before for the "WKT, 4 decimals" dataset when going through OGRFeature and using strtod())
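
As an illustration of what a direct WKT->WKB translation can look like for the single-part single-ring multipolygon case, here is a simplified sketch. It is not the code added by this PR: the names in the `sketch` namespace are hypothetical, only 2D input of the exact form `MULTIPOLYGON (((x y,x y,...)))` is handled, and it uses `strtod()` rather than fast_float to stay self-contained. The output follows the standard little-endian WKB layout (MultiPolygon = type 6, Polygon = type 3).

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

namespace sketch {

// Append a 32-bit unsigned integer in little-endian byte order.
inline void AppendU32(std::vector<uint8_t> &out, uint32_t v)
{
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<uint8_t>((v >> (8 * i)) & 0xFF));
}

// Append an IEEE-754 double in little-endian byte order.
inline void AppendF64(std::vector<uint8_t> &out, double d)
{
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof(bits));
    for (int i = 0; i < 8; ++i)
        out.push_back(static_cast<uint8_t>((bits >> (8 * i)) & 0xFF));
}

// Converts a 2D one-polygon, one-ring MULTIPOLYGON WKT string directly to
// WKB, without building any geometry object. Returns false for any other
// input so that a caller could fall back to a general-purpose path.
inline bool SimpleMultiPolygonWktToWkb(const std::string &wkt,
                                       std::vector<uint8_t> &wkb)
{
    static const char prefix[] = "MULTIPOLYGON";
    const size_t prefixLen = sizeof(prefix) - 1;
    if (wkt.compare(0, prefixLen, prefix) != 0)
        return false;
    const char *p = wkt.c_str() + prefixLen;
    while (*p == ' ')
        ++p;
    if (std::strncmp(p, "(((", 3) != 0)
        return false;
    p += 3;

    std::vector<double> xy;  // x0, y0, x1, y1, ...
    for (;;)
    {
        char *next = nullptr;
        const double x = std::strtod(p, &next);
        if (next == p)
            return false;
        p = next;
        const double y = std::strtod(p, &next);
        if (next == p)
            return false;
        p = next;
        xy.push_back(x);
        xy.push_back(y);
        while (*p == ' ')
            ++p;
        if (*p != ',')
            break;
        ++p;  // skip the comma between vertices
    }
    // Anything other than the closing of one ring, one polygon and the
    // collection (extra rings, parts, Z values) is rejected here.
    if (std::strcmp(p, ")))") != 0)
        return false;

    wkb.clear();
    wkb.push_back(1);   // little-endian marker
    AppendU32(wkb, 6);  // wkbMultiPolygon
    AppendU32(wkb, 1);  // one polygon
    wkb.push_back(1);   // little-endian marker of the inner polygon
    AppendU32(wkb, 3);  // wkbPolygon
    AppendU32(wkb, 1);  // one ring
    AppendU32(wkb, static_cast<uint32_t>(xy.size() / 2));  // vertex count
    for (double v : xy)
        AppendF64(wkb, v);
    return true;
}

}  // namespace sketch
```

The point of the fast path is that the vertex list is the only variable-length part: the two headers are fixed-size, so for small polygons most of the time goes into parsing the numbers, which is where a faster string->double routine pays off.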

For those read benchmarks, we're probably close to the optimal performance

@rouault rouault added this to the 3.8.0 milestone May 4, 2023