[SEDONA-227] Implemented Python geometry serializer as a native extension #767

Merged: 2 commits into apache:master on Feb 15, 2023

@Kontinuation Kontinuation commented Feb 14, 2023

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

This patch provides a Python geometry serializer implemented as a native extension. The original pure-Python serializer is kept as a fallback implementation for when the native extension fails to load.
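
The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the module name `_hypothetical_native_geomserde`, the function names, and the 16-byte point encoding are all assumptions made up for the example.

```python
import importlib
import struct


def _pure_python_serialize_point(x, y):
    # Illustrative encoding only (two little-endian doubles);
    # this is NOT Sedona's actual serialization format.
    return struct.pack("<dd", x, y)


def load_serializer(native_module_name="_hypothetical_native_geomserde"):
    """Return (serialize_point, backend_name), preferring the native module."""
    try:
        # Hypothetical compiled extension; the import fails if no wheel
        # was installed and the extension could not be built locally.
        native = importlib.import_module(native_module_name)
    except ImportError:
        return _pure_python_serialize_point, "pure-python"
    return native.serialize_point, "native"


serialize_point, backend = load_serializer()
print(backend, len(serialize_point(1.0, 2.0)))
```

Selecting the backend once at import time keeps the per-geometry call path free of repeated availability checks.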

Please note that native extensions will complicate the release process for the apache-sedona Python packages, since we have to build wheels for various CPython versions and platforms. A newly added GitHub Action for building wheels serves as a reference approach.

For platforms not covered by prebuilt wheels, users can install the package from the source distribution, provided a C toolchain is available; the newly added extension does not require any third-party libraries to build.

How was this patch tested?

Multi-platform Compatibility

We've added a GitHub Action for testing the native extension on various platforms. We've also run the tests on Apple M1, so the extension is expected to work on other ARM64 platforms as well.

Performance

This is the result of running the benchmarking code from #745 on an ECS instance with an Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz. The Python environment running the benchmark has shapely 2.0.0 installed.

short line serialize trial:
	Total Time (seconds):
		Shapely: 2.815020045
		Sedona: 0.051830825
		Factor: -0.9815877598839621

long line serialize trial:
	Total Time (seconds):
		Shapely: 6.53576705
		Sedona: 0.164516065
		Factor: -0.9748283462765094

point serialize trial:
	Total Time (seconds):
		Shapely: 7.001998222
		Sedona: 0.141237015
		Factor: -0.9798290415789825

small polygon serialize trial:
	Total Time (seconds):
		Shapely: 2.740874172
		Sedona: 0.065661008
		Factor: -0.9760437714832828

large polygon serialize trial:
	Total Time (seconds):
		Shapely: 3.32382634
		Sedona: 0.082012368
		Factor: -0.9753259166963578

small multipoint serialize trial:
	Total Time (seconds):
		Shapely: 0.139677264
		Sedona: 0.004877519
		Factor: -0.9650800791745177

large multipoint serialize trial:
	Total Time (seconds):
		Shapely: 0.283785556
		Sedona: 0.087477905
		Factor: -0.6917464502668346

small multilinestring serialize trial:
	Total Time (seconds):
		Shapely: 0.145490235
		Sedona: 0.003940004
		Factor: -0.9729191172177294

large multilinestring serialize trial:
	Total Time (seconds):
		Shapely: 0.177767081
		Sedona: 0.032028902
		Factor: -0.8198265853282476

small multipolygon serialize trial:
	Total Time (seconds):
		Shapely: 0.151684481
		Sedona: 0.006536263
		Factor: -0.9569088218062334

large multipolygon serialize trial:
	Total Time (seconds):
		Shapely: 0.291512183
		Sedona: 0.070903274
		Factor: -0.756774234029183

short line deserialize trial:
	Total Time (seconds):
		Shapely: 1.127085322
		Sedona: 0.09852999
		Factor: -0.9125798304025824

long line deserialize trial:
	Total Time (seconds):
		Shapely: 3.146504444
		Sedona: 0.28110212
		Factor: -0.9106620934427639

point deserialize trial:
	Total Time (seconds):
		Shapely: 2.677137326
		Sedona: 0.206143454
		Factor: -0.9229985507288093

small polygon deserialize trial:
	Total Time (seconds):
		Shapely: 1.182006391
		Sedona: 0.138792694
		Factor: -0.8825787279520725

large polygon deserialize trial:
	Total Time (seconds):
		Shapely: 1.666326684
		Sedona: 0.157133878
		Factor: -0.9057004370698777

small multipoint deserialize trial:
	Total Time (seconds):
		Shapely: 0.058293033
		Sedona: 0.006954283
		Factor: -0.8807013009599277

large multipoint deserialize trial:
	Total Time (seconds):
		Shapely: 0.178866976
		Sedona: 0.067565497
		Factor: -0.6222584039213589

small multilinestring deserialize trial:
	Total Time (seconds):
		Shapely: 0.062594765
		Sedona: 0.00923537
		Factor: -0.852457789401398

large multilinestring deserialize trial:
	Total Time (seconds):
		Shapely: 0.144816222
		Sedona: 0.089089979
		Factor: -0.38480663443906166

small multipolygon deserialize trial:
	Total Time (seconds):
		Shapely: 0.070458039
		Sedona: 0.01415622
		Factor: -0.7990829690846207

large multipolygon deserialize trial:
	Total Time (seconds):
		Shapely: 0.242991092
		Sedona: 0.182560864
		Factor: -0.24869318254679065
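
The Factor values above are consistent with the relative change in total time, (Sedona - Shapely) / Shapely, so a value near -0.98 means roughly 98% less time. A minimal sketch, assuming that definition (the benchmark script from #745 is not reproduced here, and the function names are illustrative):

```python
def relative_factor(baseline_seconds, candidate_seconds):
    # Relative change in time; negative values mean the candidate is faster.
    return (candidate_seconds - baseline_seconds) / baseline_seconds


def speedup(baseline_seconds, candidate_seconds):
    # How many times faster the candidate is than the baseline.
    return baseline_seconds / candidate_seconds


# Numbers taken from the "short line serialize trial" above.
shapely_seconds = 2.815020045
sedona_seconds = 0.051830825

factor = relative_factor(shapely_seconds, sedona_seconds)
print(f"factor:  {factor:.4f}")
print(f"speedup: {speedup(shapely_seconds, sedona_seconds):.1f}x")
```

The two quantities are related by speedup = 1 / (1 + factor), which makes it easy to compare these figures with the "x" ratios in a table.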

Here is the benchmark result in a more comprehensive format (obtained by running rich-bench):

                                              Benchmarks, repeat=5, number=5
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃                           Benchmark ┃ Min     ┃ Max     ┃ Mean    ┃ Min (+)         ┃ Max (+)         ┃ Mean (+)        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│              serialize - short line │ 1.451   │ 1.483   │ 1.465   │ 0.029 (49.2x)   │ 0.036 (41.7x)   │ 0.032 (46.5x)   │
│               serialize - long line │ 3.159   │ 3.646   │ 3.329   │ 0.085 (37.3x)   │ 0.097 (37.4x)   │ 0.091 (36.7x)   │
│                   serialize - point │ 3.480   │ 3.529   │ 3.513   │ 0.080 (43.3x)   │ 0.082 (43.2x)   │ 0.081 (43.3x)   │
│           serialize - small polygon │ 1.425   │ 1.529   │ 1.487   │ 0.039 (36.3x)   │ 0.040 (38.6x)   │ 0.039 (37.7x)   │
│           serialize - large polygon │ 1.701   │ 1.757   │ 1.737   │ 0.047 (36.1x)   │ 0.049 (35.6x)   │ 0.048 (36.3x)   │
│        serialize - small multipoint │ 0.713   │ 0.763   │ 0.741   │ 0.027 (26.4x)   │ 0.028 (27.4x)   │ 0.027 (27.0x)   │
│        serialize - large multipoint │ 0.148   │ 0.148   │ 0.148   │ 0.044 (3.3x)    │ 0.045 (3.3x)    │ 0.044 (3.3x)    │
│   serialize - small multilinestring │ 0.742   │ 0.786   │ 0.761   │ 0.023 (32.7x)   │ 0.024 (32.8x)   │ 0.023 (33.0x)   │
│   serialize - large multilinestring │ 0.182   │ 0.189   │ 0.184   │ 0.031 (5.8x)    │ 0.031 (6.0x)    │ 0.031 (5.9x)    │
│      serialize - small multipolygon │ 0.778   │ 0.791   │ 0.786   │ 0.036 (21.9x)   │ 0.037 (21.4x)   │ 0.036 (21.9x)   │
│      serialize - large multipolygon │ 0.294   │ 0.311   │ 0.303   │ 0.074 (4.0x)    │ 0.074 (4.2x)    │ 0.074 (4.1x)    │
│            deserialize - short line │ 0.588   │ 0.611   │ 0.595   │ 0.058 (10.2x)   │ 0.066 (9.3x)    │ 0.061 (9.8x)    │
│             deserialize - long line │ 1.604   │ 1.640   │ 1.625   │ 0.141 (11.4x)   │ 0.145 (11.3x)   │ 0.143 (11.4x)   │
│                 deserialize - point │ 1.375   │ 1.397   │ 1.387   │ 0.123 (11.2x)   │ 0.177 (7.9x)    │ 0.135 (10.3x)   │
│         deserialize - small polygon │ 0.598   │ 0.606   │ 0.600   │ 0.074 (8.1x)    │ 0.075 (8.0x)    │ 0.074 (8.1x)    │
│         deserialize - large polygon │ 0.839   │ 0.854   │ 0.843   │ 0.080 (10.4x)   │ 0.082 (10.4x)   │ 0.081 (10.4x)   │
│      deserialize - small multipoint │ 0.298   │ 0.302   │ 0.299   │ 0.036 (8.2x)    │ 0.037 (8.2x)    │ 0.036 (8.2x)    │
│      deserialize - large multipoint │ 0.094   │ 0.095   │ 0.095   │ 0.036 (2.6x)    │ 0.036 (2.6x)    │ 0.036 (2.6x)    │
│ deserialize - small multilinestring │ 0.313   │ 0.330   │ 0.317   │ 0.047 (6.6x)    │ 0.049 (6.7x)    │ 0.048 (6.6x)    │
│ deserialize - large multilinestring │ 0.148   │ 0.149   │ 0.149   │ 0.093 (1.6x)    │ 0.093 (1.6x)    │ 0.093 (1.6x)    │
│    deserialize - small multipolygon │ 0.348   │ 0.350   │ 0.348   │ 0.070 (5.0x)    │ 0.072 (4.8x)    │ 0.070 (4.9x)    │
│    deserialize - large multipolygon │ 0.249   │ 0.250   │ 0.250   │ 0.173 (1.4x)    │ 0.189 (1.3x)    │ 0.176 (1.4x)    │
└─────────────────────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘
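
The "repeat=5, number=5" header suggests timeit-style measurement: each benchmark is timed over 5 calls, this is repeated 5 times, and Min/Max/Mean summarize the 5 totals. The "(+)" columns presumably hold the same statistics for the native serializer, with the ratio to the baseline in parentheses. A sketch using the standard library, assuming that interpretation; the two workloads are placeholders, not the serializers benchmarked here.

```python
import statistics
import timeit


def summarize(func, repeat=5, number=5):
    # Each entry in `totals` is the time for `number` consecutive calls.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return {"min": min(totals), "max": max(totals), "mean": statistics.mean(totals)}


def baseline_workload():
    # Placeholder for the slower implementation.
    return sorted(range(10_000, 0, -1))


def candidate_workload():
    # Placeholder for the faster implementation.
    return list(range(1, 10_001))


base = summarize(baseline_workload)
cand = summarize(candidate_workload)
for key in ("min", "max", "mean"):
    ratio = base[key] / cand[key]
    print(f"{key}: {base[key]:.4f}s vs {cand[key]:.4f}s ({ratio:.1f}x)")
```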

We've also run the benchmark with shapely 1.8.5. The performance improvement for deserialization is less pronounced there:

short line serialize trial:
	Total Time (seconds):
		Shapely: 2.213963099
		Sedona: 0.090474673
		Factor: -0.9591345162704539

long line serialize trial:
	Total Time (seconds):
		Shapely: 5.501387987
		Sedona: 0.577870113
		Factor: -0.8949592149534753

point serialize trial:
	Total Time (seconds):
		Shapely: 5.418881421
		Sedona: 0.260684318
		Factor: -0.9518933341132434

small polygon serialize trial:
	Total Time (seconds):
		Shapely: 2.252228822
		Sedona: 0.139868206
		Factor: -0.9378978704855594

large polygon serialize trial:
	Total Time (seconds):
		Shapely: 3.026419488
		Sedona: 0.274837045
		Factor: -0.9091873925310912

small multipoint serialize trial:
	Total Time (seconds):
		Shapely: 0.115843174
		Sedona: 0.007144615
		Factor: -0.9383251101182708

large multipoint serialize trial:
	Total Time (seconds):
		Shapely: 0.259107293
		Sedona: 0.0716183
		Factor: -0.7235959699521078

small multilinestring serialize trial:
	Total Time (seconds):
		Shapely: 0.119373969
		Sedona: 0.006568555
		Factor: -0.9449749802655887

large multilinestring serialize trial:
	Total Time (seconds):
		Shapely: 0.176183789
		Sedona: 0.030000746
		Factor: -0.8297190327766194

small multipolygon serialize trial:
	Total Time (seconds):
		Shapely: 0.126003143
		Sedona: 0.009911846
		Factor: -0.9213365177724178

large multipolygon serialize trial:
	Total Time (seconds):
		Shapely: 0.268871198
		Sedona: 0.079426367
		Factor: -0.7045932491437777

short line deserialize trial:
	Total Time (seconds):
		Shapely: 3.075912209
		Sedona: 0.909106639
		Factor: -0.7044432424501619

long line deserialize trial:
	Total Time (seconds):
		Shapely: 4.058073919
		Sedona: 0.691233541
		Factor: -0.8296646254363116

point deserialize trial:
	Total Time (seconds):
		Shapely: 7.242984392
		Sedona: 2.262892611
		Factor: -0.6875745564909123

small polygon deserialize trial:
	Total Time (seconds):
		Shapely: 3.110031226
		Sedona: 1.00743986
		Factor: -0.6760676061456368

large polygon deserialize trial:
	Total Time (seconds):
		Shapely: 2.658971574
		Sedona: 0.595152667
		Factor: -0.7761718580147559

small multipoint deserialize trial:
	Total Time (seconds):
		Shapely: 0.176477316
		Sedona: 0.04896167
		Factor: -0.7225611137467662

large multipoint deserialize trial:
	Total Time (seconds):
		Shapely: 0.265787445
		Sedona: 0.128952135
		Factor: -0.5148298483399019

small multilinestring deserialize trial:
	Total Time (seconds):
		Shapely: 0.24347195
		Sedona: 0.076274123
		Factor: -0.6867231605119194

large multilinestring deserialize trial:
	Total Time (seconds):
		Shapely: 0.276038444
		Sedona: 0.174092909
		Factor: -0.3693164384016018

small multipolygon deserialize trial:
	Total Time (seconds):
		Shapely: 0.255794987
		Sedona: 0.065465672
		Factor: -0.7440697616173377

large multipolygon deserialize trial:
	Total Time (seconds):
		Shapely: 0.277289757
		Sedona: 0.213857105
		Factor: -0.2287594489110537

rich-bench results for the same shapely 1.8.5 run:

                                              Benchmarks, repeat=5, number=5
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃                           Benchmark ┃ Min     ┃ Max     ┃ Mean    ┃ Min (+)         ┃ Max (+)         ┃ Mean (+)        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│              serialize - short line │ 1.162   │ 1.185   │ 1.169   │ 0.049 (23.7x)   │ 0.049 (24.0x)   │ 0.049 (23.8x)   │
│               serialize - long line │ 2.875   │ 2.880   │ 2.876   │ 0.291 (9.9x)    │ 0.293 (9.8x)    │ 0.292 (9.9x)    │
│                   serialize - point │ 2.883   │ 2.928   │ 2.902   │ 0.128 (22.5x)   │ 0.129 (22.7x)   │ 0.129 (22.6x)   │
│           serialize - small polygon │ 1.192   │ 1.212   │ 1.197   │ 0.062 (19.2x)   │ 0.063 (19.4x)   │ 0.062 (19.2x)   │
│           serialize - large polygon │ 1.536   │ 1.557   │ 1.541   │ 0.141 (10.9x)   │ 0.143 (10.9x)   │ 0.142 (10.8x)   │
│        serialize - small multipoint │ 0.612   │ 0.632   │ 0.621   │ 0.035 (17.4x)   │ 0.036 (17.6x)   │ 0.035 (17.5x)   │
│        serialize - large multipoint │ 0.134   │ 0.134   │ 0.134   │ 0.036 (3.7x)    │ 0.036 (3.7x)    │ 0.036 (3.7x)    │
│   serialize - small multilinestring │ 0.625   │ 0.640   │ 0.630   │ 0.032 (19.2x)   │ 0.033 (19.2x)   │ 0.033 (19.2x)   │
│   serialize - large multilinestring │ 0.174   │ 0.174   │ 0.174   │ 0.031 (5.6x)    │ 0.031 (5.6x)    │ 0.031 (5.6x)    │
│      serialize - small multipolygon │ 0.661   │ 0.665   │ 0.663   │ 0.048 (13.6x)   │ 0.066 (10.1x)   │ 0.052 (12.7x)   │
│      serialize - large multipolygon │ 0.275   │ 0.275   │ 0.275   │ 0.081 (3.4x)    │ 0.082 (3.4x)    │ 0.081 (3.4x)    │
│            deserialize - short line │ 1.508   │ 1.529   │ 1.516   │ 0.462 (3.3x)    │ 0.484 (3.2x)    │ 0.471 (3.2x)    │
│             deserialize - long line │ 2.025   │ 2.046   │ 2.033   │ 0.356 (5.7x)    │ 0.359 (5.7x)    │ 0.357 (5.7x)    │
│                 deserialize - point │ 3.714   │ 3.770   │ 3.735   │ 1.144 (3.2x)    │ 1.161 (3.2x)    │ 1.153 (3.2x)    │
│         deserialize - small polygon │ 1.545   │ 1.568   │ 1.555   │ 0.498 (3.1x)    │ 0.508 (3.1x)    │ 0.501 (3.1x)    │
│         deserialize - large polygon │ 1.410   │ 1.539   │ 1.470   │ 0.314 (4.5x)    │ 0.351 (4.4x)    │ 0.329 (4.5x)    │
│      deserialize - small multipoint │ 0.829   │ 0.872   │ 0.842   │ 0.256 (3.2x)    │ 0.267 (3.3x)    │ 0.261 (3.2x)    │
│      deserialize - large multipoint │ 0.139   │ 0.148   │ 0.142   │ 0.060 (2.3x)    │ 0.061 (2.4x)    │ 0.060 (2.4x)    │
│ deserialize - small multilinestring │ 0.849   │ 0.921   │ 0.865   │ 0.278 (3.1x)    │ 0.282 (3.3x)    │ 0.281 (3.1x)    │
│ deserialize - large multilinestring │ 0.194   │ 0.236   │ 0.205   │ 0.119 (1.6x)    │ 0.127 (1.9x)    │ 0.121 (1.7x)    │
│    deserialize - small multipolygon │ 0.872   │ 0.954   │ 0.890   │ 0.311 (2.8x)    │ 0.327 (2.9x)    │ 0.320 (2.8x)    │
│    deserialize - large multipolygon │ 0.289   │ 0.308   │ 0.296   │ 0.212 (1.4x)    │ 0.243 (1.3x)    │ 0.225 (1.3x)    │
└─────────────────────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API, so there is no need to change the docs.

@Kontinuation Kontinuation marked this pull request as ready for review February 14, 2023 03:18
@Kontinuation
Member Author

Kontinuation commented Feb 14, 2023

The GitHub Action for building wheels is unable to run. It seems that GitHub does not allow running cibuildwheel on ASF projects. A successful run of this action produces a downloadable package containing Python wheels; please refer to runs/4169899230 for the result of a successful run.

If using cibuildwheel is definitely not an option, I'll investigate other approaches.

@jiayuasu
Member

Just created https://issues.apache.org/jira/browse/INFRA-24203. Let's see if the ASF infra team will allow this GitHub Action.

@jiayuasu
Member

@douglasdennis @umartin Doug and Martin, since you were following the new serializer improvement, any comment on this PR?

@umartin
Contributor

umartin commented Feb 14, 2023

> @douglasdennis @umartin Doug and Martin, since you were following the new serializer improvement, any comment on this PR?

I haven't worked with C in 20 years, but from my understanding, everything looks good. I don't have any objections to merging.


/* The following are function pointers to GEOS C APIs provided by
* libgeos_c. These functions must be called after a successful invocation of
* `load_geos_c_functions` */
Contributor

I couldn't find this function. I probably missed it though.

Member Author

The function to initialize geos_c_dyn should be load_geos_c_library or load_geos_c_from_handle. I've fixed the comment.
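
The pattern under discussion, a table of function pointers that stays unusable until a loader fills it from a library handle, can be sketched in Python with ctypes. This is an analogy only: it resolves `Py_GetVersion` from the already-loaded CPython library (`ctypes.pythonapi`) instead of GEOS symbols from libgeos_c, and the class and method names are illustrative, not the identifiers used in the C code of this PR.

```python
import ctypes


class DynFuncs:
    """Holds function pointers that are valid only after a successful load."""

    def __init__(self):
        self.get_version = None  # unresolved until load_from_handle succeeds

    def load_from_handle(self, handle):
        # Resolve the symbol by name from an already-opened library handle.
        try:
            fn = getattr(handle, "Py_GetVersion")
        except AttributeError as exc:
            raise RuntimeError("symbol not found in library") from exc
        # Declare the C signature: const char *Py_GetVersion(void)
        fn.restype = ctypes.c_char_p
        fn.argtypes = []
        self.get_version = fn


funcs = DynFuncs()
assert funcs.get_version is None  # calling before loading would be a bug
funcs.load_from_handle(ctypes.pythonapi)  # CPython-only stand-in for a GEOS handle
print(funcs.get_version().decode())
```

Keeping the pointers in one object makes the "must load first" precondition explicit, which is the point the corrected comment documents.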

* libgeos_c. These functions must be called after a successful invocation of
* `load_geos_c_functions` */

#include "geos_c_dyn.h"
Contributor

These headers include each other. Is that due to some nuance of C?

Member Author

I removed the recursive inclusion since geos_c_dyn_funcs.h is not meant to be a self-contained header file.

@douglasdennis
Contributor

@Kontinuation Will this work out of the box with a pip install? Like, once this gets pushed to pypi, will I be able to just do pip install apache-sedona and I'll get the speeded up version?

@jiayuasu jiayuasu added this to the sedona-1.4.0 milestone Feb 15, 2023
@Kontinuation
Member Author

> @Kontinuation Will this work out of the box with a pip install? Like, once this gets pushed to pypi, will I be able to just do pip install apache-sedona and I'll get the speeded up version?

In most cases, yes. We will release precompiled wheels for commonly used platforms.

You'll still encounter problems on platforms not having matching wheel releases, such as Windows running on ARM64. If you do not have toolchains installed on your system, pip install won't be able to build the extension from the source release.
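
Whether pip finds a matching prebuilt wheel depends on the interpreter and the platform. A rough sketch, using only the standard library, of printing the values that matter; pip's actual matching uses more detailed compatibility tags, so treat this as a first diagnostic rather than the exact rule. The function name is illustrative.

```python
import platform
import sys
import sysconfig


def describe_build_target():
    # Values that roughly determine which wheels can be installed here.
    return {
        "implementation": sys.implementation.name,
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        "platform": sysconfig.get_platform(),
        "machine": platform.machine(),
    }


for name, value in describe_build_target().items():
    print(f"{name}: {value}")
```

If no wheel matches these values, pip falls back to the source distribution, which is where the compiler toolchain mentioned above becomes necessary.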

@douglasdennis
Contributor

> > @Kontinuation Will this work out of the box with a pip install? Like, once this gets pushed to pypi, will I be able to just do pip install apache-sedona and I'll get the speeded up version?
>
> In most cases, yes. We will release precompiled wheels for commonly used platforms.
>
> You'll still encounter problems on platforms not having matching wheel releases, such as Windows running on ARM64. If you do not have toolchains installed on your system, pip install won't be able to build the extension from the source release.

That's awesome! I'm excited to be able to use it. Thanks for putting so much work into this, this will all help me out quite a bit.

@jiayuasu jiayuasu merged commit abe41fa into apache:master Feb 15, 2023
@Kontinuation Kontinuation deleted the native-python-geom-serde branch August 23, 2023 15:02