Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SEDONA-227] Python Serde Refactor #745

Merged
merged 16 commits into from
Jan 13, 2023

Conversation

douglasdennis
Copy link
Contributor

@douglasdennis douglasdennis commented Jan 5, 2023

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

A refactor of some of the python serialization/deserialization functions is proposed. This is due to the performance regression experienced from the change in serialization formats. Most of the functions were refactored to increase performance. Attempts were made to maintain readability of the code at the same time.

A significant pain point involves interacting with shapely geometry. Most of the shapely methods are too slow and involve repeated calls to native code. Once shapely 2.0 is adopted by Sedona and its users, it may be a good idea to revisit this code and attempt to utilize some of the libgeos connections that shapely exposes.

There are other issues in the current implementation of serde:

  1. Shapely does not respect an M coordinate yet, and so that is not supported in python at the moment.
  2. SRID is not supported in python for a similar reason. Part of Sedona's python library does add a "userData" field to shapely geometry classes at runtime. This could be utilized to store SRIDs and possibly M coordinates.
  3. This PR mostly avoids using the GeometryBuffer class. I don't believe it will be the fastest solution in the future. However, I think it's a nice class for specifically handling the GeometryCollection type.
  4. Should the serialization format allow for empty geometry within collections? They have no WKT representation, so I'm not sure that should be allowed.

Finally, here are performance results between current master and this branch:

############################################################################################
#     MASTER                                #              REFACTOR                        #
############################################################################################
short line serialize trial:                 #        short line serialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 3.1197412          #                Shapely: 1.6295438
                Sedona: 7.6139485           #                Sedona: 1.624361
                Factor: 1.440570551172642   #                Factor: -0.00318052205776856
                                            #
long line serialize trial:                  #        long line serialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 4.3532153          #                Shapely: 4.106511
                Sedona: 53.1161724          #                Sedona: 20.2585903
                Factor: 11.201595542494763  #                Factor: 3.9332852876809534
                                            #
point serialize trial:                      #        point serialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 4.5569092          #                Shapely: 4.1321635
                Sedona: 12.9271814          #                Sedona: 3.6287594
                Factor: 1.8368310257312128  #                Factor: -0.12182579416327549
                                            #
small polygon serialize trial:              #        small polygon serialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 1.9922023          #                Shapely: 1.6930909
                Sedona: 15.1673936          #                Sedona: 2.5668217
                Factor: 6.613380227499988   #                Factor: 0.5160566393688608
                                            #
large polygon serialize trial:              #        large polygon serialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 2.5921638          #                Shapely: 2.3090157
                Sedona: 27.4987078          #                Sedona: 2.8944101
                Factor: 9.608398975404254   #                Factor: 0.25352551738821005
                                            #
small multipoint serialize trial:           #        small multipoint serialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 0.0964883          #                Shapely: 0.0957348
                Sedona: 0.8657994           #                Sedona: 0.5249629
                Factor: 7.97310243832672    #                Factor: 4.483511742856307
                                            #
large multipoint serialize trial:           #        large multipoint serialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 0.2059924          #                Shapely: 0.2002637
                Sedona: 18.6648998          #                Sedona: 13.9681053
                Factor: 89.60965258912465   #                Factor: 68.74856301965858
                                            #
small multilinestring serialize trial:      #        small multilinestring serialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 0.1003702          #                Shapely: 0.0984753
                Sedona: 1.2611056           #                Sedona: 0.5051064
                Factor: 11.564542065274354  #                Factor: 4.129269979375539
                                            #
large multilinestring serialize trial:      #        large multilinestring serialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 0.1382676          #                Shapely: 0.1290179
                Sedona: 14.4627404          #                Sedona: 6.4109233
                Factor: 103.59963433226584  #                Factor: 48.69018485031922
                                            #
small multipolygon serialize trial:         #        small multipolygon serialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 0.1074735          #                Shapely: 0.107037
                Sedona: 2.5492284           #                Sedona: 1.1849261
                Factor: 22.71959971527865   #                Factor: 10.070247671365976
                                            #
large multipolygon serialize trial:         #        large multipolygon serialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 0.218638           #                Shapely: 0.2061548
                Sedona: 33.7390888          #                Sedona: 16.8197829
                Factor: 153.31484371426743  #                Factor: 80.58812164451179
                                            #
short line deserialize trial:               #        short line deserialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 2.6179912          #                Shapely: 2.3070263
                Sedona: 1.8462964           #                Sedona: 1.8466384
                Factor: -0.2947660022692208 #                Factor: -0.19955901673075854
                                            #
long line deserialize trial:                #        long line deserialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 3.1905663          #                Shapely: 3.0322167
                Sedona: 46.6516656          #                Sedona: 3.0514442
                Factor: 13.62175087851959   #                Factor: 0.006341070544199562
                                            #
point deserialize trial:                    #        point deserialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 5.8758644          #                Shapely: 5.7208507
                Sedona: 5.2898097           #                Sedona: 5.2560053
                Factor: -0.0997393166527124 #                Factor: -0.08125459383164815
                                            #
small polygon deserialize trial:            #        small polygon deserialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 2.4844221          #                Shapely: 2.3755022
                Sedona: 4.120936            #                Sedona: 3.0382661
                Factor: 0.6587100879516408  #                Factor: 0.27899948903436084
                                            #
large polygon deserialize trial:            #        large polygon deserialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 2.0467411          #                Shapely: 1.9710574
                Sedona: 20.173081           #                Sedona: 3.0417363
                Factor: 8.85619578362891    #                Factor: 0.5432002639801358
                                            #
small multipoint deserialize trial:         #        small multipoint deserialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 0.1267147          #                Shapely: 0.1221836
                Sedona: 0.4915774           #                Sedona: 0.4675109
                Factor: 2.8794031000349603  #                Factor: 2.8262982920784787
                                            #
large multipoint deserialize trial:         #        large multipoint deserialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 0.2152692          #                Shapely: 0.2037456
                Sedona: 10.9262919          #                Sedona: 11.1691675
                Factor: 49.75641057801116   #                Factor: 53.81918382531942
                                            #
small multilinestring deserialize trial:    #        small multilinestring deserialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 0.1260666          #                Shapely: 0.1310238
                Sedona: 0.4858702           #                Sedona: 0.5045361
                Factor: 2.8540755441964802  #                Factor: 2.8507210140447765
                                            #
large multilinestring deserialize trial:    #        large multilinestring deserialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 0.1432103          #                Shapely: 0.1417217
                Sedona: 5.63192             #                Sedona: 5.8515723
                Factor: 38.3262216474653    #                Factor: 40.28917660457079
                                            #
small multipolygon deserialize trial:       #        small multipolygon deserialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 0.1392047          #                Shapely: 0.12827
                Sedona: 1.3398053           #                Sedona: 1.1914238
                Factor: 8.624713102359332   #                Factor: 8.288405706712403
                                            #
large multipolygon deserialize trial:       #        large multipolygon deserialize trial:
        Total Time (seconds):               #            Total Time (seconds):
                Shapely: 0.2063552          #                Shapely: 0.2101143
                Sedona: 19.0151752          #                Sedona: 15.8966164
                Factor: 91.14778789194554   #                Factor: 74.65699431214344

Here is the python script I used to generate these results:

from sedona.utils.geometry_serde import serialize, deserialize
from shapely.geometry import LineString, Point, Polygon, MultiPoint, MultiLineString, MultiPolygon
from shapely.wkb import dumps, loads

import time

def run_serialize_trial(geom, number_iterations, name):
    print(f"{name} serialize trial:")

    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        dumps(geom)
    shapely_time = time.perf_counter_ns() - start_time

    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        serialize(geom)
    sedona_time = time.perf_counter_ns() - start_time

    print(f"\tTotal Time (seconds):")
    print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")

def run_deserialize_trial(geom, number_iterations, name):
    print(f"{name} deserialize trial:")

    shapely_serialized_geom = dumps(geom)
    sedona_serialized_geom = serialize(geom)

    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        loads(shapely_serialized_geom)
    shapely_time = time.perf_counter_ns() - start_time

    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        deserialize(sedona_serialized_geom)
    sedona_time = time.perf_counter_ns() - start_time

    print(f"\tTotal Time (seconds):")
    print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")

short_line_iterations = 200_000
short_line = LineString([(10.0, 10.0), (20.0, 20.0)])

long_line_iterations = 100_000
long_line = LineString([(float(n), float(n)) for n in range(1000)])

point_iterations = 500_000
point = Point(12.3, 45.6)

small_polygon_iterations = 200_000
small_polygon = Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])

large_polygon_iterations = 100_000
large_polygon = Polygon(
    [(0.0, float(n * 10)) for n in range(100)]
    + [(float(n * 10), 990.0) for n in range(100)]
    + [(990.0, float(n * 10)) for n in reversed(range(100))]
    + [(float(n * 10), 0.0) for n in reversed(range(100))]
)

small_multipoint_iterations = 10_000
small_multipoint = MultiPoint([(n, n) for n in range(3)])

large_multipoint_iterations = 10_000
large_multipoint = MultiPoint([(n, n) for n in range(100)])

small_multilinestring_iterations = 10_000
small_multilinestring = MultiLineString([[(10.0, 10.0), (20.0, 20.0)] for _ in range(3)])

large_multilinestring_iterations = 5_000
large_multilinestring = MultiLineString([[(10.0, 10.0), (20.0, 20.0)] for _ in range(100)])

small_multipolygon_iterations = 10_000
small_multipolygon = MultiPolygon([small_polygon for _ in range(3)])

large_multipolygon_iterations = 5_000
large_multipolygon = MultiPolygon([small_polygon for _ in range(100)])

run_serialize_trial(short_line, short_line_iterations, "short line")
run_serialize_trial(long_line, long_line_iterations, "long line")
run_serialize_trial(point, point_iterations, "point")
run_serialize_trial(small_polygon, small_polygon_iterations, "small polygon")
run_serialize_trial(large_polygon, large_polygon_iterations, "large polygon")
run_serialize_trial(small_multipoint, small_multipoint_iterations, "small multipoint")
run_serialize_trial(large_multipoint, large_multipoint_iterations, "large multipoint")
run_serialize_trial(small_multilinestring, small_multilinestring_iterations, "small multilinestring")
run_serialize_trial(large_multilinestring, large_multilinestring_iterations, "large multilinestring")
run_serialize_trial(small_multipolygon, small_multipolygon_iterations, "small multipolygon")
run_serialize_trial(large_multipolygon, large_multipolygon_iterations, "large multipolygon")

run_deserialize_trial(short_line, short_line_iterations, "short line")
run_deserialize_trial(long_line, long_line_iterations, "long line")
run_deserialize_trial(point, point_iterations, "point")
run_deserialize_trial(small_polygon, small_polygon_iterations, "small polygon")
run_deserialize_trial(large_polygon, large_polygon_iterations, "large polygon")
run_deserialize_trial(small_multipoint, small_multipoint_iterations, "small multipoint")
run_deserialize_trial(large_multipoint, large_multipoint_iterations, "large multipoint")
run_deserialize_trial(small_multilinestring, small_multilinestring_iterations, "small multilinestring")
run_deserialize_trial(large_multilinestring, large_multilinestring_iterations, "large multilinestring")
run_deserialize_trial(small_multipolygon, small_multipolygon_iterations, "small multipolygon")
run_deserialize_trial(large_multipolygon, large_multipolygon_iterations, "large multipolygon")

How was this patch tested?

A parameterized set of unit tests were added that test the serialization/deserialization round trip of a geometry object between python and spark.

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the docs.

@jiayuasu
Copy link
Member

jiayuasu commented Jan 5, 2023

@Kontinuation @Imbruced Please feel free to chime in with comments :-)

@jiayuasu jiayuasu added this to the sedona-1.4.0 milestone Jan 5, 2023
@Kontinuation
Copy link
Member

LGTM. The refactored implementation looks much more pythonic while having better performance. It is also a great move to remove the expensive has_z calls, as shapely currently does not support M dimension.

@douglasdennis
Copy link
Contributor Author

@Kontinuation What's your take on dropping support for empty geometry inside of multi-geometries in the new serialization format? They don't have representation in any other form so I'm not sure that supporting them provides any value.

@Kontinuation
Copy link
Member

Kontinuation commented Jan 6, 2023

@Kontinuation What's your take on dropping support for empty geometry inside of multi-geometries in the new serialization format? They don't have representation in any other form so I'm not sure that supporting them provides any value.

I think it is OK to ignore empty geometries inside geometry collections since it has no impact on the topological properties of the geometries, though it may change some structural properties of the geometry collection such as indices of geometry elements (ST_GeometryN) and number of geometries in the geometry collection (ST_NumGeometries).

If I understand it correctly, both WKT and WKB support empty geometries in geometry collections according to the OGC standard Simple Feature Access - Part 1: Common Architecture. The standard does not explicitly forbid empty geometries inside geometry collections, and these kinds of geometries could be represented according to the BNF of WKT:

<multilinestring tagged text> ::= multilinestring <multilinestring text>
<multilinestring text> ::= <empty set> | <left parent> <linestring text> {<comma> <linestring text>}* <right parent>
<linestring text> ::= <empty set> | <left paren> <point> {<comma> <point>}* <right paren>
<empty set> ::= EMPTY

A MultiLineString containing 2 empty LineStrings could be represented as MULTILINESTRING (EMPTY, EMPTY). Both JTS and Shapely could successfully parse it, though their behavior varies in various aspects:

  • JTS 1.19.0 parses it as a MultiLineString object containing 2 empty LineStrings.
  • Shapely 2.0 parses it as a MultiLineString object containing 2 empty LineStrings, though Shapely does not allow creating MultiLineString with empty components (MultiLineString([LineString(), LineString()]) raises an exception).
  • Shapely 1.8 parses it as a MultiLineString containing no geometries.

WKB may also represent MULTILINESTRING (EMPTY, EMPTY) as b'\x00\x00\x00\x00\x05\x00\x00\x00\x02\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00', though some implementation failed to parse it.

  • JTS 1.19.0 raised a ParseException when parsing it.
  • Shapely 2.0 and Shapely 1.8 parsed it without a problem. Shapely 2.0 parsed it as a MultiLineString containing 2 empty LineStrings while Shapely 1.8 parsed it as an empty MultiLineString.

Based on the above observations, I prefer supporting empty geometries inside geometry collections to preserve their structural properties, since it may already be supported by various popular geometry representations, and users won't be surprised by accidental geometry structure changes.

@douglasdennis
Copy link
Contributor Author

I stand corrected :) I should have been a little more thorough. I think empty geometries in collections should be supported then, since the standards expressly provide for them.

@Imbruced
Copy link
Member

Imbruced commented Jan 8, 2023

Is the shapely section refering to wkb.loads and wkb.dumps methods ? What do you think about letting user to chose if the Z/M support should be turned on ? (I guess that majority of use cases are about 2D data). At the end we need sth low level written I am wondering if we can create some PR's in the shapely project ?

if coord_type == CoordinateType.XYM or coord_type == CoordinateType.XYZM:
raise NotImplementedError("XYM or XYZM coordinates are not supported")

if num_coords < 50:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that some magical number ? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I wrestled a leprechaun for it :)

In reality, it's roughly where the performances of the two approaches crossed each other on my system so it's fairly arbitrary. In my experience geometry either has a low number of points or a lot. Unfortunately, neither approach here performs well at both scales. So, 50 seemed like a reasonable "gonna call it here" kind of threshold.

I created a constant with a better name and am using that now.

@Imbruced
Copy link
Member

Imbruced commented Jan 8, 2023

Do you have a chance to compare the memory footprint between wkb.loads approach and the changes after 29.12 and the current version ?

@douglasdennis
Copy link
Contributor Author

Do you have a chance to compare the memory footprint between wkb.loads approach and the changes after 29.12 and the current version ?

I'd love too. Unfortunately, it won't be till tomorrow or early this next week.

@jiayuasu
Copy link
Member

@douglasdennis Is this PR ready to be merged? :-)

@douglasdennis
Copy link
Contributor Author

@Imbruced Took me a little longer but here are some memory profiling results.
TL;DR
Serialization Average Memory Increment Results (MiB):

Serializer Point LineString Polygon
WKB 7.22 13.82 13.76
Master 9.98 16.2 17.7
Refactor 7.1 13.6 14.98

Deserialization Average Memory Increment Results (MiB):

Deserializer Point LineString Polygon
WKB 28.5 39 46.8
Master 34.42 45.26 52.84
Refactor 34.8 45.26 52.8

Method:
I used the memory_profiler package to run trials using wkb.dumps and wkb.loads, this PR's refactored serialize and deserialize functions, and the current master branch's serialize and deserialize function. For each system of serde I ran trials operating on 100,000 Points, LineStrings with five vertices, and square Polygons. For each combination of serde and geometry I ran five trials and averaged the results. The script I used to run these trials is included below. Each script was ran using a command such as:
mprof run python python/mem_profile_serialize.py wkb point

This would give a printout like this:

mprof: Sampling memory every 0.1s
running new process
Filename: python/mem_profile_serialize.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    21     41.8 MiB     41.8 MiB           1   @profile
    22                                         def run_serialize_profile(serializer, geom):
    23     48.8 MiB      7.0 MiB      100003       x = [serializer(geom) for _ in range(100_000)]
    24     48.8 MiB      0.0 MiB           1       return x

I then averaged the "Increment" amount for line 23 across all five trials. Additionally, I plotted a representative run of the serialize operations (if you click on these you should get a bigger image):

Point LineString Polygon
WKB wkb_point wkb_linestring wkb_polygon
Master master_point master_linestring master_polygon
Refactor refactor_point refactor_linestring refactor_polygon

Lastly here is the serialization trial script:

import sys
from typing import List

from sedona.utils.geometry_serde import serialize
from shapely.geometry import Point, LineString, Polygon
from shapely.wkb import dumps

from memory_profiler import profile


def make_point():
    return Point(12.3, 45.6)

def make_linestring():
    return LineString([(n, n) for n in range(5)])

def make_polygon():
    return Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])


@profile
def run_serialize_profile(serializer, geom):
    x = [serializer(geom) for _ in range(100_000)]
    return x


def main(args: List[str]) -> int:
    if len(args) < 2:
        print("Usage: mem_profile_run.py <sedona or wkb> <point>")
        return 1
    serializer_type = args[0]
    geom_type = args[1]

    if geom_type == "point":
        geom = make_point()
    elif geom_type == "linestring":
        geom = make_linestring()
    elif geom_type == "polygon":
        geom = make_polygon()
    else:
        print(f"Geometry type is not supported: {geom_type}")
        return 1

    if serializer_type == "sedona":
        serializer = serialize
    elif serializer_type == "wkb":
        serializer = dumps
    else:
        print(f"Serializer type is not supported: {serializer_type}")
        return 1

    run_serialize_profile(serializer, geom)


if __name__ == "__main__":
    exit(main(sys.argv[1:]))

Deserialize trial script:

import sys
from typing import List

from sedona.utils.geometry_serde import serialize, deserialize
from shapely.geometry import Point, LineString, Polygon
from shapely.wkb import dumps, loads

from memory_profiler import profile


def make_point():
    return Point(12.3, 45.6)

def make_linestring():
    return LineString([(n, n) for n in range(5)])

def make_polygon():
    return Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])


@profile
def run_deserialize_profile(deserializer, geom):
    x = [deserializer(geom) for _ in range(100_000)]
    return x


def main(args: List[str]) -> int:
    if len(args) < 2:
        print("Usage: mem_profile_run.py <sedona or wkb> <point>")
        return 1
    serializer_type = args[0]
    geom_type = args[1]

    if geom_type == "point":
        geom = make_point()
    elif geom_type == "linestring":
        geom = make_linestring()
    elif geom_type == "polygon":
        geom = make_polygon()
    else:
        print(f"Geometry type is not supported: {geom_type}")
        return 1

    if serializer_type == "sedona":
        geom = serialize(geom)
        deserializer = deserialize
    elif serializer_type == "wkb":
        geom = dumps(geom)
        deserializer = loads
    else:
        print(f"Serializer type is not supported: {serializer_type}")
        return 1

    run_deserialize_profile(deserializer, geom)


if __name__ == "__main__":
    exit(main(sys.argv[1:]))

@douglasdennis
Copy link
Contributor Author

@jiayuasu I feel good about this PR. If the others do as well then I think it can merge. This code will probably have to be returned to at a later date once shapely decides how it wants to handle M coordinates and once Shapely < 2.0 is no longer supported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants