
[SEDONA-166] Support type-safe dataframe style API #693

Merged · 20 commits · Oct 2, 2022

Conversation

douglasdennis (Contributor) commented Sep 21, 2022:

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

Introduce a type-safe DataFrame-style API similar to the standard Spark functions. The API can operate on Spark Column types directly or on function-specific native Scala/Java types. For example, the function ST_CollectionExtract can be called in the following ways:

val df = sparkSession.sql("SELECT ST_GeomFromWKT('GEOMETRYCOLLECTION(POINT(0 0), LINESTRING(0 0, 1 0))') AS geom")

// with Column objects and the default geomType argument
df.select(ST_CollectionExtract($"geom"))

// with a String to specify a column name
df.select(ST_CollectionExtract("geom"))

// using Column objects and specifying the geomType argument
df.select(ST_CollectionExtract($"geom", lit(1)))

// using a String for the column name and an Int for geomType
df.select(ST_CollectionExtract("geom", 1))

The general rule for the API is that methods have two overloaded forms (a sketch of both is shown below):

  1. All Column arguments, which is the most versatile form.
  2. Strings for arguments that are commonly column names, and native types for arguments that are commonly constants. For example, the ST_CollectionExtract geomType argument (when given) is generally constant across the dataframe.
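As a rough illustration of this convention (a hypothetical sketch, not the exact Sedona signatures), the two forms could be shaped like this, with the String/Int overload simply delegating to the all-Column overload:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

trait CollectionExtractSketch {
  // 1. all-Column form: every argument is a Column, the most versatile shape
  def ST_CollectionExtract(geom: Column, geomType: Column): Column

  // 2. convenience form: the column name as a String and the (usually constant)
  //    geomType as an Int, delegating to the all-Column form
  def ST_CollectionExtract(geom: String, geomType: Int): Column =
    ST_CollectionExtract(col(geom), lit(geomType))
}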

This API is made available to Scala, Java, and Python.

How was this patch tested?

Scala/Java is tested through unit tests in dataFrameAPITestScala.scala and Python is tested through unit tests in test_dataframe_api.py.

Did this PR include necessary documentation updates?

import org.apache.spark.sql.Column
import org.apache.spark.sql.sedona_sql.expressions_udaf.{ST_Envelope_Aggr => ST_Envelope_Aggr_udaf, ST_Intersection_Aggr => ST_Intersection_Aggr_udaf, ST_Union_Aggr => ST_Union_Aggr_udaf}

object st_aggregates extends DataFrameAPI {
douglasdennis (author) replied:

Are you asking whether the st_aggregates object is needed? If so, then I think it is. It allows for something like this:

df.groupBy().agg(ST_Union_Aggr("geometry").as("geometry"))

If that's not what you meant then apologies.

I didn't realize that UserDefinedAggregateFunction was deprecated, though. I was actually using this tutorial for reference to understand how this works on the Scala side: https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html

From what else I've read, it looks like I'm supposed to pass an Aggregator to the udaf function, so I'll plan to do that.

Member:

UserDefinedAggregateFunction has been deprecated since Spark 3.0, which suggests the type-safe Aggregator instead: https://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html#examples

So Sedona for Spark 3.0 uses the type-safe aggregator. Sedona for Spark 2.0 still uses UserDefinedAggregateFunction, which will be completely dropped in Sedona 1.3.0 (next release).

The URL you pasted actually uses the deprecated UserDefinedAggregateFunction as the example.

douglasdennis (author):

I'm shaking my fist at databricks right now...

This makes sense. I'll switch to the type-safe Aggregator classes and use Spark's udaf function so they can accept columns.
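For reference, the Spark 3 pattern being described is roughly the following; this is a minimal sketch with a plain Long sum standing in for Sedona's geometry aggregators, so the names and types are illustrative only:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.{col, udaf}

// Type-safe aggregator: Aggregator[IN, BUF, OUT]
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// functions.udaf wraps the Aggregator so it can be applied to Columns, e.g.
//   df.agg(longSum(col("value")))
val longSum = udaf(LongSum)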

jiayuasu changed the title from "Dataframe api" to "[SEDONA-166] Support type-safe dataframe style API" on Sep 22, 2022
jiayuasu added this to the sedona-1.3.0 milestone on Sep 22, 2022
@douglasdennis (Contributor, author):

This is pretty embarrassing. This was meant to be a PR against my own fork's master to get the build action to run. While most of these changes are approaching done (on the Scala side, anyway), they are still subject to change while I sort out the Python and Java interfaces. The cat is out of the bag, though, so please consider this a draft for the moment.

@jiayuasu (Member):

@douglasdennis Haha, no problem. This PR looks pretty nice. Keep up the good work!

@douglasdennis (Contributor, author):

@jiayuasu Thank you :) I should also mention that this is some of the first Scala I've written, so if there are any language faux pas or better ways to accomplish what I'm doing, let me know and I'll fix it.

@jiayuasu (Member) commented Sep 22, 2022:

@douglasdennis Another interesting finding: since the type-safe dataframe APIs do not need to call "udf.register()" to register all functions, is it possible that, as a side effect, predicate pushdown is finally supported in Sedona?

If you don't know what a predicate pushdown is, see this: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Optimizer-PushDownPredicate.html

Since Sedona 1.3.0 will natively read GeoParquet (the PR is already merged to master), Sedona user RJ Marcus is working hard to get predicate pushdown working on GeoParquet. See SEDONA-156.

Quoting my reply to RJ Marcus:

Pushed filter: UDF functions can NOT be pushed down as filters if they use udf.register(). Based on my understanding, Sedona ST functions cannot be pushed down because UDFs in pure Spark SQL are a black box to Spark Catalyst, unless we do something with the current Sedona ST functions.

However, Sedona implements all functions as Spark SQL Catalyst "Expressions" [1] instead of naive UDFs. This gives you the possibility of pushing them down to the data source (see [2]). There is an ongoing effort to expose Sedona ST functions in a type-safe format which bypasses the "udf.register" step (see [3]).

So, with the current Sedona GeoParquet reader and [3], it is possible that pushed filters will finally be supported. You might want to check it out and confirm my wild guess.

[1] https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Constructors.scala#L45

[2] https://neapowers.com/apache-spark/native-functions-catalyst-expressions/

[3] #693

@Kimahriman (Contributor):

@douglasdennis Another interesting finding: since the type-safe dataframe APIs do not need to call "udf.register()" to register all functions, is it possible that, as a side effect, predicate pushdown is finally supported in Sedona?

I believe the current mechanism technically doesn't use UDFs; it registers SQL functions directly, which should theoretically be usable by predicate pushdown (if you can figure out that mechanism in V1). That's why we can do things like join detection with pattern matching on the expressions.
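To make "pattern matching on the expressions" concrete, here is a hedged sketch (the helper below is hypothetical, not an actual Sedona rule; the package name is taken from the physical plans shown later in this thread) of how a plan can be inspected for Sedona expressions precisely because they are Catalyst Expressions rather than opaque registered UDFs:

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}

// True if the expression comes from Sedona's expression package (assumption:
// package name as printed in the plan output later in this thread).
def isSedonaExpression(e: Expression): Boolean =
  e.getClass.getName.startsWith("org.apache.spark.sql.sedona_sql.expressions")

// Walks the logical plan looking for a Filter whose condition contains a
// Sedona expression.
def hasSedonaFilter(plan: LogicalPlan): Boolean =
  plan.collectFirst {
    case Filter(condition, _) if condition.find(isSedonaExpression).isDefined => ()
  }.isDefined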

@douglasdennis (Contributor, author):

Another interesting finding: since the type-safe dataframe APIs do not need to call "udf.register()" to register all functions, is it possible that, as a side effect, predicate pushdown is finally supported in Sedona?

@jiayuasu It does not appear to be so. Here are some results I ran this morning using the example1.parquet in the library.

Checking to make sure predicate pushdown happens with native types:

val geoparquetdatalocation1: String = resourceFolder + "geoparquet/example1.parquet"
val basicPredicateDf = sparkSession.read.format("geoparquet").load(geoparquetdatalocation1).where(col("name").equalTo("Fiji"))
basicPredicateDf.explain()

The plan shows a push down:

== Physical Plan ==
*(1) Filter (isnotnull(name#5302) AND (name#5302 = Fiji))
+- FileScan geoparquet [pop_est#5300L,continent#5301,name#5302,iso_a3#5303,gdp_md_est#5304,geometry#5305] Batched: false, DataFilters: [isnotnull(name#5302), (name#5302 = Fiji)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:<removed>, PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,Fiji)], ReadSchema: struct<pop_est:bigint,continent:string,name:string,iso_a3:string,gdp_md_est:double,geometry:array...



A simple geometry based predicate:

val geoparquetdatalocation1: String = resourceFolder + "geoparquet/example1.parquet"
val basicGeomPredicateDf = sparkSession.read.format("geoparquet").load(geoparquetdatalocation1).where(ST_GeometryType("geometry").equalTo(lit("ST_Polygon")))
basicGeomPredicateDf.explain()

This plan does not show a push down:

== Physical Plan ==
Filter (st_geometrytype(geometry#5318) = ST_Polygon)
+- FileScan geoparquet [pop_est#5313L,continent#5314,name#5315,iso_a3#5316,gdp_md_est#5317,geometry#5318] Batched: false, DataFilters: [(st_geometrytype(geometry#5318) = ST_Polygon)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:<removed>, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<pop_est:bigint,continent:string,name:string,iso_a3:string,gdp_md_est:double,geometry:array...



And just to be thorough, a more complex geometry based predicate:

val geoparquetdatalocation1: String = resourceFolder + "geoparquet/example1.parquet"
val basicGeomPredicateDf = sparkSession.read.format("geoparquet").load(geoparquetdatalocation1).where(ST_Distance("geometry", ST_Point(2, 2)) <= (50.0))
basicGeomPredicateDf.explain()

As expected, no push down as well:

== Physical Plan ==
Filter (org.apache.spark.sql.sedona_sql.expressions.ST_Distance <= 50.0)
+- FileScan geoparquet [pop_est#5326L,continent#5327,name#5328,iso_a3#5329,gdp_md_est#5330,geometry#5331] Batched: false, DataFilters: [(org.apache.spark.sql.sedona_sql.expressions.ST_Distance <= 50.0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:<removed>, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<pop_est:bigint,continent:string,name:string,iso_a3:string,gdp_md_est:double,geometry:array...

@neontty commented Sep 22, 2022:

@douglasdennis this looks like an awesome PR, thank you for putting in this hard work.

I agree that this PR won't shortcut to geometry-based predicate pushdowns; the issue comes from handling Expressions/Predicates correctly during the FileScan, since a lot of Parquet code deals with "Filter" instead of Expression or Predicate. Thanks, Dennis, for doing this most recent test to confirm.

I'm a little busy today, but I'm taking a look at Jia's comments on SEDONA-156 and will write a response to continue the predicate-pushdown conversation in that thread.

@douglasdennis (Contributor, author):

@jiayuasu This should be, at least structurally, complete. I still have docs and docstrings to finish. A couple of questions:

  1. I was planning to add a new section to this page and this page demonstrating how to use this API. Does that seem like a good way to go about it? Or would you prefer something else?
  2. Should I add scaladoc to the Scala code?
  3. I found Sphinx-style docstrings in some of the Python code. Is that the style I should use in Python?

I believe that @Imbruced is the Python reviewer, so I'm pinging them for this as well.

Any comments, suggestions, or code changes needed please let me know.

elif isinstance(arg, str):
return f.col(arg)._jc
elif isinstance(arg, Iterable):
return f.array(*[Column(x) for x in map(_convert_argument_to_java_column, arg)])._jc
Member:

What's the advantage of using map and a list comprehension in one place instead of applying two functions in a list comprehension?

douglasdennis (author):

No advantage. Just me being silly when I refactored. I originally wanted this to use recursion in a different way, but I wasn't feeling it so I defaulted to this out of haste :) Will refactor to use function composition. Thanks for catching that.

douglasdennis (author):

This is fixed.

]


_call_constructor_function = partial(call_sedona_function, "st_constructors")
Member:

I like the idea of using partial :)

@Imbruced (Member):

Great improvement! The main idea is to provide type-safe functions, and what I am missing the most is validation of input types and checks against None values. I would also like to have test cases for that. WDYT? Maybe in another PR, because this one is massive :) Thanks for your effort.


class TestDataFrameAPI(TestBase):

def test_call_function_using_columns(self):
Member:

Do you think parametrised tests could help simplify these tests? I mean https://docs.pytest.org/en/6.2.x/parametrize.html

Member:

Then you could use an argument list, the function name to apply, and also some fixtures to provide the dataframe.

douglasdennis (author):

Ooohhhh! I have not used parameterized tests before. I'm excited to give them a shot.

douglasdennis (author):

Alright. I gave this a shot. Let me know if it isn't what you were hoping for.

Also: I don't know what the proper etiquette is on GitHub. Do I click the "Resolve Conversation" button when I think I've addressed a comment, or do the folks requesting code changes do that when they are satisfied?

@douglasdennis (Contributor, author) commented Sep 24, 2022:

Great improvement! The main idea is to provide type-safe functions, and what I am missing the most is validation of input types and checks against None values. I would also like to have test cases for that. WDYT? Maybe in another PR, because this one is massive :) Thanks for your effort.

Doh! I had meant to have input validation and completely forgot about it. I'd like to add it in this PR just for completeness. The JVM call will throw if it can't find a method that accepts the given argument types, but that error would be cryptic to the user. For reference, I intend to use a decorator to manage the type checking.

Oh, and the tests are a great idea; I'll add those as well.

@Imbruced (Member):

@douglasdennis Thanks! A decorator sounds great! It's good to have a Python exception raised, which can be more easily handled and debugged by users later :)

@jiayuasu (Member):

@douglasdennis Please let me know once you finish this PR :-)

@jiayuasu (Member) commented Oct 2, 2022:

@douglasdennis Is this PR ready to merge? :-)

@douglasdennis (Contributor, author):

@douglasdennis Is this PR ready to merge? :-)

Just working on a couple of the notes now. Will be done by the end of this weekend.

@douglasdennis (Contributor, author):

Victory!

@jiayuasu As best I know, this is ready to go, assuming I have addressed the notes from @Imbruced. I can also do a follow-up PR after this merges if that would be better.

@jiayuasu (Member) commented Oct 2, 2022:

This looks good to me. I will merge it for now as a few other PRs are waiting for this. If @Imbruced has any comments, @douglasdennis can make a follow-up PR to address it.
