[SEDONA-132] Move some functions to a common module #647
Conversation
Started playing around with this and wanted to get some thoughts. Figured out some Scala magic to reduce the boilerplate needed for each Spark function. Only converted a few to get initial thoughts. Maybe we can agree on an initial set to start with; others could slowly convert existing ones, and any new ones get added in the common module.
.github/workflows/java.yml (outdated)

@@ -3,7 +3,7 @@ name: Scala and Java build
 on:
   push:
     branches:
-      - master
+      - '*'
This was just to get the CI to run before PR'ing it
@@ -0,0 +1,28 @@
package org.apache.sedona;
Wasn't sure if it should go in any deeper package than this. Definitely open to suggestions on a lot of the naming involved.
The package name should probably be org.apache.sedona.common.
It's safer to let each Maven module have its own package namespace; for instance, OSGi runtimes have a strict requirement on that.
Fair point, will update
python-adapter/pom.xml (outdated)
<dependency>
  <groupId>org.apache.sedona</groupId>
  <artifactId>sedona-common</artifactId>
  <version>${project.version}</version>
</dependency>
Figured out I needed to add this here, but I've never figured out where the fat jar behavior comes from.
@Kimahriman This sounds like a good idea to me! But there are way more functions that need to move to this sedona-common, such as spatial partitioning, the serializer, format readers... I believe this will take a lot of effort.
@netanel246 @Imbruced @yitao-li Any opinions?
.github/workflows/java.yml (outdated)

@@ -3,7 +3,7 @@ name: Scala and Java build
 on:
   push:
     branches:
-      - master
+      - '*'
Do we need to change that? :) What's the benefit?
Was just temporary so I could get the tests to run before making the PR, will change back
    return left.distance(right);
}

public static double ST_YMin(Geometry geometry) {
I like the idea of applying DRY to Sedona in these scenarios, but I don't know if that naming is right in this context. Maybe we should keep camelCase here, and only use ST_*-like naming in the SQL functions where we are forced to? WDYT?
I'm fine with that. Also curious about thoughts on package structure/naming
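As a hypothetical sketch of that split (class and method names invented for illustration; the real common module operates on JTS Geometry objects, but plain coordinates keep this self-contained): the common module uses plain camelCase names, while the SQL-facing layer keeps the ST_* convention and simply delegates.

```java
// Hypothetical illustration only, not the actual Sedona code.
class CommonFunctions {
    // camelCase name in the shared module
    static double distance(double x1, double y1, double x2, double y2) {
        return Math.hypot(x2 - x1, y2 - y1);
    }
}

class SqlBindings {
    // The SQL layer keeps the ST_* naming and just delegates,
    // so each engine binding stays a thin wrapper.
    static double ST_Distance(double x1, double y1, double x2, double y2) {
        return CommonFunctions.distance(x1, y1, x2, y2);
    }
}
```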
Yeah, I just wanted to get the conversation started and figure out how to break this up so it's not a massive effort for all the things in one PR, but maybe done piece by piece (and at least start any new things in the common setup).
Just my 2 cents. Nice work!
import org.locationtech.jts.geom.Geometry;

public class Functions {
    public static double ST_Distance(Geometry left, Geometry right) {
Should functions in common do null checks, or should that be done in the Spark and Flink bindings? I don't have a strong opinion, but clarifying the contract for functions in common would be helpful for contributors.
Making functions in common null safe would be safer, but each platform (currently Spark and Flink) has its own way of doing input validation, so the null check might have to be repeated there. In Spark that is done by overriding checkInputDataTypes in expressions. It's currently not done in Sedona but would be nice to add in the future; it would make error messages a lot more user friendly.
Good point, it's really a matter of safety vs performance if the engine has a better way to handle null checks. For example, it would be pretty trivial to make a Spark codegen wrapper for the functions that only checks for nulls if the input is actually nullable. It really just depends on whether you want to squeeze every last drop of performance out or not.
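One hedged sketch of the trade-off being discussed (the helper name is invented, not Sedona API): the common functions stay null-oblivious, and a generic wrapper in a binding layer adds the check once instead of repeating it in every function.

```java
import java.util.function.BiFunction;

// Hypothetical helper, not actual Sedona code: a binding layer could
// wrap the plain common functions with a single null check.
class NullSafe {
    static <T, R> R apply(BiFunction<T, T, R> f, T left, T right) {
        // SQL semantics: a null input yields a null result.
        if (left == null || right == null) {
            return null;
        }
        return f.apply(left, right);
    }
}
```

An engine with nullability metadata (like Spark's codegen) could skip this wrapper entirely for non-nullable inputs, which is the performance angle mentioned above.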
Doing it piece by piece is probably a good idea, but it would be nice to have a clear end goal. Spatial partitioning and the kryo serializer are not intertwined with other APIs and should be pretty easy to move.
How about limiting this PR to the "Functions" defined in Flink? And then make separate issues for other things Flink already has (predicates, constructors), and then other things can be done as they are used/added to Flink.
- Move to "common" package
- Don't use ST_ for common funcs

Force-pushed from 260ff2a to 9bb2d3e
I agree. Let's start with the functions in this PR. And then create other PRs for other functions.
Ok, I think I got all the Flink Functions. Rewriting the geohash logic in Java was a bit of a pain; I wish everything could just be in Scala 😅 I didn't try to change any tests, was just gonna let the flink/spark-sql tests keep checking the logic for now.
@Kimahriman Do you think we should release sedona-common as a separate module, or should it be automatically packaged in other modules?
So I was actually just playing around with that, now that I think I understand how the modules currently work. Basically every dependency is provided scope except for python-adapter, which has basically everything bundled inside of it. Is there a reason it's set up that way? Any reason not to change it to what I think is the more traditional approach of just having compile scope dependencies, except for the big stuff like Spark, Hadoop, and Flink (and geotools for license reasons)?

Accidentally hit close merge request, hah.
@Kimahriman Users were supposed to call Sedona modules individually; they can mix and match different modules. Each module does not have any compile scope dependencies, which is great for Scala/Java experts to easily manage dependencies. However, when @Imbruced initially designed the python-adapter for Sedona PySpark, he suggested that we should provide a fat jar solution for Python users, because people in the Python world are not familiar with this Maven packaging stuff. Therefore, we decided to put all dependencies (including all Sedona modules, JTS, GeoJson, excluding Geotools) into sedona-python-adapter. This python-adapter was supposed to be used only by Python users, but later we found that many Scala/Java/R users also suffer from this packaging stuff. So on the Sedona website, we recommend that newcomers use the python-adapter jar unless they are confident about the packaging stuff.
Yeah, that's what I ended up doing all the time: I just include the python-adapter and then it automatically includes all the dependencies I need, which is a little odd. In fact, python-adapter has things bundled inside of it and has compile scope dependencies on all the things, so you end up double-including all of the classes. From a Spark perspective (which is all I do, not Flink, and via Python, so not a Java/Scala/Maven expert by any means), I think the two main approaches to including dependencies are via

Would you be open to switching to that type of setup? Then the "common" module would just be another compile scope module that gets automatically included for whatever needs it.
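A hedged sketch of the scoping scheme being proposed (artifact IDs and property names are illustrative, not the actual PR diff): sedona-common is declared at the default compile scope so it is pulled in transitively, while heavyweight engine dependencies stay provided.

```xml
<!-- Illustrative only: a downstream Sedona module's pom might look like this. -->
<dependencies>
  <!-- Default scope is compile, so sedona-common is bundled transitively
       for whatever module (or user) depends on this one. -->
  <dependency>
    <groupId>org.apache.sedona</groupId>
    <artifactId>sedona-common</artifactId>
    <version>${project.version}</version>
  </dependency>
  <!-- Big engine dependencies remain provided: the runtime (e.g. a Spark
       cluster) supplies them, so they are not packaged. -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```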
@Kimahriman Yes, I think it will be good to make sedona-common a compile scope dependency of whatever package needs it, so users don't need to include it manually. As in this PR, sedona-core pom: Can you update the PR? Then I will merge it. Eventually I plan to release this PR in Sedona 1.3.0, not the next release 1.2.1; I think this PR is a rather big change to the project structure and is better not to put in a maintenance release like 1.2.1.
This broke the build for me because of the wrong/not matching version in:
Yep, this never got updated after the 1.2.1 release got cut, can you update it in your PR?
Did so, see 231f98b.
Did you read the Contributor Guide?
Is this PR related to a JIRA ticket?
What changes were proposed in this PR?
Begin creating a new module with common SQL functions across Spark and Flink
How was this patch tested?
Existing UTs
Did this PR include necessary documentation updates?