[SEDONA-132] Move some functions to a common module #647
Conversation
Started playing around with this and wanted to get some thoughts. Figured out some Scala magic to reduce the boilerplate needed for each Spark function. Only converted a few to get initial thoughts. Maybe we can agree on an initial set to start with; others could slowly convert existing ones, and any new ones get added in the common module.
.github/workflows/java.yml (outdated)

@@ -3,7 +3,7 @@ name: Scala and Java build
 on:
   push:
     branches:
-      - master
+      - '*'
This was just to get the CI to run before PR'ing it
@@ -0,0 +1,28 @@
package org.apache.sedona;
Wasn't sure if it should go in any deeper package than this. Definitely open to suggestions on a lot of the naming involved.
The package name should probably be org.apache.sedona.common.
It's safer to let each Maven module have its own package namespace; for instance, OSGi runtimes have a strict requirement on that.
Fair point, will update
python-adapter/pom.xml (outdated)
<dependency>
  <groupId>org.apache.sedona</groupId>
  <artifactId>sedona-common</artifactId>
  <version>${project.version}</version>
</dependency>
Figured out I needed to add this here, but I've never figured out where the fat jar behavior comes from.
@Kimahriman This sounds like a good idea to me! But there are way more functions that need to move to this sedona-common, such as spatial partitioning, the serializer, format readers... I believe this will take a lot of effort.
@netanel246 @Imbruced @yitao-li Any opinions?
.github/workflows/java.yml (outdated)

@@ -3,7 +3,7 @@ name: Scala and Java build
 on:
   push:
     branches:
-      - master
+      - '*'
Do we need to change that? :) What's the benefit?
Was just temporary so I could get the tests to run before making the PR, will change back
    return left.distance(right);
}

public static double ST_YMin(Geometry geometry) {
I like the idea of applying DRY to Sedona in these scenarios, but I don't know if that naming is right in this context. Maybe we should keep camelCase here, and only use ST_*-like naming in the SQL functions where we are forced to? WDYT?
I'm fine with that. Also curious about thoughts on package structure/naming
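As a hypothetical sketch of that split (class and method names invented for illustration; the real common module operates on JTS Geometry objects, but plain coordinates keep this self-contained): the common module uses plain camelCase names, while the SQL-facing layer keeps the ST_* convention and simply delegates.

```java
// Hypothetical illustration only, not the actual Sedona code.
class CommonFunctions {
    // camelCase name in the shared module
    static double distance(double x1, double y1, double x2, double y2) {
        return Math.hypot(x2 - x1, y2 - y1);
    }
}

class SqlBindings {
    // The SQL layer keeps the ST_* naming and just delegates,
    // so each engine binding stays a thin wrapper.
    static double ST_Distance(double x1, double y1, double x2, double y2) {
        return CommonFunctions.distance(x1, y1, x2, y2);
    }
}
```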
Yeah, I just wanted to get the conversation started and figure out how to break this up so it's not a massive effort for all the things in one PR, but maybe done piece by piece (and at least start any new things in the common setup).
Just my 2 cents. Nice work!
import org.locationtech.jts.geom.Geometry;

public class Functions {
    public static double ST_Distance(Geometry left, Geometry right) {
Should functions in common do null checks, or should that be done in the Spark and Flink bindings? I don't have a strong opinion, but clarifying the contract for functions in common would be helpful for contributors.
Making functions in common null safe would be safer, but each platform (currently Spark and Flink) has its own way of doing input validation, so the null check might have to be repeated there. In Spark that is done by overriding checkInputDataTypes in expressions. It's currently not done in Sedona but would be nice to add in the future; it would make error messages a lot more user friendly.
Good point, it's really a matter of safety vs performance if the engine has a better way to handle null checks. For example, it would be pretty trivial to make a Spark codegen wrapper for the functions that only checks for nulls if the input is actually nullable. It really just depends on whether you want to squeeze every last drop of performance out or not.
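One hedged sketch of the trade-off being discussed (the helper name is invented, not Sedona API): the common functions stay null-oblivious, and a generic wrapper in a binding layer adds the check once instead of repeating it in every function.

```java
import java.util.function.BiFunction;

// Hypothetical helper, not actual Sedona code: a binding layer could
// wrap the plain common functions with a single null check.
class NullSafe {
    static <T, R> R apply(BiFunction<T, T, R> f, T left, T right) {
        // SQL semantics: a null input yields a null result.
        if (left == null || right == null) {
            return null;
        }
        return f.apply(left, right);
    }
}
```

An engine with nullability metadata (like Spark's codegen) could skip this wrapper entirely for non-nullable inputs, which is the performance angle mentioned above.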
Doing it piece by piece is probably a good idea, but it would be nice to have a clear end goal. Spatial partitioning and the kryo serializer are not intertwined with other APIs and should be pretty easy to move.
How about limiting this PR to the "Functions" defined in Flink? And then make separate issues for other things Flink already has (predicates, constructors), and then other things can be done as they are used/added to Flink.
- Move to "common" package
- Don't use ST_ for common funcs

Force-pushed from 260ff2a to 9bb2d3e
I agree. Let's start with the functions in this PR. And then create other PRs for other functions.
Ok, I think I got all the Flink Functions. Rewriting the geohash logic in Java was a bit of a pain; I wish everything could just be in Scala 😅 I didn't try to change any tests, was just gonna let the flink/spark-sql tests keep checking the logic for now.
@Kimahriman Do you think we should release sedona-common as a separate module, or should it be automatically packaged in other modules?
So I was actually just playing around with that, now that I think I understand how the modules currently work. Basically every dependency is provided scope except for python-adapter, which has basically everything bundled inside of it. Is there a reason it's set up that way? Any reason not to change it to what I think is the more traditional approach of just having compile scope dependencies, except for the big stuff like Spark, Hadoop, and Flink (and geotools for license reasons)?

Accidentally hit close merge request, hah.
@Kimahriman Users were supposed to call Sedona modules individually; they can mix and match different modules. Each module does not have any compile scope dependencies, which is great for Scala/Java experts to easily manage dependencies. However, when @Imbruced initially designed the python-adapter for Sedona PySpark, he suggested that we should provide a fat jar solution for Python users, because people in the Python world are not familiar with this Maven packaging stuff. Therefore, we decided to put all dependencies (including all Sedona modules, JTS, GeoJson, excluding Geotools) into sedona-python-adapter. This python-adapter was supposed to be used only by Python users, but later we found that many Scala/Java/R users also suffer from this packaging stuff. So on the Sedona website, we recommend that newcomers use the python-adapter jar unless they are confident about the packaging stuff.
Yeah, that's what I ended up doing all the time: I just include the python-adapter and then it automatically includes all the dependencies I need, which is a little odd. In fact, python-adapter has things bundled inside of it and has compile scope dependencies on all the things, so you end up double-including all of the classes. From a Spark perspective (which is all I do, not Flink, and via Python, so not a Java/Scala/Maven expert by any means), I think the two main approaches to including dependencies are via

Would you be open to switching to that type of setup? Then the "common" module would just be another compile scope module that gets automatically included for whatever needs it.
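A hedged sketch of the scoping scheme being proposed (artifact IDs and property names are illustrative, not the actual PR diff): sedona-common is declared at the default compile scope so it is pulled in transitively, while heavyweight engine dependencies stay provided.

```xml
<!-- Illustrative only: a downstream Sedona module's pom might look like this. -->
<dependencies>
  <!-- Default scope is compile, so sedona-common is bundled transitively
       for whatever module (or user) depends on this one. -->
  <dependency>
    <groupId>org.apache.sedona</groupId>
    <artifactId>sedona-common</artifactId>
    <version>${project.version}</version>
  </dependency>
  <!-- Big engine dependencies remain provided: the runtime (e.g. a Spark
       cluster) supplies them, so they are not packaged. -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```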
@Kimahriman Yes, I think it will be good to make sedona-common a compile scope dependency of whatever package needs it, so users don't need to include it manually. As in this PR, sedona-core pom: Can you update the PR? Then I will merge it. Eventually I plan to release this PR in Sedona 1.3.0, not the next release 1.2.1; I think this PR is a rather big change to the project structure and is better not to put in a maintenance release like 1.2.1.
This broke the build for me because of the wrong/not matching version in:
Yep, this never got updated after the 1.2.1 release got cut, can you update it in your PR?
Did so, see 231f98b.
Did you read the Contributor Guide?
Is this PR related to a JIRA ticket?
What changes were proposed in this PR?
Begin creating a new module with common SQL functions across Spark and Flink
How was this patch tested?
Existing UTs
Did this PR include necessary documentation updates?