
[SEDONA-212] Rework dependency management #735

Merged 26 commits on Jan 2, 2023

Conversation

Kimahriman
Contributor

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

Using the problem of shading as an opportunity to rethink how dependencies are managed. This is definitely meant to start a discussion and is a WIP; I finally got the tests to pass.

The general way things are set up now is "everything is a provided dependency, except for python-adapter, where everything is shaded". This presents a few problems mentioned in the ticket:

  • It's very difficult to deal with dependency conflicts if one of the conflicts is inside a shaded jar. You can't simply exclude the transitive dependency.
  • It's very difficult to actually use the package. What I assume most people do is default to just using the python-adapter module so they get everything (assuming they don't have problems with the shading). This is very awkward, especially if you aren't using Python. For example, we purely use the SQL functions in the sql module, but without knowing what sub-dependencies I need, I just have to default to the python-adapter module to get them.
  • The python-adapter module has all the dependencies shaded, yet they are still compile scope dependencies in the resulting pom, so all the dependencies end up included twice when using Java-based build tools or the Spark --packages option. This is a bug due to the resolved-pom-maven-plugin and the maven-shade-plugin not playing nice together, so the dependency-reduced-pom doesn't end up as the actual pom for the module.

Because of all that, I'm proposing a few changes in this PR:

  • Anything that should be a compile dependency is scoped as such, except for geotools, for licensing reasons.
  • Dependencies are defined in a dependencyManagement section in the parent pom, instead of the parent directly declaring dependencies for all modules even when they are only relevant to one. Each module then includes just the dependencies it needs (see the sketch after this list).
  • Scala plugins are all opt-in, to make it clear that the common module does not pick up any odd Scala dependencies, even for testing.
  • python-adapter still includes shading, but in a separate target used for the Python tests.
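To make the dependencyManagement point concrete, here is a minimal sketch (not the exact PR contents): the parent pom pins a version once, and a module then declares the dependency without repeating the version. jts-core is used as the example because it appears in every module.

```xml
<!-- Parent pom: pin the version once -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.locationtech.jts</groupId>
      <artifactId>jts-core</artifactId>
      <version>1.19.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- Child module pom: declare only what it uses; the version is inherited -->
<dependencies>
  <dependency>
    <groupId>org.locationtech.jts</groupId>
    <artifactId>jts-core</artifactId>
  </dependency>
</dependencies>
```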

I did have to work through various exclusions to get the tests to work. I was able to remove defining the jackson version for testing purposes.

Questions to answer would be:

  • Is there any reason to actually publish a shaded module? Anything Spark related can either:
    • Java/Scala: use their Java build tool of choice to include the right dependencies in their subsequent jar
    • Python: use the --packages feature to automatically pull all required transitive dependencies
  • I don't know anything about Flink and if there is a use case for that needing a shaded module
  • What to do with the new edu.ucar module? It looks like it's in its own private repo, so for those behind a proxy who don't have access to every repo on the internet, this could cause dependency resolution to fail.

How was this patch tested?

Existing tests.

Did this PR include necessary documentation updates?

  • There may have to be documentation updates if this route is accepted; I would need to look at what the current examples do.

@jiayuasu
Member

jiayuasu commented Dec 18, 2022

@Kimahriman

Please feel free to correct me if I am wrong since I am not 100% confident about the current POM setting.

  1. The reason we provide a shaded jar for python-adapter is that some Sedona users manually download the jar from the ASF dist and manually upload it to their environment. Since this step has no Maven involved, a shaded jar is helpful for them. This applies to Scala/Java/Python users on both Spark and Flink.
  2. The majority of Flink users are using Java and pure SQL. Some of them might use Scala. Few of them are using Python.
  3. I can create something like an edu-ucar wrapper to bring it to Maven Central. But I don't want to put it in compile scope because (1) it depends on Google Guava, which might have incompatible APIs across versions, and (2) it is quite large (3.3 MB by itself) and only a few people use it.

BTW, in the Sedona docs, we explain which dependencies should be included if you want to use the Sedona jar with others: https://sedona.apache.org/setup/maven-coordinates/#use-sedona-and-third-party-jars-separately

Your new proposal

Based on your proposal, users will now interact with the following jars. Can you please confirm?

All Spark / Flink dependencies must be provided scope in all cases.

Spark users:

  • Scala/Java user option 1: use Maven to manage dependencies

    • sedona-core: with some compile scope dependencies but not shaded
    • sedona-sql: with some compile scope dependencies but not shaded
    • sedona-viz: with some compile scope dependencies but not shaded
      Users no longer need to manually add other dependencies (jts, jts2geojson, geotools-wrapper). geotools-wrapper is now a compile scope dependency.
  • Scala/Java user option 2: manually download jar and upload jar, no Maven

    • sedona-python-adapter: a shaded jar like what we have now
    • geotools-wrapper: a shaded jar that consists of all geotools dependencies
  • Python user option: follow Scala/Java user option 2

Flink users:

  • Scala/Java user option 1: use Maven to manage dependencies

    • sedona-core: with some compile scope dependencies but not shaded
    • sedona-flink: with some compile scope dependencies but not shaded
  • Scala/Java user option 2: manually download jar and upload jar, no Maven

    • sedona-python-adapter: a shaded jar like what we have now
    • geotools-wrapper: a shaded jar that consists of all geotools dependencies

Final thoughts

It was a headache for me to get the POM right for publishing a release. Before we accept such a big change to the POM, we need to test it by publishing SNAPSHOTs (https://sedona.apache.org/community/snapshot/) and trying them with the existing projects in the example folder.

@Kimahriman
Contributor Author

Yeah, I think that mostly covers it. We can keep a shaded jar for Spark and/or Flink for people who need it; that can either be the python-adapter module or a new, specifically shaded module, since there's no reason the python-adapter needs to be shaded and it might be better to have an explicit module named "shaded". But for people who use dependency management systems (either by building custom jars or by using --packages in Spark), which should be what most people are doing and the common case, we use properly compile-scoped dependencies where possible, while providing whatever shaded artifacts we need for the edge cases where users don't have the network connectivity to pull all the dependencies automatically.

Definitely will need to do a lot of test publishing with this to verify the resulting poms, but I hope it will start to make the dependencies easier to manage and understand over time, and make issues with conflicting Scala versions and the like less likely.

I am not a pom or maven expert by any means (I learned a lot of this as I went), so I'm definitely looking for feedback. This is how I see a lot of other projects structured, and I tried to pull a lot from how the Spark poms themselves are structured.

@umartin
Contributor

umartin commented Dec 19, 2022

Nice work! Overall this looks good to me. Just a few notes:

  • Dependencies with compile scope are inherited. So python-adapter doesn't need to depend on sql, core and common. Only sql is needed. Same goes for the other modules. Maybe there is a reason why you had to repeat the dependencies that I'm missing. If so, ignore this comment :)
  • I think that separate shaded modules are a good idea. It would be nice to be able to opt out of shaded modules if you want to. In Python and R projects you can just add the --packages flag to spark-submit or set the configuration property spark.jars.packages. Then Spark will download any artifacts and their dependencies for you. Shading is never needed in Spark, regardless of which language you use, but it might be convenient depending on how you deploy your jobs. See https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management

If you're not that familiar with maven, the command mvn dependency:tree is a nice way to verify transitive dependencies.
Verifying the shaded artifacts is harder. I would probably run jar -tf sedona-spark-shaded.jar | sort before and after this patch and diff the output.
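Roughly, that verification would look like this (the jar names below are just placeholders for the artifacts built before and after the patch):

```sh
# List the resolved transitive dependencies of every module
mvn dependency:tree

# Diff the contents of the shaded jar before and after the change
jar -tf sedona-shaded-before.jar | sort > before.txt
jar -tf sedona-shaded-after.jar  | sort > after.txt
diff before.txt after.txt
```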

@jiayuasu publishing snapshots sounds like a good idea

@Kimahriman
Contributor Author

Kimahriman commented Dec 19, 2022

  • Dependencies with compile scope are inherited. So python-adapter doesn't need to depend on sql, core and common. Only sql is needed. Same goes for the other modules. Maybe there is a reason why you had to repeat the dependencies that I'm missing. If so, ignore this comment :)

This was mostly intentional; I tried to include everything that is directly imported by the package to be more explicit, and not rely on transitive dependencies being there. I feel like that's the recommended practice? That being said, I'm sure this isn't 100% true across the whole codebase right now. I feel like there are maven plugins you can use to check that, potentially?
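(For what it's worth, the maven-dependency-plugin has an analyze goal that reports classes used directly but only pulled in transitively, as well as declared-but-unused dependencies; just mentioning it as an option, it isn't wired into this PR.)

```sh
# Report used-but-undeclared and declared-but-unused dependencies per module
mvn dependency:analyze
```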

For the shaded modules, do you think completely separate modules, like sedona-spark-shaded and sedona-flink-shaded? Or just a classifier for the python adapter, like this PR currently has I think, e.g. org.apache.sedona:sedona-python-adapter-3.0_2.12:1.3.1-incubating-SNAPSHOT:shaded?

@umartin
Contributor

umartin commented Dec 19, 2022

This was mostly intentional; I tried to include everything that is directly imported by the package to be more explicit, and not rely on transitive dependencies being there. I feel like that's the recommended practice? That being said, I'm sure this isn't 100% true across the whole codebase right now. I feel like there are maven plugins you can use to check that, potentially?

I think you are right. Sorry! I checked Spark. They include dependencies explicitly. In https://github.com/apache/spark/blob/master/sql/core/pom.xml the dependency on sql-catalyst already pulls in core, but they have listed it explicitly anyway.

For the shaded modules, do you think completely separate modules, like sedona-spark-shaded and sedona-flink-shaded? Or just a classifier for the python adapter, like this PR currently has I think, e.g. org.apache.sedona:sedona-python-adapter-3.0_2.12:1.3.1-incubating-SNAPSHOT:shaded?

For shading I'm thinking separate modules. I don't know what all the pros and cons are, but it feels more natural to me. With classifiers you probably have to build twice to get both the shaded and non-shaded jars.
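As a rough sketch of the separate-module idea (the module name, dependency list, and configuration here are placeholders, not necessarily what this PR ends up with), the shaded module would have no sources of its own; it would just depend on the regular artifacts and run the shade plugin at package time:

```xml
<!-- Hypothetical sedona-spark-shaded module -->
<dependencies>
  <dependency>
    <groupId>org.apache.sedona</groupId>
    <artifactId>sedona-sql-3.0_2.12</artifactId>
    <version>${project.version}</version>
  </dependency>
</dependencies>
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```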

@jiayuasu
Member

@Kimahriman @umartin Thank you both for the great suggestions.

@Kimahriman Let's proceed with your proposal. In addition, let's create two completely separate modules: sedona-spark-shaded (for both Scala 2.12 and Scala 2.13) and sedona-flink-shaded (only for Scala 2.12).

After this, Adam, you can try to publish SNAPSHOTs to the ASF repo. You are a Sedona PMC member, so you have permission to upload SNAPSHOTs. I can provide some instructions and scripts for you to upload SNAPSHOTs.

@Kimahriman
Contributor Author

Got an initial version of the shaded modules working. Need to see if the tests still pass (mostly the Python ones, which now run against the shaded Spark module). After seeing how that works I'll look into publishing snapshots.

@Kimahriman Kimahriman changed the title [WIP][SEDONA-212] Rework dependency management [SEDONA-212] Rework dependency management Dec 25, 2022
@Kimahriman
Contributor Author

I think I have this in a good place with the builds working, except for the examples. It looks like those don't use any local builds of Sedona, and just rely on a snapshot version being published? Is that right?

Ready for anyone to review, and I can publish a snapshot version whenever. Do I just need to add a distributionManagement section to the pom temporarily? I think I have all my creds set up correctly.
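If it's just for the temporary route, a distributionManagement section would look roughly like this; the repository id and URL are my assumption of the ASF snapshot repository coordinates, not taken from the Sedona release docs:

```xml
<distributionManagement>
  <snapshotRepository>
    <!-- Assumed ASF snapshot repository id and URL -->
    <id>apache.snapshots.https</id>
    <url>https://repository.apache.org/content/repositories/snapshots</url>
  </snapshotRepository>
</distributionManagement>
```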

The only other thing would be updating the docs; I could try to do that here or in a separate PR.

@Kimahriman
Contributor Author

Coursier dependencies for each module now:

common:

com.fasterxml.jackson.core:jackson-annotations:2.12.2:default
com.fasterxml.jackson.core:jackson-core:2.12.2:default
com.fasterxml.jackson.core:jackson-databind:2.12.2:default
org.apache.sedona:sedona-common:1.3.2-incubating-SNAPSHOT:default
org.locationtech.jts:jts-core:1.19.0:default
org.wololo:jts2geojson:0.16.1:default

core:

commons-lang:commons-lang:2.6:default
org.apache.sedona:sedona-common:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-core-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.locationtech.jts:jts-core:1.19.0:default
org.scala-lang:scala-library:2.12.15:default
org.wololo:jts2geojson:0.16.1:default

sql:

commons-lang:commons-lang:2.6:default
org.apache.sedona:sedona-common:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-core-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-sql-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.locationtech.jts:jts-core:1.19.0:default
org.scala-lang:scala-library:2.12.15:default
org.scala-lang.modules:scala-collection-compat_2.12:2.5.0:default
org.wololo:jts2geojson:0.16.1:default

viz:

commons-lang:commons-lang:2.6:default
org.apache.sedona:sedona-common:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-core-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-sql-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-viz-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.beryx:awt-color-factory:1.0.0:default
org.locationtech.jts:jts-core:1.19.0:default
org.scala-lang:scala-library:2.12.15:default
org.scala-lang.modules:scala-collection-compat_2.12:2.5.0:default
org.wololo:jts2geojson:0.16.1:default

python-adapter:

commons-lang:commons-lang:2.6:default
org.apache.sedona:sedona-common:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-core-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-python-adapter-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-sql-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.locationtech.jts:jts-core:1.19.0:default
org.scala-lang:scala-library:2.12.15:default
org.scala-lang.modules:scala-collection-compat_2.12:2.5.0:default
org.wololo:jts2geojson:0.16.1:default

flink:

com.fasterxml.jackson.core:jackson-annotations:2.12.2:default
com.fasterxml.jackson.core:jackson-core:2.12.2:default
com.fasterxml.jackson.core:jackson-databind:2.12.2:default
commons-lang:commons-lang:2.6:default
org.apache.sedona:sedona-common:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-core-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-flink_2.12:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-sql-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.jheaps:jheaps:0.14:default
org.locationtech.jts:jts-core:1.19.0:default
org.scala-lang:scala-library:2.12.15:default
org.scala-lang.modules:scala-collection-compat_2.12:2.5.0:default
org.wololo:jts2geojson:0.16.1:default

spark-shaded:

org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.3.2-incubating-SNAPSHOT:default

flink-shaded:

org.apache.sedona:sedona-flink-shaded_2.12:1.3.2-incubating-SNAPSHOT:default

These should be what gets included by using the --packages option in Spark or by building your own fat jar.
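As a concrete example of the --packages route with these snapshot coordinates (the script name is a placeholder, the Spark/Scala suffix depends on your build, and the snapshot repository URL is my assumption of the ASF snapshot repo):

```sh
# Pulls sedona-sql and its compile-scope dependencies at submit time
spark-submit \
  --repositories https://repository.apache.org/content/repositories/snapshots \
  --packages org.apache.sedona:sedona-sql-3.0_2.12:1.3.2-incubating-SNAPSHOT \
  my_job.py
```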

@@ -3,7 +3,7 @@ name: Scala and Java build
on:
push:
branches:
- master
Member

Please don't change it to *. CI is supposed to only test the master branch and pull requests, because all repos under ASF share 150 VMs sponsored by GitHub and we want to leave some resources for other projects.

Contributor Author

I always change this locally so that I can get the CI to run before making a PR; I will change this back. Does that only apply to things running under the apache group? When it runs on my own GitHub Actions, does it use the same VMs?

@@ -3,7 +3,7 @@ name: Python build
on:
push:
branches:
- master
Member

Same here. Use master please.

@jiayuasu
Member

@Kimahriman I have updated the example projects. If you pull the latest change, it will pass the CI.

I suppose you have read: https://sedona.apache.org/1.3.1-incubating/community/release-manager/

Now you can try to publish the snapshots: https://sedona.apache.org/1.3.1-incubating/community/release-manager/

Once you publish the snapshots, you can try them in the example projects by using sedona-1.3.2-incubating-SNAPSHOT as the version.

@jiayuasu jiayuasu added this to the sedona-1.4.0 milestone Dec 28, 2022
@Kimahriman
Contributor Author

@Kimahriman I have updated the example projects. If you pull the latest change, it will pass the CI.

I suppose you have read: https://sedona.apache.org/1.3.1-incubating/community/release-manager/

Now you can try to publish the snapshots: https://sedona.apache.org/1.3.1-incubating/community/release-manager/

Once you publish the snapshots, you can try them in the example projects by using sedona-1.3.2-incubating-SNAPSHOT as the version.

Published 1.3.2-incubating-SNAPSHOT to the snapshot repo

@jiayuasu
Member

@Kimahriman Did you try the SNAPSHOT in example projects?

@Kimahriman
Contributor Author

@Kimahriman Did you try the SNAPSHOT in example projects?

Yeah, I've been trying, but I'm running into some weird things that may just be issues with my local environment; I will keep looking into it this weekend.

@Kimahriman
Contributor Author

I still had an old shaded jar in my Spark jars folder; the examples are working off of the published snapshot versions now.

Member

@jiayuasu jiayuasu left a comment

I think this looks good to me, except for the java.net repo.

pom.xml Outdated
<scope>test</scope>
</dependency>
</dependencies>
</dependencyManagement>
<repositories>
<repository>
<id>maven2-repository.dev.java.net</id>
Member

One last piece: I think the java.net repo is no longer needed, right? I think it was used by some geotools dependencies, but now, after removing this repo, the compilation still works.

Contributor Author

Removed it

@douglasdennis
Contributor

For what it's worth and from a user's perspective, I think this looks awesome and I love the idea of an explicitly named shaded jar over the python adapter one. I just noted several comments about only needing Spark for dependency management (like using --packages), and wanted to advocate for projects that cannot use Spark in that way.

The simplest example is a cluster that is disconnected from the internet in general and can only rely on the jars on the local system. In that case, a shaded jar is a very pleasant thing to have, especially when the build pipeline is only set up for Python or R.

@jiayuasu jiayuasu merged commit 91af711 into apache:master Jan 2, 2023