
[SEDONA-212] Rework dependency management #735

Merged 26 commits on Jan 2, 2023

Conversation

Kimahriman
Contributor

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

Using the problem of shading as an opportunity to rethink how dependencies are managed. This is definitely meant to start a discussion and is a WIP; I finally got the tests to pass.

The general way things are set up now is "everything is a provided dependency, except for python-adapter, where everything is shaded". This presents a few problems mentioned in the ticket:

  • It's very difficult to deal with dependency conflicts if one of the conflicts is inside a shaded jar. You can't simply exclude the transitive dependency.
  • It's very difficult to actually use the package. What I assume most people do is default to just using the python-adapter module so they get everything (assuming they don't have problems with the shading). This is very awkward, especially if you aren't using Python. For example, we purely use the SQL functions in the sql module, but without knowing what sub-dependencies I need, I just have to default to the python-adapter module to get them.
  • The python-adapter module has all the dependencies shaded, yet they are still compile scope dependencies in the resulting pom, so all the dependencies end up included twice when using Java-based build tools or the Spark --packages option. This is a bug due to the resolved-pom-maven-plugin and the maven-shade-plugin not playing nice together, so the dependency-reduced-pom doesn't end up as the actual pom for the module.

Because of all that, I'm proposing a few changes in this PR:

  • Anything that should be a compile dependency is scoped as such, except for geotools, for licensing reasons.
  • Dependencies are defined in a dependencyManagement section in the parent pom, instead of the parent directly declaring dependencies for all modules even when they are only relevant to one. Each module then includes just the dependencies it needs (see the sketch after this list).
  • Scala plugins are all opt-in, to make it clear that the common module does not pick up any odd Scala dependencies, even for testing.
  • python-adapter still includes shading, but in a separate target used for the Python tests.
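To make the dependencyManagement point concrete, here is a minimal sketch (not the exact PR contents): the parent pom pins a version once, and a module then declares the dependency without repeating the version. jts-core is used as the example because it appears in every module.

```xml
<!-- Parent pom: pin the version once -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.locationtech.jts</groupId>
      <artifactId>jts-core</artifactId>
      <version>1.19.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- Child module pom: declare only what it uses; the version is inherited -->
<dependencies>
  <dependency>
    <groupId>org.locationtech.jts</groupId>
    <artifactId>jts-core</artifactId>
  </dependency>
</dependencies>
```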

I did have to work through various exclusions to get the tests to work. I was able to remove defining the jackson version for testing purposes.

Questions to answer would be:

  • Is there any reason to actually publish a shaded module? Anything Spark related can either:
    • Java/Scala: use their Java build tool of choice to include the right dependencies in their subsequent jar
    • Python: use the --packages feature to automatically pull all required transitive dependencies
  • I don't know anything about Flink and if there is a use case for that needing a shaded module
  • What to do with the new edu.ucar module? It looks like it's in its own private repo, so for those behind a proxy who don't have access to every repo on the internet, this could cause dependency resolution to fail.

How was this patch tested?

Existing tests.

Did this PR include necessary documentation updates?

  • There may have to be documentation updates if this route is accepted; I would need to look at what the current examples do.

@jiayuasu
Member

jiayuasu commented Dec 18, 2022

@Kimahriman

Please feel free to correct me if I am wrong since I am not 100% confident about the current POM setting.

  1. The reason we provide a shaded jar for python-adapter is that some Sedona users manually download the jar from the ASF dist and manually upload it to their environment. Since this step has no Maven involved, a shaded jar is helpful for them. This applies to Scala/Java/Python users on both Spark and Flink.
  2. The majority of Flink users are using Java and pure SQL. Some of them might use Scala. Few of them are using Python.
  3. I can create something like an edu-ucar wrapper to bring it to Maven Central. But I don't want to put it in compile scope because (1) it depends on Google Guava, which might have incompatible APIs across versions, and (2) it is quite large (3.3 MB by itself) and only a few people use it.

BTW, in the Sedona docs, we explain which dependencies should be included if you want to use the Sedona jar with others: https://sedona.apache.org/setup/maven-coordinates/#use-sedona-and-third-party-jars-separately

Your new proposal

Based on your proposal, users will now interact with the following jars. Can you please confirm?

All Spark / Flink dependencies must be provided scope in all cases.

Spark users:

  • Scala/Java user option 1: use Maven to manage dependencies

    • sedona-core: with some compile scope dependencies but not shaded
    • sedona-sql: with some compile scope dependencies but not shaded
    • sedona-viz: with some compile scope dependencies but not shaded
      Users no longer need to manually add other dependencies (jts, jts2geojson, geotools-wrapper). geotools-wrapper is now a compile scope dependency.
  • Scala/Java user option 2: manually download jar and upload jar, no Maven

    • sedona-python-adapter: a shaded jar like what we have now
    • geotools-wrapper: a shaded jar that consists of all geotools dependencies
  • Python user option: follow Scala/Java user option 2

Flink users:

  • Scala/Java user option 1: use Maven to manage dependencies

    • sedona-core: with some compile scope dependencies but not shaded
    • sedona-flink: with some compile scope dependencies but not shaded
  • Scala/Java user option 2: manually download jar and upload jar, no Maven

    • sedona-python-adapter: a shaded jar like what we have now
    • geotools-wrapper: a shaded jar that consists of all geotools dependencies

Final thoughts

It was a headache for me to get the POM right for publishing a release. Before we accept such a big change to the POM, we need to test it by publishing SNAPSHOTs (https://sedona.apache.org/community/snapshot/) and trying them with the existing projects in the example folder.

@Kimahriman
Contributor Author

Yeah, I think that mostly covers it. We can keep a shaded jar for Spark and/or Flink for people who need it; that can either be the python-adapter module or a new, specifically shaded module, since there's no reason the python-adapter needs to be shaded and it might be better to have an explicit module named "shaded". But for people who use dependency management systems (either by building custom jars or by using --packages in Spark), which should be what most people are doing and the common case, we use properly compile-scoped dependencies where possible, while providing whatever shaded artifacts we need for the edge cases where users don't have the network connectivity to pull all the dependencies automatically.

Definitely will need to do a lot of test publishing with this to verify the resulting poms, but I hope it will start to make the dependencies easier to manage and understand over time, and make issues with conflicting Scala versions and the like less likely.

I am not a pom or maven expert by any means (I learned a lot of this as I went), so I'm definitely looking for feedback. This is how I see a lot of other projects structured, and I tried to pull a lot from how the Spark poms themselves are structured.

@umartin
Contributor

umartin commented Dec 19, 2022

Nice work! Overall this looks good to me. Just a few notes:

  • Dependencies with compile scope are inherited. So python-adapter doesn't need to depend on sql, core and common. Only sql is needed. Same goes for the other modules. Maybe there is a reason why you had to repeat the dependencies that I'm missing. If so, ignore this comment :)
  • I think that separate shaded modules are a good idea. It would be nice to be able to opt out of shaded modules if you want to. In Python and R projects you can just add the --packages flag to spark-submit or set the configuration property spark.jars.packages. Then Spark will download any artifacts and their dependencies for you. Shading is never needed in Spark, regardless of which language you use, but it might be convenient depending on how you deploy your jobs. See https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management

If you're not that familiar with maven, the command mvn dependency:tree is a nice way to verify transitive dependencies.
Verifying the shaded artifacts is harder. I would probably run jar -tf sedona-spark-shaded.jar | sort before and after this patch and diff the output.
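Roughly, that verification would look like this (the jar names below are just placeholders for the artifacts built before and after the patch):

```sh
# List the resolved transitive dependencies of every module
mvn dependency:tree

# Diff the contents of the shaded jar before and after the change
jar -tf sedona-shaded-before.jar | sort > before.txt
jar -tf sedona-shaded-after.jar  | sort > after.txt
diff before.txt after.txt
```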

@jiayuasu publishing snapshots sounds like a good idea

@Kimahriman
Contributor Author

Kimahriman commented Dec 19, 2022

  • Dependencies with compile scope are inherited. So python-adapter doesn't need to depend on sql, core and common. Only sql is needed. Same goes for the other modules. Maybe there is a reason why you had to repeat the dependencies that I'm missing. If so, ignore this comment :)

This was mostly intentional; I tried to include everything that is directly imported by the package to be more explicit, and not rely on transitive dependencies being there. I feel like that's the recommended practice? That being said, I'm sure this isn't 100% true across the whole codebase right now. I feel like there are maven plugins you can use to check that, potentially?
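(For what it's worth, the maven-dependency-plugin has an analyze goal that reports classes used directly but only pulled in transitively, as well as declared-but-unused dependencies; just mentioning it as an option, it isn't wired into this PR.)

```sh
# Report used-but-undeclared and declared-but-unused dependencies per module
mvn dependency:analyze
```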

For the shaded modules, do you think completely separate modules, like sedona-spark-shaded and sedona-flink-shaded? Or just a classifier for the python adapter, like this PR currently has I think, e.g. org.apache.sedona:sedona-python-adapter-3.0_2.12:1.3.1-incubating-SNAPSHOT:shaded?

@umartin
Contributor

umartin commented Dec 19, 2022

This was mostly intentional; I tried to include everything that is directly imported by the package to be more explicit, and not rely on transitive dependencies being there. I feel like that's the recommended practice? That being said, I'm sure this isn't 100% true across the whole codebase right now. I feel like there are maven plugins you can use to check that, potentially?

I think you are right. Sorry! I checked Spark. They include dependencies explicitly. In https://github.com/apache/spark/blob/master/sql/core/pom.xml the dependency on sql-catalyst already pulls in core, but they have listed it explicitly anyway.

For the shaded modules, do you think completely separate modules, like sedona-spark-shaded and sedona-flink-shaded? Or just a classifier for the python adapter, like this PR currently has I think, e.g. org.apache.sedona:sedona-python-adapter-3.0_2.12:1.3.1-incubating-SNAPSHOT:shaded?

For shading I'm thinking separate modules. I don't know what all the pros and cons are, but it feels more natural to me. With classifiers you probably have to build twice to get both the shaded and non-shaded jars.
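As a rough sketch of the separate-module idea (the module name, dependency list, and configuration here are placeholders, not necessarily what this PR ends up with), the shaded module would have no sources of its own; it would just depend on the regular artifacts and run the shade plugin at package time:

```xml
<!-- Hypothetical sedona-spark-shaded module -->
<dependencies>
  <dependency>
    <groupId>org.apache.sedona</groupId>
    <artifactId>sedona-sql-3.0_2.12</artifactId>
    <version>${project.version}</version>
  </dependency>
</dependencies>
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```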

@jiayuasu
Member

@Kimahriman @umartin Thank you both for the great suggestions.

@Kimahriman Let's proceed with your proposal. In addition, let's create two completely separate modules: sedona-spark-shaded (for both Scala 2.12 and Scala 2.13) and sedona-flink-shaded (only for Scala 2.12).

After this, Adam, you can try to publish SNAPSHOTs to the ASF repo. You are a Sedona PMC member, so you have permission to upload SNAPSHOTs. I can provide some instructions and scripts for you to upload SNAPSHOTs.

@Kimahriman
Contributor Author

Got an initial version of the shaded modules working. Need to see if the tests still pass (mostly the Python ones, which now run against the shaded Spark module). After seeing how that works I'll look into publishing snapshots.

@Kimahriman Kimahriman changed the title [WIP][SEDONA-212] Rework dependency management [SEDONA-212] Rework dependency management Dec 25, 2022
@Kimahriman
Contributor Author

I think I have this in a good place with the builds working, except for the examples. It looks like those don't use any local builds of Sedona, and just rely on a snapshot version being published? Is that right?

Ready for anyone to review, and I can publish a snapshot version whenever. Do I just need to add a distributionManagement section to the pom temporarily? I think I have all my creds set up correctly.
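If it's just for the temporary route, a distributionManagement section would look roughly like this; the repository id and URL are my assumption of the ASF snapshot repository coordinates, not taken from the Sedona release docs:

```xml
<distributionManagement>
  <snapshotRepository>
    <!-- Assumed ASF snapshot repository id and URL -->
    <id>apache.snapshots.https</id>
    <url>https://repository.apache.org/content/repositories/snapshots</url>
  </snapshotRepository>
</distributionManagement>
```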

The only other thing would be updating the docs; I could try to do that here or in a separate PR.

@Kimahriman
Contributor Author

Coursier dependencies for each module now:

common:

com.fasterxml.jackson.core:jackson-annotations:2.12.2:default
com.fasterxml.jackson.core:jackson-core:2.12.2:default
com.fasterxml.jackson.core:jackson-databind:2.12.2:default
org.apache.sedona:sedona-common:1.3.2-incubating-SNAPSHOT:default
org.locationtech.jts:jts-core:1.19.0:default
org.wololo:jts2geojson:0.16.1:default

core:

commons-lang:commons-lang:2.6:default
org.apache.sedona:sedona-common:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-core-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.locationtech.jts:jts-core:1.19.0:default
org.scala-lang:scala-library:2.12.15:default
org.wololo:jts2geojson:0.16.1:default

sql:

commons-lang:commons-lang:2.6:default
org.apache.sedona:sedona-common:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-core-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-sql-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.locationtech.jts:jts-core:1.19.0:default
org.scala-lang:scala-library:2.12.15:default
org.scala-lang.modules:scala-collection-compat_2.12:2.5.0:default
org.wololo:jts2geojson:0.16.1:default

viz:

commons-lang:commons-lang:2.6:default
org.apache.sedona:sedona-common:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-core-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-sql-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-viz-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.beryx:awt-color-factory:1.0.0:default
org.locationtech.jts:jts-core:1.19.0:default
org.scala-lang:scala-library:2.12.15:default
org.scala-lang.modules:scala-collection-compat_2.12:2.5.0:default
org.wololo:jts2geojson:0.16.1:default

python-adapter:

commons-lang:commons-lang:2.6:default
org.apache.sedona:sedona-common:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-core-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-python-adapter-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-sql-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.locationtech.jts:jts-core:1.19.0:default
org.scala-lang:scala-library:2.12.15:default
org.scala-lang.modules:scala-collection-compat_2.12:2.5.0:default
org.wololo:jts2geojson:0.16.1:default

flink:

com.fasterxml.jackson.core:jackson-annotations:2.12.2:default
com.fasterxml.jackson.core:jackson-core:2.12.2:default
com.fasterxml.jackson.core:jackson-databind:2.12.2:default
commons-lang:commons-lang:2.6:default
org.apache.sedona:sedona-common:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-core-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-flink_2.12:1.3.2-incubating-SNAPSHOT:default
org.apache.sedona:sedona-sql-3.0_2.12:1.3.2-incubating-SNAPSHOT:default
org.jheaps:jheaps:0.14:default
org.locationtech.jts:jts-core:1.19.0:default
org.scala-lang:scala-library:2.12.15:default
org.scala-lang.modules:scala-collection-compat_2.12:2.5.0:default
org.wololo:jts2geojson:0.16.1:default

spark-shaded:

org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.3.2-incubating-SNAPSHOT:default

flink-shaded:

org.apache.sedona:sedona-flink-shaded_2.12:1.3.2-incubating-SNAPSHOT:default

These should be what gets included by using the --packages option in Spark or by building your own fat jar.
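As a concrete example of the --packages route with these snapshot coordinates (the script name is a placeholder, the Spark/Scala suffix depends on your build, and the snapshot repository URL is my assumption of the ASF snapshot repo):

```sh
# Pulls sedona-sql and its compile-scope dependencies at submit time
spark-submit \
  --repositories https://repository.apache.org/content/repositories/snapshots \
  --packages org.apache.sedona:sedona-sql-3.0_2.12:1.3.2-incubating-SNAPSHOT \
  my_job.py
```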

@@ -3,7 +3,7 @@ name: Scala and Java build
on:
push:
branches:
- master
Member

Please don't change it to *. CI is supposed to only test the master branch and pull requests, because all repos under ASF share 150 VMs sponsored by GitHub and we want to leave some resources for other projects.

Contributor Author

I always change this locally so that I can get the CI to run before making a PR; I will change this back. Does that only apply to things running under the apache group? When it runs on my own GitHub Actions, does it use the same VMs?

@@ -3,7 +3,7 @@ name: Python build
on:
push:
branches:
- master
Member

Same here. Use master please.

@jiayuasu
Member

@Kimahriman I have updated the example projects. If you pull the latest change, it will pass the CI.

I suppose you have read: https://sedona.apache.org/1.3.1-incubating/community/release-manager/

Now you can try to publish the snapshots: https://sedona.apache.org/1.3.1-incubating/community/release-manager/

Once you publish the snapshots, you can try them in the example projects by using sedona-1.3.2-incubating-SNAPSHOT as the version.

@jiayuasu jiayuasu added this to the sedona-1.4.0 milestone Dec 28, 2022
@Kimahriman
Contributor Author

@Kimahriman I have updated the example projects. If you pull the latest change, it will pass the CI.

I suppose you have read: https://sedona.apache.org/1.3.1-incubating/community/release-manager/

Now you can try to publish the snapshots: https://sedona.apache.org/1.3.1-incubating/community/release-manager/

Once you publish the snapshots, you can try them in the example projects by using sedona-1.3.2-incubating-SNAPSHOT as the version.

Published 1.3.2-incubating-SNAPSHOT to the snapshot repo

@jiayuasu
Member

@Kimahriman Did you try the SNAPSHOT in example projects?

@Kimahriman
Contributor Author

@Kimahriman Did you try the SNAPSHOT in example projects?

Yeah, I've been trying, but I'm running into some weird things that may just be issues with my local environment; I will keep looking into it this weekend.

@Kimahriman
Contributor Author

I still had an old shaded jar in my Spark jars folder; the examples are working off of the published snapshot versions now.

Member

@jiayuasu jiayuasu left a comment

I think this looks good to me, except for the java.net repo.

pom.xml Outdated
<scope>test</scope>
</dependency>
</dependencies>
</dependencyManagement>
<repositories>
<repository>
<id>maven2-repository.dev.java.net</id>
Member

One last piece: I think the java.net repo is no longer needed, right? I think it was used by some geotools dependencies, but now, after removing this repo, the compilation still works.

Contributor Author

Removed it

@douglasdennis
Contributor

For what it's worth and from a user's perspective, I think this looks awesome and I love the idea of an explicitly named shaded jar over the python adapter one. I just noted several comments about only needing Spark for dependency management (like using --packages), and wanted to advocate for projects that cannot use Spark in that way.

The simplest example is a cluster that is disconnected from the internet in general and can only rely on the jars on the local system. In that case, a shaded jar is a very pleasant thing to have, especially when the build pipeline is only set up for Python or R.

@jiayuasu jiayuasu merged commit 91af711 into apache:master Jan 2, 2023