-
Notifications
You must be signed in to change notification settings - Fork 654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SEDONA-212] Rework dependency management #735
Conversation
Please feel free to correct me if I am wrong since I am not 100% confident about the current POM setting.
BTW, in Sedona doc, we explain that what dependency should be called if you want to use Sedona jar with others: https://sedona.apache.org/setup/maven-coordinates/#use-sedona-and-third-party-jars-separately Your new proposalBased on your proposal, users will now interact with the following jars. Can you please confirm? All Spark / Flink dependency must be Spark users:
Flink users:
Final thoughtsIt was a headache for me to get the POM right for publishing a release. Before we accept such a big change to the POM, we need to test by publishing SNAPSHOTS (https://sedona.apache.org/community/snapshot/) and try with the existing projects in |
Yeah I think that mostly covers it. We can keep a shaded jar for Spark and/or Flink for people who need it, and that can either be the python-adapter module or a new specifically shaded module since there's no reason the python-adapter needs to be shaded and it might be better to have an explicit module named "shaded". But for people who are using dependency management systems (either by building custom jars or using Definitely will need to do a lot of test publishing with this to verify the resulting poms, but I hope it will start make the dependencies easier to manage and understand over time, and be less likely to have issues with conflicting scala versions and such. I am not a pom or maven expert by any means (learned a lot of this as I went), so definitely looking for any feedback. This is how I see a lot of other projects structured, and I tried to pull a lot of things from how the Spark poms themselves were structured. |
Nice work! Over all this looks good to me. Just a few notes:
If you're not that familiar with maven, the command @jiayuasu publishing snapshots sounds lika a good idea |
This was mostly intentional, I tried to include everything that was directly imported by the package to be more explicit, and not rely on expecting transitive dependencies to be there. I feel like that's the recommended practice? That being said I'm sure this isn't 100% true across the whole codebase right now. I feel like there are maven plugins you can use to check that potentially? For the shaded modules do you think completely separate modules, like |
I think you are right. Sorry! I checked spark. They include dependencies explicitly. In https://github.com/apache/spark/blob/master/sql/core/pom.xml the dependency on sql-catalyst already pulls in core but they have listed it explicitly any way.
For shading I'm thinking separate modules. I don't know what the pros and cons are but it feels more natural to me. With qualifiers you probably have to build twice to get both the shaded and non-shaded jars. |
@Kimahriman @umartin Thank you both for the great suggestions. @Kimahriman Let's proceed with your proposal. In addition, let's create two completely separate modules After this, Adam you can try to publish SNAPSHOTs to ASF repo. You are a Sedona PMC so you have the permission to upload SNAPSHOTs. I can provide some instruction and scripts for you to upload SNAPSHOTs. |
Got an initial version of the shaded modules working. Need to see if tests still pass (mostly python ones, using the shaded spark module to run those now). After seeing how that works I'll look into publishing snapshots |
I think I have this in a good place and builds working, except for the examples. It looks like those don't use any local builds of sedona to build the examples, and just relies on a snapshot version being published? Is that right? Ready for anyone to review, and I can publish a snapshot version whenever. Do I just need to add a distribution management section to the pom temporarily? I think I have all my creds setup correctly. Only other thing would be updating docs, could try to do that here or in a separate PR. |
Couriser dependencies for each module now: common:
core:
sql:
viz:
python-adapter:
flink:
spark-shaded:
flink-shaded:
These should be what get included by using the |
.github/workflows/java.yml
Outdated
@@ -3,7 +3,7 @@ name: Scala and Java build | |||
on: | |||
push: | |||
branches: | |||
- master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't change it to *
. CI is supposed to only test the master branch and pull requests because all repos under ASF share 150 VMs sponsored by GitHub and we want to leave some resources to other projects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always change this locally so that I can get the CI to run before making a PR, will change this back. Does that only apply to things running under the apache group? When it runs on my own GitHub actions does it use the same vms?
.github/workflows/python.yml
Outdated
@@ -3,7 +3,7 @@ name: Python build | |||
on: | |||
push: | |||
branches: | |||
- master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Use master
please.
@Kimahriman I have updated the example projects. If you pull the latest change, it will pass the CI. I suppose you have read: https://sedona.apache.org/1.3.1-incubating/community/release-manager/ Now you can try to publish the snapshots: https://sedona.apache.org/1.3.1-incubating/community/release-manager/ Once you publish the snapshots, you can try them in the example projects by using sedona-1.3.2-incubating-SNAPSHOT as the version. |
Published 1.3.2-incubating-SNAPSHOT to the snapshot repo |
@Kimahriman Did you try the SNAPSHOT in example projects? |
Yeah I've been trying but running into some weird things that may just be issues with my local environment, will keep looking into this weekend |
Had an old shaded jar in my spark jars folder still, examples are working off of the published snapshot versions now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good to me, except the java.net repo.
pom.xml
Outdated
<scope>test</scope> | ||
</dependency> | ||
</dependencies> | ||
</dependencyManagement> | ||
<repositories> | ||
<repository> | ||
<id>maven2-repository.dev.java.net</id> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last piece: I think java.net repo is no longer needed, right? I think it was used by some geotools dependencies. But now, after I remove this repo, the compilation still works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed it
For what it's worth and from a user's perspective, I think this looks awesome and I love the idea of an explicitly named shaded jar over the python adapter one. I just noted several comments about only needing Spark for dependency management (like using The simplest example is a cluster that is disconnected from the internet in general and can only rely on the jars in the local system. In that case, a shaded jar is a very pleasant thing to have. Especially when the build pipeline is only setup for python or R. |
Did you read the Contributor Guide?
Is this PR related to a JIRA ticket?
[SEDONA-XXX] my subject
.What changes were proposed in this PR?
Using the problem of shading to rethink how dependencies are managed. Definitely trying to start a discussion and this is a WIP, finally got the tests to pass.
The general way things are setup now is "everything is a provided dependency, except for python-adapter where everything is shaded". This presents a few problems mentioned in the ticket:
sql
module, but without knowing what sub-dependencies I need, I just have to default to the python-adapter module so I have the dependencies I need.--packages
option. This is a bug due to theresolved-pom-maven-plugin
and themaven-shade-plugin
not playing nice together, so thedependency-reduced-pom
doesn't end up as the actual pom for the module.Because of all that, I'm proposing a few changes in this PR:
dependencyManagement
section in the parent pom instead of directly defining dependencies for all modules that might only be relevant to one. Each module includes just the relevant dependencies it needs.I did have to work through various exclusions to get the tests to work. I was able to remove the defining the jackson version for all testing purposes.
Questions to answer would be:
--packages
feature to automatically pull all required transitive dependenciesHow was this patch tested?
Existing tests.
Did this PR include necessary documentation updates?