
Add Spark Job Launcher tool #9288

Merged

merged 10 commits into apache:master on Sep 7, 2022
Conversation

KKcorps
Contributor

@KKcorps KKcorps commented Aug 27, 2022

Users currently need to construct the whole spark-submit command to run a Spark job for batch ingestion. With so many plugins available inside Pinot, this leads to a lot of classpath errors, and you also need to handle various arguments depending on the environment you are running in. This new pinot-admin command aims to simplify this for users.

Example

Previously, you had to run:

```shell
export PINOT_VERSION=0.11.0-SNAPSHOT
export PINOT_DISTRIBUTION_DIR=/Users/kharekartik/Documents/Developer/pinot/build/

spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master yarn \
  --deploy-mode client \
  --jars ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar,${PINOT_DISTRIBUTION_DIR}/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.11.0-SNAPSHOT-shaded.jar,${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-s3/pinot-s3-0.11.0-SNAPSHOT-shaded.jar,${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3.2/pinot-batch-ingestion-spark-3.2-0.11.0-SNAPSHOT-shaded.jar \
  local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar \
  -jobSpecFile parquet_ingestion_spec_spark3_students.yml
```

Now you can simply use:

```shell
export SPARK_HOME=/usr/lib/spark/

bin/pinot-admin.sh LaunchSparkDataIngestionJob \
  -jobSpecFile parquet_ingestion_spec_spark3_students.yml \
  -pluginsToLoad pinot-parquet:pinot-s3 \
  -master yarn
```

Additional Options

  • You can pass additional Spark configurations using the -sparkConf option:
    `-sparkConf spark.executor.cores=3:num-executors=4`

  • You can also load jars directly from S3/GCS instead of local disk, which is useful in environments such as EMR:
    `-pinotBaseDir s3://your-bucket/apache-pinot-0.11.0-SNAPSHOT`

  • You can choose whether to run Spark 2.x or 3.x with the -sparkVersion option (the default is SPARK_3):
    `-sparkVersion SPARK_2`
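Putting the options together: the sketch below only assembles and prints a combined invocation rather than running it (the spec file, plugin list, bucket name, and Spark settings are the illustrative values from the examples above, not required ones):

```shell
# Assemble a combined LaunchSparkDataIngestionJob invocation.
# All values are the illustrative examples from this description.
CMD="bin/pinot-admin.sh LaunchSparkDataIngestionJob \
  -jobSpecFile parquet_ingestion_spec_spark3_students.yml \
  -pluginsToLoad pinot-parquet:pinot-s3 \
  -master yarn \
  -sparkConf spark.executor.cores=3:num-executors=4 \
  -pinotBaseDir s3://your-bucket/apache-pinot-0.11.0-SNAPSHOT \
  -sparkVersion SPARK_3"

# Print the command instead of executing it, since this is only a sketch.
echo "$CMD"
```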

@KKcorps KKcorps requested a review from xiangfu0 August 27, 2022 18:55
@codecov-commenter

codecov-commenter commented Aug 27, 2022

Codecov Report

Merging #9288 (05ebd9d) into master (a5a83aa) will decrease coverage by 42.57%.
The diff coverage is 18.85%.

❗ Current head 05ebd9d differs from pull request most recent head 32fe741. Consider uploading reports for the commit 32fe741 to get more accurate results

@@              Coverage Diff              @@
##             master    #9288       +/-   ##
=============================================
- Coverage     68.66%   26.09%   -42.58%     
+ Complexity     4680       44     -4636     
=============================================
  Files          1859     1855        -4     
  Lines         99120    99278      +158     
  Branches      15075    15112       +37     
=============================================
- Hits          68062    25904    -42158     
- Misses        26174    70783    +44609     
+ Partials       4884     2591     -2293     
Flag Coverage Δ
integration1 26.09% <18.85%> (-0.13%) ⬇️
unittests1 ?
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...apache/pinot/broker/api/HttpRequesterIdentity.java 28.57% <0.00%> (-57.15%) ⬇️
...org/apache/pinot/broker/api/RequesterIdentity.java 50.00% <0.00%> (-50.00%) ⬇️
.../pinot/broker/api/resources/PinotBrokerLogger.java 0.00% <0.00%> (ø)
...mon/segment/generation/SegmentGenerationUtils.java 7.29% <0.00%> (-13.77%) ⬇️
...rg/apache/pinot/common/utils/LoggerFileServer.java 0.00% <0.00%> (ø)
...pache/pinot/common/utils/config/InstanceUtils.java 12.19% <0.00%> (-77.95%) ⬇️
...org/apache/pinot/common/utils/http/HttpClient.java 68.13% <ø> (ø)
...er/api/access/ZkBasicAuthAccessControlFactory.java 0.00% <ø> (ø)
...ache/pinot/controller/api/resources/Constants.java 21.05% <ø> (-21.06%) ⬇️
...er/api/resources/LLCSegmentCompletionHandlers.java 62.37% <ø> (+18.81%) ⬆️
... and 1365 more


<groupId>org.apache.spark</groupId>
<artifactId>spark-launcher_${scala.version}</artifactId>
<version>${spark.version}</version>
<exclusions>
Contributor

I do have some minor concerns about adding dependencies on Spark here. But since it's only pinot-tools, it's not so important.

Contributor Author

This dependency is pretty lightweight: it only contains a couple of classes and one additional transitive dependency, so including it isn't a major concern. Earlier I had included spark-core, which is a heavy dependency and could have caused a lot of issues.

@@ -34,6 +34,8 @@
<properties>
<pinot.root>${basedir}/..</pinot.root>
<aws.version>2.14.28</aws.version>
<scala.version>2.12</scala.version>
Contributor

Once you start pulling in Scala code, you need to ensure that every dependency that's also using Scala is using the same version. So I think this should go in the top-level pom.xml file. Also note that pinot-kafka and pinot-spark have dependencies on Scala 2.11, which I believe will cause runtime problems if they are on the classpath when the tool is being run (and it's using 2.12).
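A sketch of that suggestion, assuming the `scala.version` property is hoisted into the `<properties>` block of the root pom.xml so that every module inherits the same value (the comment and placement are illustrative, not from the PR):

```xml
<!-- Root pom.xml: pin the Scala binary version once so every Scala-based
     dependency (spark-launcher, kafka plugins, ...) agrees on it. -->
<properties>
  <scala.version>2.12</scala.version>
</properties>
```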

Contributor Author

I have actually excluded the Scala code from the plugin. It is just that the artifact requires the Scala version in its name.

Contributor

@xiangfu0 xiangfu0 left a comment

lgtm, it simplifies the Spark story a lot

@KKcorps
Contributor Author

KKcorps commented Aug 29, 2022

It is failing in some cases, e.g. a local but multi-threaded environment. Working on fixing those, after which we can merge.

```java
// Kafka plugins need to be excluded as they contain Scala dependencies which cause
// NoSuchMethodErrors with runtime Spark.
// It is also fine to exclude the Kafka plugins, since they are not used in batch ingestion anyway.
@CommandLine.Option(names = {"-pluginsToExclude"}, defaultValue = "pinot-kafka-0.9:pinot-kafka-2.0", required =
```
Contributor

@walterddr walterddr Aug 31, 2022

I wonder: since we already have a properties file and a job spec file, do we still want to support these command-line options?

Contributor Author

Yes, that can be done as well. My initial thought when writing this was to not use the ingestion spec at all in the command code and only use it inside the Spark job, since relying on it forces the ingestion spec to always be a local file path rather than, say, an S3 URI.

However, I did end up using the spec, because otherwise we can't load the PinotFS classes and find the appropriate plugin jars in S3, GCS, etc.

Contributor Author

I will take this up as a separate PR.

@KKcorps KKcorps merged commit 06b76e6 into apache:master Sep 7, 2022
@KKcorps KKcorps added the release-notes Referenced by PRs that need attention when compiling the next release notes label Sep 7, 2022