
[BUG] When spark.executor.instances > 1, the JVM always crashes. #2302

Open

hjr1998 opened this issue Oct 15, 2024 · 0 comments
Comments


hjr1998 commented Oct 15, 2024

SynapseML version

1.0.7

System information

  • Language version (e.g. python 3.8, scala 2.12): Python 3.8
  • Spark Version (e.g. 3.2.3): 3.3.2
  • Spark Platform (e.g. Synapse, Databricks):

Describe the problem

I am training a LightGBM classifier. The JVM always crashes during training when spark.executor.instances > 1, but training completes fine when spark.executor.instances = 1. Can anyone help me with this issue?

Code to reproduce issue

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import DoubleType, IntegerType
from synapse.ml.lightgbm import LightGBMClassifier

# Cast feature columns to double and the label to integer, then assemble features.
for col in vecCols:
    train = train.withColumn(col, train[col].cast(DoubleType()))
train = train.withColumn(labelCol, train[labelCol].cast(IntegerType()))
assembler = VectorAssembler(inputCols=vecCols, outputCol="features", handleInvalid="keep")
pipeline = Pipeline(stages=[assembler])
train = pipeline.fit(train).transform(train)

classifier = LightGBMClassifier(featuresCol="features", categoricalSlotNames=cateCols,
                                featuresShapCol="importances", labelCol=labelCol,
                                verbosity=10, executionMode="streaming", useSingleDatasetMode=True)
model = classifier.fit(train)  # the crash occurs here during training
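For context, the executor count is typically set when submitting the job. A minimal sketch of such a submission, assuming YARN and spark-submit (the script name and memory value are placeholders, not taken from the report):

# spark.executor.instances > 1 reproduces the crash; = 1 works, per the description above
spark-submit \
  --master yarn \
  --conf spark.executor.instances=2 \
  --conf spark.executor.memory=8g \
  your_training_script.py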

Other info / logs

24/10/15 14:26:29 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 2146 records.
24/10/15 14:26:29 INFO InternalParquetRecordReader: at row 0. reading next block
24/10/15 14:26:29 INFO InternalParquetRecordReader: block read in memory in 2 ms. row count = 2146
24/10/15 14:26:30 INFO StreamingPartitionTask: done with data preparation on partition 1, task 29
24/10/15 14:26:30 INFO StreamingPartitionTask: Helper task 29, partition 1 finished processing rows
24/10/15 14:26:30 INFO StreamingPartitionTask: Beginning cleanup for partition 1, task 29
24/10/15 14:26:30 INFO StreamingPartitionTask: Done with cleanup for partition 1, task 29
24/10/15 14:26:30 INFO StreamingPartitionTask: Getting final training Dataset for partition 5.
24/10/15 14:26:30 INFO Executor: Finished task 1.0 in stage 8.0 (TID 29). 1789 bytes result sent to driver
24/10/15 14:26:30 INFO StreamingPartitionTask: Creating LightGBM Booster for partition 5, task 33
24/10/15 14:26:30 INFO StreamingPartitionTask: Beginning training on LightGBM Booster for task 33, partition 5
24/10/15 14:26:30 INFO StreamingPartitionTask: LightGBM task starting iteration 0
[LightGBM] [Info] Number of positive: 61700, number of negative: 1063514
[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.844580
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.479102
[LightGBM] [Debug] init for col-wise cost 0.115891 seconds, init for row-wise cost 0.390814 seconds
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.161628 seconds.
You can set force_row_wise=true to remove the overhead.
And if memory is not enough, you can set force_col_wise=true.
[LightGBM] [Debug] Using Sparse Multi-Val Bin
[LightGBM] [Info] Total Bins 10088
[LightGBM] [Info] Number of data points in the train set: 580970, number of used features: 83
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.054834 -> initscore=-2.847050
[LightGBM] [Info] Start training from score -2.847050

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007fd300c484db, pid=3104731, tid=0x00007fd30bc28700

JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)

Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 )

Problematic frame:

C [lib_lightgbm.so+0x3a54db] LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0xf7b

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

An error report file with more information is saved as:

/data/disk03/hadoop/yarn/local/usercache/appcache/application_1715260252878_48344/container_e25_1715260252878_48344_02_000003/hs_err_pid3104731.log

If you would like to submit a bug report, please visit:

http://bugreport.java.com/bugreport/crash.jsp

The crash happened outside the Java Virtual Machine in native code.

See problematic frame for where to report the bug.

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations
@hjr1998 hjr1998 added the bug label Oct 15, 2024