
[BUG] When spark.executor.instances > 1, the JVM always crashes. #2302

Open

hjr1998 opened this issue Oct 15, 2024 · 0 comments
Comments


hjr1998 commented Oct 15, 2024

SynapseML version

1.0.7

System information

  • Language version (e.g. python 3.8, scala 2.12): Python 3.8
  • Spark Version (e.g. 3.2.3): 3.3.2
  • Spark Platform (e.g. Synapse, Databricks):

Describe the problem

I am training a LightGBM classifier. The JVM always crashes during training when spark.executor.instances > 1, but training completes fine when spark.executor.instances = 1. Can anyone help me with this issue?

Code to reproduce issue

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import DoubleType, IntegerType
from synapse.ml.lightgbm import LightGBMClassifier

# Cast feature columns to double and the label to integer, then assemble features.
for col in vecCols:
    train = train.withColumn(col, train[col].cast(DoubleType()))
train = train.withColumn(labelCol, train[labelCol].cast(IntegerType()))
assembler = VectorAssembler(inputCols=vecCols, outputCol="features", handleInvalid="keep")
pipeline = Pipeline(stages=[assembler])
train = pipeline.fit(train).transform(train)

classifier = LightGBMClassifier(featuresCol="features", categoricalSlotNames=cateCols,
                                featuresShapCol="importances", labelCol=labelCol,
                                verbosity=10, executionMode="streaming", useSingleDatasetMode=True)
model = classifier.fit(train)  # the crash occurs here during training
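For context, the executor count is typically set when submitting the job. A minimal sketch of such a submission, assuming YARN and spark-submit (the script name and memory value are placeholders, not taken from the report):

# spark.executor.instances > 1 reproduces the crash; = 1 works, per the description above
spark-submit \
  --master yarn \
  --conf spark.executor.instances=2 \
  --conf spark.executor.memory=8g \
  your_training_script.py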

Other info / logs

24/10/15 14:26:29 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 2146 records.
24/10/15 14:26:29 INFO InternalParquetRecordReader: at row 0. reading next block
24/10/15 14:26:29 INFO InternalParquetRecordReader: block read in memory in 2 ms. row count = 2146
24/10/15 14:26:30 INFO StreamingPartitionTask: done with data preparation on partition 1, task 29
24/10/15 14:26:30 INFO StreamingPartitionTask: Helper task 29, partition 1 finished processing rows
24/10/15 14:26:30 INFO StreamingPartitionTask: Beginning cleanup for partition 1, task 29
24/10/15 14:26:30 INFO StreamingPartitionTask: Done with cleanup for partition 1, task 29
24/10/15 14:26:30 INFO StreamingPartitionTask: Getting final training Dataset for partition 5.
24/10/15 14:26:30 INFO Executor: Finished task 1.0 in stage 8.0 (TID 29). 1789 bytes result sent to driver
24/10/15 14:26:30 INFO StreamingPartitionTask: Creating LightGBM Booster for partition 5, task 33
24/10/15 14:26:30 INFO StreamingPartitionTask: Beginning training on LightGBM Booster for task 33, partition 5
24/10/15 14:26:30 INFO StreamingPartitionTask: LightGBM task starting iteration 0
[LightGBM] [Info] Number of positive: 61700, number of negative: 1063514
[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.844580
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.479102
[LightGBM] [Debug] init for col-wise cost 0.115891 seconds, init for row-wise cost 0.390814 seconds
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.161628 seconds.
You can set force_row_wise=true to remove the overhead.
And if memory is not enough, you can set force_col_wise=true.
[LightGBM] [Debug] Using Sparse Multi-Val Bin
[LightGBM] [Info] Total Bins 10088
[LightGBM] [Info] Number of data points in the train set: 580970, number of used features: 83
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.054834 -> initscore=-2.847050
[LightGBM] [Info] Start training from score -2.847050

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007fd300c484db, pid=3104731, tid=0x00007fd30bc28700

JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)

Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 )

Problematic frame:

C [lib_lightgbm.so+0x3a54db] LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0xf7b

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

An error report file with more information is saved as:

/data/disk03/hadoop/yarn/local/usercache/appcache/application_1715260252878_48344/container_e25_1715260252878_48344_02_000003/hs_err_pid3104731.log

If you would like to submit a bug report, please visit:

http://bugreport.java.com/bugreport/crash.jsp

The crash happened outside the Java Virtual Machine in native code.

See problematic frame for where to report the bug.

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations
@hjr1998 hjr1998 added the bug label Oct 15, 2024