[BUG] LightGBMRanker: `groupCol` not recognized - LightGBM sees all records in the DataFrame as part of 1 query/group #2290

Vonatzki · 2024-09-20T09:06:19Z

SynapseML version

com.microsoft.azure:synapseml_2.12:1.0.5

System information

Language version: Python 3.12.2, Scala 2.12
Spark Version: 3.5.2
Spark Platform: Local (Using Macbook Pro M2 w/ 12 cores 18gb RAM)

Describe the problem

I am encountering an issue with LightGBMRanker: the model does not seem to recognize that the PySpark DataFrame I am using for training is composed of many queries/groups.

In the native version of LightGBM, there is a parameter called group where you specify an array-like sequence that indicates the number of samples per query/group, something like [10, 20, 30], where the sum of this array is the total number of samples. In my case, there are 31,674 records in my dataset.
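To make the native `group` convention concrete, here is a minimal sketch in plain Python that derives per-query sizes from a column of query ids. The helper name `group_sizes` and the sample ids are hypothetical, and the sketch assumes rows of the same query are already contiguous, as the native parameter expects.

```python
from itertools import groupby


def group_sizes(query_ids):
    """Return the number of consecutive rows for each query, in row order.

    Assumes rows that belong to the same query are already contiguous.
    """
    return [sum(1 for _ in rows) for _, rows in groupby(query_ids)]


# Hypothetical query ids for six rows spanning three queries.
query_ids = ["q1", "q1", "q2", "q2", "q2", "q3"]
sizes = group_sizes(query_ids)
print(sizes)  # [2, 3, 1]
```

The sizes sum to the number of rows, which matches the constraint described above.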

I am wondering how SynapseML does this under the hood, given that one only specifies groupCol and nothing else.

As shown in the error log, it seems the LightGBM model was not given any knowledge of how the records are grouped and is therefore complaining that all observations are part of the same query.

Code to reproduce issue

from pyspark.ml.feature import VectorAssembler
from pyspark.sql import functions as F
from synapse.ml.lightgbm import LightGBMRanker

# `train` contains `query_id`, which indicates how each record is grouped
train = spark.read.parquet("my_ranking_dataset.parquet")

# `feature_cols` is the list of feature column names (not defined in this snippet)
vec_assembler = VectorAssembler(inputCols=feature_cols, outputCol="features", handleInvalid="keep")
train_with_vec = vec_assembler.transform(train)
# Scale `relevance` into integer labels for ranking
train_with_vec = train_with_vec.withColumn("labels", (30 * F.col("relevance")).astype("int"))

features_col = "features"
query_col = "query_id"
label_col = "labels"

lgbm_ranker = LightGBMRanker(
    labelCol=label_col,
    featuresCol=features_col,
    groupCol=query_col, # As shown here, I indicated `query_id` as the groupCol parameter value.
    predictionCol="preds",
    leafPredictionCol="leafPreds",
    featuresShapCol="importances",
    repartitionByGroupingColumn=True,
    numLeaves=32,
    numIterations=200,
    evalAt=[1, 3, 5],
    metric="ndcg",
    useBarrierExecutionMode=True,
    verbosity=4,
)

lgbm_ranker.fit(
    train_with_vec
    .join(
        train_with_vec
        .select('query_id')
        .distinct()
        .sample(0.001),
        on='query_id',
        how='inner',
    )
)

Other info / logs

[LightGBM] [Info] Saving data reference to binary buffer
[Stage 71:>                                                         (0 + 8) / 8]
[LightGBM] [Info] Loaded reference dataset: 129 features, 31674 num_data
[LightGBM] [Fatal] Number of rows 31674 exceeds upper limit of 10000 for a query
24/09/20 16:53:52 WARN StreamingPartitionTask: LightGBM reached early termination on one task, stopping training on task. This message should rarely occur. Inner exception: java.lang.Exception: Booster call failed in LightGBM with error: Number of rows 31674 exceeds upper limit of 10000 for a query
[LightGBM] [Warning] Unknown parameter: max_position
[LightGBM] [Warning] Unknown parameter: max_position
[LightGBM] [Fatal] Number of rows 31674 exceeds upper limit of 10000 for a query
[LightGBM] [Warning] Unknown parameter: max_position
[LightGBM] [Fatal] Number of rows 31674 exceeds upper limit of 10000 for a query
24/09/20 16:53:53 ERROR Executor: Exception in task 0.0 in stage 71.0 (TID 2052)
java.lang.Exception: Booster call failed in LightGBM with error: Number of rows 31674 exceeds upper limit of 10000 for a query
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMUtils$.validate(LightGBMUtils.scala:18)
	at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.boosterHandler$lzycompute(LightGBMBooster.scala:242)
	at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.boosterHandler(LightGBMBooster.scala:232)
	at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.freeNativeMemory(LightGBMBooster.scala:493)
	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.finalizeDatasetAndTrain(BasePartitionTask.scala:263)
	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:152)
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:615)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
...
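
The fatal message in the log above reports a single query of 31674 rows against a per-query limit of 10000. One way to check whether the data itself contains a query that large is sketched below in plain Python; the helper name `oversized_queries` and the sample ids are hypothetical, and it assumes the query ids have been collected into a local list.

```python
from collections import Counter

# Per-query row limit quoted in the [Fatal] log line.
MAX_ROWS_PER_QUERY = 10000


def oversized_queries(query_ids, limit=MAX_ROWS_PER_QUERY):
    """Return {query_id: row_count} for every query with more rows than `limit`."""
    counts = Counter(query_ids)
    return {query: n for query, n in counts.items() if n > limit}


# Hypothetical ids: query "a" has 3 rows, query "b" has 5 rows.
sample_ids = ["a"] * 3 + ["b"] * 5
print(oversized_queries(sample_ids, limit=4))  # {'b': 5}
```

If this returns an empty dict for the real `query_id` values, the oversized query seen by LightGBM would not come from the data itself.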

What component(s) does this bug affect?

What language(s) does this bug affect?

language/scala: Scala source code
language/python: Pyspark APIs
language/r: R APIs
language/csharp: .NET APIs
language/new: Proposals for new client languages

What integration(s) does this bug affect?

integrations/synapse: Azure Synapse integrations
integrations/azureml: Azure ML integrations
integrations/databricks: Databricks integrations

Vonatzki added the bug label Sep 20, 2024

github-actions bot added the triage label Sep 20, 2024
