
[cpu] OOM with device_type = cuda when using lgb.cv(...) for parameters grid search #4952

Closed
denmoroz opened this issue Jan 15, 2022 · 6 comments
@denmoroz (Contributor) commented Jan 15, 2022

Description

I am using lgb.cv with device_type = cuda to find the best parameter set on validation data via grid search.

LightGBM fails with "Out of Memory" after a few training runs on the same fixed dataset with different parameter sets that do not depend on the data (num_leaves, lambda_l1, lambda_l2, and some others).
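For context, the grid-search pattern looks roughly like this. This is a minimal sketch, not the production code: the parameter grid, the nfold value, and the result key 'l2-mean' (which may be named differently in other LightGBM versions) are illustrative, and lgb_train is assumed to be a lgb.Dataset as in the example further below.

# Hedged sketch of the lgb.cv grid-search pattern described above.
import lightgbm as lgb

best_params, best_score = None, float('inf')
for num_leaves in [31, 63, 255]:  # illustrative grid
    trial_params = {
        'device_type': 'cuda',
        'objective': 'regression',
        'metric': 'l2',
        'num_leaves': num_leaves,
    }
    # lgb.cv trains nfold boosters per call on the same fixed Dataset
    cv_results = lgb.cv(trial_params, lgb_train, num_boost_round=1000, nfold=5)
    score = min(cv_results['l2-mean'])  # key name depends on version
    if score < best_score:
        best_params, best_score = trial_params, score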

Reproducible example

Let's take the simple example as a starting point and update it a bit to emulate cross-validation over num_leaves using the cuda device:

# coding: utf-8
from pathlib import Path

import pandas as pd
from sklearn.metrics import mean_squared_error

import lightgbm as lgb

print('Loading data...')
# load or create your dataset
regression_example_dir = Path(__file__).absolute().parents[1] / 'regression'
df_train = pd.read_csv(str(regression_example_dir / 'regression.train'), header=None, sep='\t')
df_test = pd.read_csv(str(regression_example_dir / 'regression.test'), header=None, sep='\t')

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# specify your configurations as a dict
params = {
    'device_type': 'cuda',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

for num_leaves in [31, 63, 255]:
    print('Starting training...')

    training_params = params.copy()
    training_params["num_leaves"] = num_leaves

    gbm = lgb.train(training_params,  # pass the per-trial params, not the base dict
                    lgb_train,
                    num_boost_round=1000,
                    valid_sets=lgb_eval,
                    callbacks=[lgb.early_stopping(stopping_rounds=100)])

    print('Starting predicting...')
    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
    # eval
    rmse_test = mean_squared_error(y_test, y_pred) ** 0.5
    print(f'The RMSE of prediction is: {rmse_test}')

While this runs, GPU memory usage grows after each iteration of the for loop (157 MB, 169 MB, and 181 MB respectively on my GTX 1080Ti).

On the real-world data I am working with (millions of rows, the same code structure) the situation is much more dramatic: it fails with "Out of Memory" at the start of the second iteration. It seems like the CUDA tensors for the feature matrix are created multiple times (once per run?).
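One way to watch the growth (a sketch assuming the third-party pynvml package; LightGBM itself does not report GPU memory) is to log device memory at the end of each loop iteration:

# Hedged sketch: log GPU memory between runs using the third-party
# pynvml package (pip install pynvml); not part of LightGBM.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

def log_gpu_memory(tag):
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f'{tag}: {info.used / 1024 ** 2:.0f} MB used')

Calling log_gpu_memory(f'after num_leaves={num_leaves}') at the end of each iteration of the loop above shows the per-run growth reported here.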

Environment info

Ubuntu Linux 16.04, CUDA 10.1, Nvidia driver 418.87.01

LightGBM version or commit hash:
3.3.2

Command(s) you used to install LightGBM:
pip install -U lightgbm==3.3.2 --install-option=--cuda

Additional Comments

@jameslamb (Collaborator)

Thanks for using LightGBM and for the thorough report!

Does the example code you provided also fail with "out of memory" errors? I noticed that it doesn't use lgb.cv(), but the issue title says "when using lgb.cv()". Which of these are you asking for?

  • help reducing the memory footprint of code following the pattern you've provided
  • help reducing the memory footprint of lgb.cv()
  • both of those
  • something else

@denmoroz (Contributor, Author) commented Jan 17, 2022

@jameslamb
Thanks for the quick response!

Regarding "but the issue title says 'when using lgb.cv()'": yes, in my real production code I use lgb.cv to perform cross-validation over various parameter sets to find the optimal one.

I haven't posted it because it is quite massive (there are a lot of custom abstractions that create validation folds, wrap custom metrics, use a distributed Optuna setup to yield promising parameter sets, plus custom callbacks and some other distributed code); it is basically part of a huge framework.

Instead, I took the simple example and modified it a bit to emulate a very similar situation in a small piece of code. The point is that I still observe memory growth at each iteration (so the abstractions are not the primary cause), although logically there should not be any: the data is fixed (I do not apply any changes to the Dataset), and after a quick look at cuda_tree_learner I think it should reuse the already allocated tensors on the GPU. In reality, though, it seems like cuda_tree_learner still copies the whole dataset to the GPU on each iteration (or does not properly release it when training is done).

If it helps, I can make the input dataset in this toy example 1000x bigger and try to run it again; I am sure it will fail with OOM.

@denmoroz (Contributor, Author) commented Jan 20, 2022

@jameslamb

Regarding this

If it helps, I can make the input dataset in this toy example 1000x bigger and try to run it again; I am sure it will fail with OOM.

I had a free minute to check.

Indeed, the following script

# coding: utf-8
from pathlib import Path

import pandas as pd
import lightgbm as lgb

print('Loading data...')
# load or create your dataset
regression_example_dir = Path(__file__).absolute().parents[1] / 'regression'
df_train = pd.read_csv(str(regression_example_dir / 'regression.train'), header=None, sep='\t')
df_test = pd.read_csv(str(regression_example_dir / 'regression.test'), header=None, sep='\t')

df_train = pd.concat([df_train] * 5000, ignore_index=True)
df_test = pd.concat([df_test] * 5000, ignore_index=True)

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# specify your configurations as a dict
params = {
    'device_type': 'cuda',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

for num_leaves in list(range(10, 1000, 10)):
    print('Starting training...')

    training_params = params.copy()
    training_params["num_leaves"] = num_leaves

    gbm = lgb.train(training_params,  # pass the per-trial params, not the base dict
                    lgb_train,
                    num_boost_round=10,
                    valid_sets=[lgb_eval])

fails after 8 iterations on my GTX 1080Ti with

[LightGBM] [Fatal] [CUDA] out of memory /tmp/pip-install-enl8e16_/lightgbm_2591f42f1f7645cbbbbd8d4539df8eb3/compile/src/treelearner/cuda_tree_learner.cpp 333

terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] out of memory /tmp/pip-install-enl8e16_/lightgbm_2591f42f1f7645cbbbbd8d4539df8eb3/compile/src/treelearner/cuda_tree_learner.cpp 333

So there is definitely a memory-freeing issue somewhere, as far as I understand.

It also looks suspicious to me that CUDATreeLearner does not perform any GPU memory cleanup in its destructor, while GPUTreeLearner does.

UPD:
It seems like freeing the memory in the destructor solves the issue. No more crashes for the snippet I posted above.

UPD-2:
I also checked my production setup with real-world data: everything now works as expected, and no memory growth is observed anymore.
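For readers stuck on a LightGBM build that predates the destructor fix, a possible mitigation (a sketch under stated assumptions, not something discussed in this thread; train_one_trial is a hypothetical helper) is to run each trial in its own process, so every GPU allocation is released when the process exits:

# Hedged workaround sketch for builds without the fix: isolate each
# trial in a child process; its CUDA context (and all device memory)
# is torn down when the child exits. train_one_trial is hypothetical.
import multiprocessing as mp

def train_one_trial(num_leaves, queue):
    import lightgbm as lgb  # import inside the child process
    # ... build the lgb.Dataset and call lgb.train(...) here ...
    queue.put(num_leaves)  # send back whatever result is needed

if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # fresh interpreter per trial
    for num_leaves in [31, 63, 255]:
        queue = ctx.Queue()
        p = ctx.Process(target=train_one_trial, args=(num_leaves, queue))
        p.start()
        print('finished trial:', queue.get())  # wait for the result
        p.join()  # all GPU memory is freed once the child exits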

@denmoroz (Contributor, Author)

Closing, as the PR containing the fix has been merged.

@jameslamb (Collaborator)

Thanks so much for the help @denmoroz !

@jameslamb jameslamb changed the title OOM with device_type = cuda when using lgb.cv(...) for parameters grid search [cpu] OOM with device_type = cuda when using lgb.cv(...) for parameters grid search Feb 21, 2022
@github-actions (bot)

This issue has been automatically locked since there has not been any recent activity after it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023