
[cpu] OOM with device_type = cuda when using lgb.cv(...) for parameters grid search #4952

Closed
denmoroz opened this issue Jan 15, 2022 · 6 comments
@denmoroz (Contributor) commented Jan 15, 2022

Description

I am using lgb.cv with device_type = cuda to find the best parameter set on validation data via grid search.

LightGBM fails with "Out of Memory" after a few training runs on the same fixed dataset with different parameter sets that do not depend on the data (num_leaves, lambda_l1, lambda_l2, and some others).
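For context, the grid-search pattern looks roughly like this. This is a minimal sketch, not the production code: the parameter grid, the nfold value, and the result key 'l2-mean' (which may be named differently in other LightGBM versions) are illustrative, and lgb_train is assumed to be a lgb.Dataset as in the example further below.

# Hedged sketch of the lgb.cv grid-search pattern described above.
import lightgbm as lgb

best_params, best_score = None, float('inf')
for num_leaves in [31, 63, 255]:  # illustrative grid
    trial_params = {
        'device_type': 'cuda',
        'objective': 'regression',
        'metric': 'l2',
        'num_leaves': num_leaves,
    }
    # lgb.cv trains nfold boosters per call on the same fixed Dataset
    cv_results = lgb.cv(trial_params, lgb_train, num_boost_round=1000, nfold=5)
    score = min(cv_results['l2-mean'])  # key name depends on version
    if score < best_score:
        best_params, best_score = trial_params, score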

Reproducible example

Let's take the simple example as a starting point and update it a bit to emulate cross-validation over num_leaves using the cuda device:

# coding: utf-8
from pathlib import Path

import pandas as pd
from sklearn.metrics import mean_squared_error

import lightgbm as lgb

print('Loading data...')
# load or create your dataset
regression_example_dir = Path(__file__).absolute().parents[1] / 'regression'
df_train = pd.read_csv(str(regression_example_dir / 'regression.train'), header=None, sep='\t')
df_test = pd.read_csv(str(regression_example_dir / 'regression.test'), header=None, sep='\t')

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# specify your configurations as a dict
params = {
    'device_type': 'cuda',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

for num_leaves in [31, 63, 255]:
    print('Starting training...')

    training_params = params.copy()
    training_params["num_leaves"] = num_leaves

    gbm = lgb.train(training_params,  # pass the per-trial params, not the base dict
                    lgb_train,
                    num_boost_round=1000,
                    valid_sets=lgb_eval,
                    callbacks=[lgb.early_stopping(stopping_rounds=100)])

    print('Starting predicting...')
    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
    # eval
    rmse_test = mean_squared_error(y_test, y_pred) ** 0.5
    print(f'The RMSE of prediction is: {rmse_test}')

While this runs, GPU memory usage grows after each iteration of the for loop (157 MB, 169 MB, and 181 MB respectively on my GTX 1080Ti).

On the real-world data I am working with (millions of rows, the same code structure) the situation is much more dramatic: it fails with "Out of Memory" at the start of the second iteration. It seems like the CUDA tensors for the feature matrix are created multiple times (once per run?).
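One way to watch the growth (a sketch assuming the third-party pynvml package; LightGBM itself does not report GPU memory) is to log device memory at the end of each loop iteration:

# Hedged sketch: log GPU memory between runs using the third-party
# pynvml package (pip install pynvml); not part of LightGBM.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

def log_gpu_memory(tag):
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f'{tag}: {info.used / 1024 ** 2:.0f} MB used')

Calling log_gpu_memory(f'after num_leaves={num_leaves}') at the end of each iteration of the loop above shows the per-run growth reported here.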

Environment info

Ubuntu Linux 16.04, CUDA 10.1, Nvidia driver 418.87.01

LightGBM version or commit hash:
3.3.2

Command(s) you used to install LightGBM:
pip install -U lightgbm==3.3.2 --install-option=--cuda

Additional Comments

@jameslamb (Collaborator)

Thanks for using LightGBM and for the thorough report!

Does the example code you provided also fail with "out of memory" errors? I noticed that it doesn't use lgb.cv(), but the issue title says "when using lgb.cv()". Which of these are you asking for?

  • help reducing the memory footprint of code following the pattern you've provided
  • help reducing the memory footprint of lgb.cv()
  • both of those
  • something else

@denmoroz (Contributor, Author) commented Jan 17, 2022

@jameslamb
Thanks for the quick response!

Regarding "but the issue title says 'when using lgb.cv()'": yes, in my real production code I use lgb.cv to perform cross-validation over various parameter sets to find the optimal one.

I haven't posted it because it is quite massive (there are a lot of custom abstractions that create validation folds, wrap custom metrics, use a distributed Optuna setup to yield promising parameter sets, plus custom callbacks and some other distributed code); it is basically part of a huge framework.

Instead, I took the simple example and modified it a bit to emulate a very similar situation in a small piece of code. The point is that I still observe memory growth at each iteration (so the abstractions are not the primary cause), although logically there should not be any: the data is fixed (I do not apply any changes to the Dataset), and after a quick look at cuda_tree_learner I think it should reuse the already allocated tensors on the GPU. In reality, though, it seems like cuda_tree_learner still copies the whole dataset to the GPU on each iteration (or does not properly release it when training is done).

If it helps, I can make the input dataset in this toy example 1000x bigger and try to run it again; I am sure it will fail with OOM.

@denmoroz (Contributor, Author) commented Jan 20, 2022

@jameslamb

Regarding this

If it helps, I can make the input dataset in this toy example 1000x bigger and try to run it again; I am sure it will fail with OOM.

I had a free minute to check.

Indeed, the following script

# coding: utf-8
from pathlib import Path

import pandas as pd
import lightgbm as lgb

print('Loading data...')
# load or create your dataset
regression_example_dir = Path(__file__).absolute().parents[1] / 'regression'
df_train = pd.read_csv(str(regression_example_dir / 'regression.train'), header=None, sep='\t')
df_test = pd.read_csv(str(regression_example_dir / 'regression.test'), header=None, sep='\t')

df_train = pd.concat([df_train] * 5000, ignore_index=True)
df_test = pd.concat([df_test] * 5000, ignore_index=True)

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# specify your configurations as a dict
params = {
    'device_type': 'cuda',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

for num_leaves in list(range(10, 1000, 10)):
    print('Starting training...')

    training_params = params.copy()
    training_params["num_leaves"] = num_leaves

    gbm = lgb.train(training_params,  # pass the per-trial params, not the base dict
                    lgb_train,
                    num_boost_round=10,
                    valid_sets=[lgb_eval])

fails after 8 iterations on my GTX 1080Ti with

[LightGBM] [Fatal] [CUDA] out of memory /tmp/pip-install-enl8e16_/lightgbm_2591f42f1f7645cbbbbd8d4539df8eb3/compile/src/treelearner/cuda_tree_learner.cpp 333

terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] out of memory /tmp/pip-install-enl8e16_/lightgbm_2591f42f1f7645cbbbbd8d4539df8eb3/compile/src/treelearner/cuda_tree_learner.cpp 333

So there is definitely a memory-freeing issue somewhere, as far as I understand.

It also looks suspicious to me that CUDATreeLearner does not perform any GPU memory cleanup in its destructor, while GPUTreeLearner does.

UPD:
It seems like freeing the memory in the destructor solves the issue. No more crashes for the snippet I posted above.

UPD-2:
I also checked my production setup with real-world data: everything now works as expected, and no memory growth is observed anymore.
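For readers stuck on a LightGBM build that predates the destructor fix, a possible mitigation (a sketch under stated assumptions, not something discussed in this thread; train_one_trial is a hypothetical helper) is to run each trial in its own process, so every GPU allocation is released when the process exits:

# Hedged workaround sketch for builds without the fix: isolate each
# trial in a child process; its CUDA context (and all device memory)
# is torn down when the child exits. train_one_trial is hypothetical.
import multiprocessing as mp

def train_one_trial(num_leaves, queue):
    import lightgbm as lgb  # import inside the child process
    # ... build the lgb.Dataset and call lgb.train(...) here ...
    queue.put(num_leaves)  # send back whatever result is needed

if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # fresh interpreter per trial
    for num_leaves in [31, 63, 255]:
        queue = ctx.Queue()
        p = ctx.Process(target=train_one_trial, args=(num_leaves, queue))
        p.start()
        print('finished trial:', queue.get())  # wait for the result
        p.join()  # all GPU memory is freed once the child exits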

@denmoroz (Contributor, Author)

Closing, as the PR containing the fix has been merged.

@jameslamb (Collaborator)

Thanks so much for the help @denmoroz !

@jameslamb jameslamb changed the title OOM with device_type = cuda when using lgb.cv(...) for parameters grid search [cpu] OOM with device_type = cuda when using lgb.cv(...) for parameters grid search Feb 21, 2022
@github-actions (bot)

This issue has been automatically locked since there has not been any recent activity after it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023