[cpu] OOM with device_type = cuda when using lgb.cv(...) for parameters grid search #4952
Comments
Thanks for using LightGBM and for the thorough report! Does the example code you provided also fail with "out of memory" errors? I noticed that it doesn't use
@jameslamb Regarding the actual code: I haven't posted it as it is quite massive (there are a lot of custom abstractions that create validation folds, wrap custom metrics, use a distributed Optuna setup to yield promising parameter sets, plus custom callbacks, some other distributed code, and so on); it is basically part of a huge framework. Instead, I've taken the simple example and modified it a bit to emulate a very similar situation in a small piece of code. The point here is that I still observe memory growth at each iteration (so the abstractions are not the primary cause), but from a logical perspective it shouldn't grow: the data is fixed (I do not apply any changes to the Dataset). If it helps, I can make the input dataset in this toy example 1000x bigger and try to run it again; I am sure it will fail with OOM.
Regarding this, I had a free minute to check. Indeed, the toy example with a much larger input dataset fails after 8 iterations on my GTX 1080Ti with an out-of-memory error. So there is definitely a memory-freeing issue somewhere, as far as I understand. What looks suspicious to me is that CUDATreeLearner does not perform any GPU memory cleanup in its destructor, while GPUTreeLearner does.
Closing, as the PR containing the fix has been merged.
Thanks so much for the help, @denmoroz!
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Description
I am using `lgb.cv` with `device_type = cuda` to find the best parameter set on validation data using grid search. LightGBM fails with "Out of Memory" after a few training runs on the same fixed dataset with different parameter sets that do not depend on the data (`num_leaves`, `lambda_l1`, `lambda_l2`, and some others).
Let's take the simple example as a starting point and update it a bit to emulate cross-validation over `num_leaves` using the `cuda` device:
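A minimal sketch along these lines, assuming a CUDA-enabled LightGBM build and synthetic NumPy data in place of the original inputs (the parameter values are illustrative):

```python
import numpy as np
import lightgbm as lgb

# Synthetic stand-in for the fixed training data.
rng = np.random.default_rng(42)
X = rng.standard_normal((10_000, 50))
y = rng.standard_normal(10_000)

train_data = lgb.Dataset(X, label=y, free_raw_data=False)

# Grid over a parameter that does not depend on the data.
for num_leaves in (15, 31, 63):
    params = {
        "objective": "regression",
        "device_type": "cuda",
        "num_leaves": num_leaves,
        "verbose": -1,
    }
    # Each call retrains from scratch on the same fixed Dataset.
    lgb.cv(params, train_data, num_boost_round=50,
           nfold=3, stratified=False)
```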
While running this, GPU memory usage grows steadily after each iteration of the `for` loop (157 MB, 169 MB, and 181 MB respectively on my GTX 1080Ti).

On the real-world data I am working with (millions of rows, the same code structure) the situation is much more dramatic: it fails with "Out of Memory" right at the start of the second iteration. It seems like the CUDA tensors for the feature matrix are created multiple times (once per run?).
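One way to observe this growth per iteration is to poll `nvidia-smi` between runs; a minimal sketch, assuming `nvidia-smi` is on the PATH and a single visible GPU:

```python
import subprocess

def gpu_memory_used_mb() -> int:
    """Current GPU memory usage in MB, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]
    )
    return int(out.decode().splitlines()[0])

# Call once after each lgb.cv(...) run, e.g.:
print(f"GPU memory used: {gpu_memory_used_mb()} MB")
```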
Environment info
Ubuntu Linux 16.04, CUDA 10.1, Nvidia driver 418.87.01
LightGBM version or commit hash: `3.3.2`

Command(s) you used to install LightGBM: `pip install -U lightgbm==3.3.2 --install-option=--cuda`
Additional Comments