
lightgbm.dask hangs after worker restarting #5920

Closed
kunshouout opened this issue Jun 10, 2023 · 3 comments
@kunshouout

Description

The code for the Dask workers:

from dask.distributed import Client, LocalCluster
cluster = LocalCluster("127.0.0.1:8786", n_workers=2, threads_per_worker=4, memory_limit="28GB")
client = Client(cluster)

The code to train the LightGBM model is very simple:

from lightgbm import DaskLGBMRegressor

gbm = DaskLGBMRegressor(**kwargs, client=client)
model = gbm.fit(
    ds_l["train"][0],
    ds_l["train"][1],
    eval_set=[ds_l["valid"]],
    eval_names=["valid"],
)

After running for about 30 minutes, the program hangs. The logs are:

[LightGBM] [Info] Trying to bind port 52289...
[LightGBM] [Info] Binding port 52289 succeeded
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[LightGBM] [Info] Listening...
[LightGBM] [Info] Trying to bind port 51255...
[LightGBM] [Info] Binding port 51255 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Info] Local rank: 0, total number of machines: 2
2023-06-09 17:52:34,882 - distributed.worker_memory - WARNING - Worker is at 90% memory usage. Pausing worker.  Process memory: 23.58 GiB -- Worker memory limit: 26.08 GiB
2023-06-09 17:52:34,965 - distributed.worker_memory - WARNING - Worker is at 90% memory usage. Pausing worker.  Process memory: 23.53 GiB -- Worker memory limit: 26.08 GiB
2023-06-09 17:52:35,440 - distributed.worker_memory - WARNING - Worker tcp://127.0.0.1:42781 (pid=2484830) exceeded 95% memory budget. Restarting...
2023-06-09 17:52:35,541 - distributed.worker_memory - WARNING - Worker tcp://127.0.0.1:44669 (pid=2484828) exceeded 95% memory budget. Restarting...
2023-06-09 17:52:35,884 - distributed.nanny - WARNING - Restarting worker
2023-06-09 17:52:35,999 - distributed.nanny - WARNING - Restarting worker
Finding random open ports for workers

and the dashboard looks like this:

[dashboard screenshot: both workers running, memory usage back below the limit]

From the dashboard, it looks like the two workers have been successfully restarted, and the memory is now within the limit. Why can't the training proceed further?

Environment info

lightgbm: 3.3.5
dask: 2022.11.1
python: 3.8.10
OS: ubuntu 20.04

@jameslamb
Collaborator

Thanks for using LightGBM.

Unfortunately, LightGBM distributed training is not currently resilient to workers being lost during training. See #3775 for some details on that.

It's a feature we'd love to add in the future, so if you are familiar with C++, Python, TCP, and collective communication patterns we'd welcome contributions.

Otherwise, you will just have to ensure your workers have sufficient memory to survive the training process, and subscribe to #3775 to be notified if/when it's addressed.
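
For illustration only, here is a minimal sketch of that workaround on a LocalCluster. The memory_limit value and the pause threshold are assumptions you would size to your own machine and data, not tested recommendations, and disabling memory-based termination trades the restart-induced hang for the risk of the operating system killing the process outright.

import dask
from dask.distributed import Client, LocalCluster

# Set worker memory behaviour before creating the cluster so the settings
# are inherited by the worker processes.
dask.config.set({
    # Pause the worker (stop accepting new tasks) at 90% memory usage
    # instead of the default 80%. Illustrative value.
    "distributed.worker.memory.pause": 0.90,
    # Do not restart a worker just because it crossed the memory budget;
    # LightGBM distributed training cannot recover from a lost worker.
    "distributed.worker.memory.terminate": False,
})

# memory_limit here is an illustrative value; size it to your actual data.
cluster = LocalCluster(n_workers=2, threads_per_worker=4, memory_limit="40GB")
client = Client(cluster)

Even with termination disabled, the safer fix is simply giving each worker enough memory that it never approaches the limit during training.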

@github-actions

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2023