[Dask][macOS] distributed.nanny - WARNING - Restarting worker and shows no worker #4625

Closed
orcahmlee opened this issue Sep 24, 2021 · 13 comments

@orcahmlee

Description

I recently tried using lightgbm.dask with lightgbm==3.2.1, but I randomly hit the error LightGBMError: Socket recv error, code: 54; details in #4116 (comment).

Thanks to @jameslamb's suggestion, I installed lightgbm from source (54facc4) to verify whether I would keep hitting LightGBMError: Socket recv error, code: 54. However, I ran into a new situation.

I ran a simple regression example in JupyterLab on macOS with lightgbm==3.2.1 and lightgbm==3.2.1.99 (54facc4). There are not many logs I can paste, but I will describe what is happening as clearly as possible.

3.2.1

  1. Submit the job from the Jupyter client
  2. Progress shows: array -> dict -> _train_part
  3. Worker memory increases by only ~60 MiB per worker
  4. Receive the result on the Jupyter client

(screenshot)

3.2.1.99 (54facc4)

  1. Submit the job from the Jupyter client
  2. Progress shows: array -> dict -> _train_part
  3. The original workers were killed for some reason
    • I'm not sure whether they ran out of memory, because memory increased by only ~60 MiB per worker
    • Progress then shows: deserialized_find_n_ports -> find_n_ports -> array -> dict
  4. Dask Task Graph shows: no-worker
  5. Jupyter client shows: distributed.nanny - WARNING - Restarting worker, then waits forever

(screenshots)

Reproducible example

import dask.array as da
from distributed import Client, LocalCluster
from sklearn.datasets import make_regression

import lightgbm as lgb


cluster = LocalCluster(n_workers=2, dashboard_address=':9797')
client = Client(cluster)

X, y = make_regression(n_samples=7000, n_features=50)
dX = da.from_array(X, chunks=(100, 50))
dy = da.from_array(y, chunks=(100,))

print("beginning training")

dask_model = lgb.DaskLGBMRegressor(n_estimators=10)
dask_model.fit(dX, dy)
assert dask_model.fitted_

print("done training")

Output from Jupyter

/Users/andrew/miniconda3/envs/lightgbm/lib/python3.8/site-packages/lightgbm/dask.py:525: UserWarning: Parameter n_jobs will be ignored.
  _log_warning(f"Parameter {param_alias} will be ignored.")

Finding random open ports for workers
[LightGBM] [Info] Trying to bind port 51431...
[LightGBM] [Info] Binding port 51431 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 260 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 338 milliseconds
[LightGBM] [Info] Trying to bind port 51432...
[LightGBM] [Info] Binding port 51432 succeeded
[LightGBM] [Info] Listening...

distributed.nanny - WARNING - Restarting worker

[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4

distributed.nanny - WARNING - Restarting worker

Environment info

macOS 11.5.2
conda 4.10.3
python                    3.8.10          h0e5c897_0_cpython    conda-forge
dask                      2021.9.0           pyhd8ed1ab_0    conda-forge
dask-core                 2021.9.0           pyhd8ed1ab_0    conda-forge
dask-kubernetes           2021.3.1           pyhd8ed1ab_0    conda-forge
distributed               2021.9.0         py38h50d1736_0    conda-forge
lightgbm                  3.2.1.99                 pypi_0    pypi

LightGBM version or commit hash: 54facc4

Command(s) you used to install LightGBM:

git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM/python-package
python setup.py install
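
A quick way to confirm that the source build is the one actually imported (nothing assumed here beyond the package itself):

python -c "import lightgbm; print(lightgbm.__version__)"   # expect 3.2.1.99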

Additional Comments

None

@jmoralez (Collaborator)

I just ran the training 100 times in a row. Does the hang happen every time for you?

@orcahmlee (Author)

> I just ran the training 100 times in a row. Does the hang happen every time for you?

For version 3.2.1.99 (54facc4), yes. It happened every time I ran it.

@jmoralez (Collaborator)

@orcahmlee are you using your system's libomp? We've seen some problems with version 12.0.0 on macOS.

@orcahmlee (Author)

> @orcahmlee are you using your system's libomp? We've seen some problems with version 12.0.0 on macOS.

I followed the instructions to install the build tools (build-from-sources, apple-clang).

I installed cmake and libomp using Homebrew, then built from source using setup.py.

brew install cmake
brew install libomp

git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM/python-package
python setup.py install

The gcc and libomp information is shown below; hope it is useful.

$ gcc -v
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 12.0.5 (clang-1205.0.22.9)
Target: x86_64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
$ brew info libomp
libomp: stable 12.0.1 (bottled)
LLVM's OpenMP runtime library
https://openmp.llvm.org/
/usr/local/Cellar/libomp/12.0.1 (9 files, 1.5MB) *
  Poured from bottle on 2021-09-17 at 11:43:33
From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/libomp.rb
License: MIT
==> Dependencies
Build: cmake ✘
==> Analytics
install: 72,407 (30 days), 257,039 (90 days), 1,153,144 (365 days)
install-on-request: 9,466 (30 days), 31,639 (90 days), 136,392 (365 days)
build-error: 0 (30 days)
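
For reference, one way to check which OpenMP runtime the compiled LightGBM library actually links against on macOS; both the site-packages path and the library filename vary by environment, so treat them as illustrative:

# Locate the installed package, then list the dynamic libraries it loads:
python -c "import lightgbm, os; print(os.path.dirname(lightgbm.__file__))"
otool -L ~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/lightgbm/lib_lightgbm.so | grep -i omp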

@jmoralez (Collaborator) commented Oct 1, 2021

@orcahmlee I'm pretty sure libomp 12 is the problem here, it can cause segfaults (#4632, #4229) so that's probably what's killing the workers. Can you try downgrading it?

@orcahmlee (Author) commented Oct 4, 2021

> @orcahmlee I'm pretty sure libomp 12 is the problem here, it can cause segfaults (#4632, #4229) so that's probably what's killing the workers. Can you try downgrading it?

Thanks for this information (#4229); it allowed me to downgrade to libomp 11.

Since downgrading to libomp 11, everything has been going well on macOS. The situation I described above has never shown up again.

BTW, I just saw that Homebrew upgraded libomp to 13 (https://github.com/Homebrew/homebrew-core/blob/679923b4eb48a8dc7ecc1f05d06063cd79b3fc00/Formula/libomp.rb), and I tried to run the same example code using libomp 13, but it does NOT work!

In short, in this case on my machine:

  • libomp 11.1.0: works
  • libomp 12.0.1: DOES NOT work
  • libomp 13.0.0: DOES NOT work
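
For anyone landing here, one possible way to pin the working libomp 11.x with Homebrew; the tap name is arbitrary and the exact brew subcommands may differ across Homebrew versions, so this is a sketch rather than a verified recipe:

# Remove the current libomp, extract the 11.1.0 formula into a local tap,
# and install that pinned version:
brew uninstall --ignore-dependencies libomp
brew tap-new local/pins
brew extract --version=11.1.0 libomp local/pins
brew install local/pins/libomp@11.1.0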

@jmoralez (Collaborator) commented Oct 4, 2021

Just to be sure, it doesn't work with libomp 13?

@orcahmlee (Author)

> Just to be sure, it doesn't work with libomp 13?

No, it doesn't work with libomp 13 (installed from Homebrew) when I run the same example code.

This time I didn't see distributed.nanny - WARNING - Restarting worker, and the workers weren't killed. The _train_part task just hangs there forever.

  1. Submit the job from the Jupyter client
  2. Progress shows: array -> dict -> _train_part
  3. Hangs forever
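
When it hangs like this, the scheduler can still be queried for what each worker is doing; a small sketch using the standard distributed client API:

# Tasks currently running on each worker, and the data each worker holds:
print(client.processing())
print(client.has_what())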

Output from Jupyter
(screenshot)

Diagnostics
(screenshot)

@jameslamb (Collaborator)

Thanks very much for working with us on this. I think what you found in #4625 (comment) is consistent with the ongoing investigation happening over in #4229 (comment)... using LightGBM with libomp>=12.0.0 may lead to issues, so for now please try to stay on libomp 11.x.

You can subscribe to #4229 to track and contribute to the investigation of this issue.

@orcahmlee (Author)

> Thanks very much for working with us on this. I think what you found in #4625 (comment) is consistent with the ongoing investigation happening over in #4229 (comment)... using LightGBM with libomp>=12.0.0 may lead to issues, so for now please try to stay on libomp 11.x.
>
> You can subscribe to #4229 to track and contribute to the investigation of this issue.

Thanks so much.

@jameslamb (Collaborator)

It seems like this issue can be closed, since the root cause was the ongoing issues with newer versions of libomp.

Thanks very much for the thorough reports @orcahmlee !

@orcahmlee (Author)

> It seems like this issue can be closed, since the root cause was the ongoing issues with newer versions of libomp.
>
> Thanks very much for the thorough reports @orcahmlee !

Thanks for your contribution.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023