Re-enable NVML monitoring for WSL #6119
Conversation
Looks good overall @charlesbluca, I've added a few suggestions.
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
LGTM, thanks @charlesbluca!
@quasiben @jakirkham could one of you take a look/merge?
Looks good. I'll plan to merge once CI finishes.
Unit Test Results: 16 files ±0, 16 suites ±0, 7h 37m 35s ⏱️ (-7m 48s). For more details on these failures, see this check. Results for commit 5c4e74a. ± Comparison against base commit cdbb426. ♻️ This comment has been updated with latest results.
Should we add a test somewhere? I know we don't have WSL in CI, but maybe it is still useful for debugging issues later.
Sure! How does a gpuCI test checking the value of these NVML state flags sound?
@charlesbluca I think that would be great!
async def test_wsl_monitoring_enabled(s, a, b):
    assert nvml.nvmlInitialized is True
    assert nvml.nvmlWslInsufficientDriver is False
We could limit this test to only run on WSL2, but I figure it's probably good to run it everywhere to ensure that _in_wsl is working as expected.
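For context, a minimal sketch of how this kind of WSL detection can work (illustrative only; the actual _in_wsl implementation in distributed may differ):

    import platform

    def _in_wsl() -> bool:
        # WSL kernels report a release string such as
        # "5.10.102.1-microsoft-standard-WSL2", so checking for the
        # "microsoft" marker is a common detection heuristic.
        return "microsoft" in platform.uname().release.lower()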
Agree. Also, having a test that can run everywhere (including w/o WSL) is more valuable.
Seeing the following error on CI:

___________________________ test_enable_disable_nvml ___________________________

    def test_enable_disable_nvml():
        try:
            pynvml.nvmlShutdown()
        except pynvml.NVMLError_Uninitialized:
            pass
        else:
            nvml.nvmlInitialized = False

        with dask.config.set({"distributed.diagnostics.nvml": False}):
            nvml.init_once()
            assert nvml.nvmlInitialized is False

        with dask.config.set({"distributed.diagnostics.nvml": True}):
            nvml.init_once()
>           assert nvml.nvmlInitialized is True
E           assert False is True
E           +  where False = nvml.nvmlInitialized

distributed/diagnostics/tests/test_nvml.py:41: AssertionError

_________________________ test_wsl_monitoring_enabled __________________________

s = <Scheduler 'tcp://127.0.0.1:41477', workers: 0, cores: 0, tasks: 0>
a = <Worker 'tcp://127.0.0.1:45201', name: 0, status: closed, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>
b = <Worker 'tcp://127.0.0.1:45247', name: 1, status: closed, stored: 0, running: 0/2, ready: 0, comm: 0, waiting: 0>

    @gen_cluster()
    async def test_wsl_monitoring_enabled(s, a, b):
>       assert nvml.nvmlInitialized is True
E       assert False is True
E       +  where False = nvml.nvmlInitialized

distributed/diagnostics/tests/test_nvml.py:157: AssertionError

Maybe we need to generalize the test a bit?
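By way of illustration, one way the test could be generalized so it also passes on machines where NVML initialization is legitimately skipped (a hedged sketch, not necessarily the fix that landed in this PR):

    from distributed.diagnostics import nvml
    from distributed.utils_test import gen_cluster

    @gen_cluster()
    async def test_wsl_monitoring_enabled(s, a, b):
        # Assert the flags are mutually consistent instead of
        # hard-coding nvmlInitialized is True: a successful init
        # implies the WSL driver was not flagged as insufficient.
        if nvml.nvmlInitialized:
            assert nvml.nvmlWslInsufficientDriver is False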
Ah, I understand the issue here - this is a consequence of my moving the line that sets nvml.nvmlInitialized to after the new WSL driver check. Looking through the code, this seemed like a sensible change, as it didn't really make sense to me to set nvml.nvmlInitialized when initialization hadn't actually succeeded. I'll also note that this does make this a breaking change for any downstream libraries that were depending on the old behavior of nvml.nvmlInitialized. I think the best course of action here, if we opt to keep this new behavior for init_once, is to generalize the tests accordingly.
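To make the ordering concrete, here is a rough sketch of the init flow being described (the flag ordering matches the thread; the constant name and version parsing are illustrative assumptions, not the PR's exact code):

    import platform

    import pynvml

    nvmlInitialized = False
    nvmlWslInsufficientDriver = False
    MINIMUM_WSL_DRIVER = (512, 15)  # illustrative constant

    def init_once():
        global nvmlInitialized, nvmlWslInsufficientDriver
        if nvmlInitialized:
            return
        pynvml.nvmlInit()
        if "microsoft" in platform.uname().release.lower():
            version = pynvml.nvmlSystemGetDriverVersion()
            if isinstance(version, bytes):
                version = version.decode()
            major, minor = (int(p) for p in version.split(".")[:2])
            if (major, minor) < MINIMUM_WSL_DRIVER:
                nvmlWslInsufficientDriver = True
                raise RuntimeError(
                    "NVML monitoring on WSL2 requires driver >= 512.15"
                )
        # Success is recorded only after the checks pass, so a True
        # nvmlInitialized now implies NVML is actually usable.
        nvmlInitialized = True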
I'm now wondering if we should really do that. We're effectively forcing all WSL2 users to upgrade to 512.15 or higher, right? Perhaps instead we should allow old drivers to still work but mark nvml.nvmlWslInsufficientDriver accordingly.
Not actually - when a RuntimeError is raised due to insufficient WSL2 system requirements during initialization, it is caught here and NVML monitoring is disabled: distributed/worker.py, lines 4688 to 4692 at commit 6daf3bf.
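That guard looks roughly like the following (a paraphrase of the referenced worker.py lines, with an illustrative variable name, not a verbatim copy):

    from distributed.diagnostics import nvml

    try:
        nvml.init_once()
    except RuntimeError:
        # Insufficient WSL2 driver (or similar): fall back to running
        # without NVML-based GPU monitoring instead of failing.
        nvml_monitoring = False  # illustrative flag
    else:
        nvml_monitoring = nvml.nvmlInitialized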
These errors only pop up directly if a user tries to grab the device handles explicitly.
Got it, thanks for reminding me. Yes, I agree then; changing the test seems like the sensible approach here. Please feel free to do so when you have the chance.
Think things should be good now testing-wise.
LGTM, thanks @charlesbluca.
@jakirkham I think we're good to merge if no other concerns exist.
Thanks all! 😄
After trying out NVML monitoring on WSL2 with the latest NVIDIA drivers, it looks like a lot of the issues encountered before are no longer occurring:

- nvmlDeviceGetComputeRunningProcesses
- nvmlDeviceGetUtilizationRates

This PR re-enables NVML monitoring on WSL2, with the caveat that the NVIDIA driver version must be at or above the latest version as of now (512.15); if it isn't, monitoring will be disabled as was the case before. This should hopefully reduce the number of issues opened around WSL2 that boil down to outdated drivers.
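For WSL2 users wondering whether they meet that bar, a quick self-check along these lines should work (a sketch; not part of the PR itself):

    import pynvml

    pynvml.nvmlInit()
    version = pynvml.nvmlSystemGetDriverVersion()
    if isinstance(version, bytes):  # some pynvml versions return bytes
        version = version.decode()
    major, minor = (int(p) for p in version.split(".")[:2])
    ok = (major, minor) >= (512, 15)
    print(version, "supported" if ok else "too old for NVML monitoring on WSL2")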
cc @pentschev
- Passes pre-commit run --all-files