Re-enable NVML monitoring for WSL #6119

Merged · 11 commits merged into dask:main on May 4, 2022

Conversation

@charlesbluca (Member) opened this pull request:

After trying out NVML monitoring on WSL2 with the latest NVIDIA drivers, it looks like many of the issues encountered before no longer occur (a short snippet exercising these calls is shown after the list):

  • we are now able to query active processes using nvmlDeviceGetComputeRunningProcesses
  • we are now able to query GPU utilization using nvmlDeviceGetUtilizationRates
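
For illustration, here is a minimal pynvml snippet exercising both of the calls above; this is a hedged sketch of the underlying API usage, not code from this PR:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Both of these previously failed on WSL2; with recent drivers they succeed
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(len(procs), util.gpu, util.memory)

    pynvml.nvmlShutdown()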

This PR re-enables NVML monitoring on WSL2, with the caveat that the NVIDIA driver version must be at or above the latest version as of writing (512.15); if it isn't, monitoring is disabled as it was before. This should hopefully reduce the number of issues opened around WSL2 that boil down to outdated drivers.
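
For a concrete picture, here is a minimal sketch of how such a driver-version gate could look; the constant MINIMUM_WSL_VERSION and the helper name are assumptions for illustration, not necessarily the code in this PR:

    import pynvml

    # Assumed minimum driver version for NVML monitoring on WSL2 (from the PR text)
    MINIMUM_WSL_VERSION = (512, 15)

    def _wsl_driver_sufficient():
        # Requires pynvml.nvmlInit() to have been called already
        version = pynvml.nvmlSystemGetDriverVersion()
        if isinstance(version, bytes):  # older pynvml releases return bytes
            version = version.decode()
        # Compare version components numerically, e.g. "512.15" -> (512, 15)
        return tuple(int(p) for p in version.split(".")) >= MINIMUM_WSL_VERSION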

cc @pentschev

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

@pentschev (Member) left a review:

Looks good overall @charlesbluca; I've added a few suggestions.

Review suggestions on distributed/diagnostics/nvml.py (4 threads, all resolved).
charlesbluca and others added 2 commits April 13, 2022 13:14
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>

@pentschev (Member) left a review:

LGTM, thanks @charlesbluca!

@pentschev (Member):

@quasiben @jakirkham could one of you take a look/merge?

@quasiben (Member):

Looks good. I'll plan to merge once CI finishes.

@github-actions bot (Contributor) commented on Apr 14, 2022:

Unit Test Results

    16 files ±0    16 suites ±0    7h 37m 35s ⏱️ (-7m 48s)
    2,744 tests +3:   2,662 passed +2,   80 skipped ±0,   2 failed +1
    21,832 runs +19:  20,795 passed +16,  1,035 skipped +2,  2 failed +1

For more details on these failures, see this check.

Results for commit 5c4e74a. Comparison against base commit cdbb426.

♻️ This comment has been updated with latest results.

@jakirkham (Member):

Should we add a test somewhere? I know we don't have WSL in CI, but maybe it would still be useful for debugging issues later.

@charlesbluca (Member, Author):

Sure! How does a gpuCI test checking the value of nvmlWslInsufficientDriver sound?

@quasiben (Member):

@charlesbluca I think that would be great!

Comment on lines 156 to 158
@gen_cluster()
async def test_wsl_monitoring_enabled(s, a, b):
    assert nvml.nvmlInitialized is True
    assert nvml.nvmlWslInsufficientDriver is False

@charlesbluca (Member, Author) commented:

We could limit this test to only run on WSL2, but I figure it's probably good to run it everywhere to ensure that _in_wsl is working as expected.
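
For context, a helper like _in_wsl is commonly implemented by inspecting the kernel release string, which contains "microsoft" under WSL; this is a minimal sketch, and the PR's actual implementation may differ:

    import platform

    def _in_wsl():
        # WSL kernels report "microsoft" in their release string,
        # e.g. "5.10.102.1-microsoft-standard-WSL2"
        return "microsoft" in platform.uname().release.lower()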

A reviewer (Member) replied:

Agreed. Also, having a test that can run everywhere (including without WSL) is more valuable.

@jakirkham (Member):

Seeing the following error on CI:

___________________________ test_enable_disable_nvml ___________________________

    def test_enable_disable_nvml():
        try:
            pynvml.nvmlShutdown()
        except pynvml.NVMLError_Uninitialized:
            pass
        else:
            nvml.nvmlInitialized = False
    
        with dask.config.set({"distributed.diagnostics.nvml": False}):
            nvml.init_once()
            assert nvml.nvmlInitialized is False
    
        with dask.config.set({"distributed.diagnostics.nvml": True}):
            nvml.init_once()
>           assert nvml.nvmlInitialized is True
E           assert False is True
E            +  where False = nvml.nvmlInitialized

distributed/diagnostics/tests/test_nvml.py:41: AssertionError
_________________________ test_wsl_monitoring_enabled __________________________

s = <Scheduler 'tcp://127.0.0.1:41477', workers: 0, cores: 0, tasks: 0>
a = <Worker 'tcp://127.0.0.1:45201', name: 0, status: closed, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>
b = <Worker 'tcp://127.0.0.1:45247', name: 1, status: closed, stored: 0, running: 0/2, ready: 0, comm: 0, waiting: 0>

    @gen_cluster()
    async def test_wsl_monitoring_enabled(s, a, b):
>       assert nvml.nvmlInitialized is True
E       assert False is True
E        +  where False = nvml.nvmlInitialized

distributed/diagnostics/tests/test_nvml.py:157: AssertionError

Maybe we need to generalize the test a bit?

@charlesbluca (Member, Author):

Ah, I understand the issue here - this is a consequence of moving the line that sets nvmlInitialized to True to after we actually call pynvml.nvmlInit(). That call fails on the GPU runners since they don't have NVML libraries installed, which cancels the initialization process.

Looking through the code, this seemed like a sensible change: it didn't really make sense to me to set nvmlInitialized to True if the actual nvmlInit call failed, and it looked like we were mostly doing that to prevent init_once from running in its entirety on subsequent calls. I ended up handling that instead by expanding the checks we do before attempting to call nvmlInit (we now also check the values of nvmlLibraryNotFound and nvmlWslInsufficientDriver). @pentschev, not sure if you have any insights here.

I'll also note that this makes it a breaking change for any downstream libraries that were depending on nvmlInitialized (though I imagine there are few, if any).

If we opt to keep this new behavior for nvmlInitialized, I think the best course of action is to modify the impacted tests to handle NVML failure conditions more robustly - for example, test_enable_disable_nvml should check that, when NVML monitoring is enabled, exactly one of nvmlInitialized, nvmlLibraryNotFound, and nvmlWslInsufficientDriver is True.
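
To make the control flow concrete, here is a rough sketch of the init_once logic described above; the flag names come from this discussion, while the helpers _in_wsl and _wsl_driver_sufficient (sketched earlier in this thread) are assumptions rather than the PR's exact code:

    import dask
    import pynvml

    nvmlInitialized = False
    nvmlLibraryNotFound = False
    nvmlWslInsufficientDriver = False

    def init_once():
        global nvmlInitialized, nvmlLibraryNotFound, nvmlWslInsufficientDriver

        # Skip if we already initialized, or if a previous attempt showed that
        # initialization cannot succeed on this system.
        if nvmlInitialized or nvmlLibraryNotFound or nvmlWslInsufficientDriver:
            return
        if dask.config.get("distributed.diagnostics.nvml") is False:
            return

        try:
            pynvml.nvmlInit()
        except pynvml.NVMLError_LibraryNotFound:
            nvmlLibraryNotFound = True
            return

        # On WSL, require a recent enough driver before enabling monitoring
        if _in_wsl() and not _wsl_driver_sufficient():
            nvmlWslInsufficientDriver = True
            return

        # Only mark NVML as initialized once nvmlInit() actually succeeded
        nvmlInitialized = True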

@pentschev (Member):

> Looking through the code, this seemed like a sensible change: it didn't really make sense to me to set nvmlInitialized to True if the actual nvmlInit call failed, and it looked like we were mostly doing that to prevent init_once from running in its entirety on subsequent calls. I ended up handling that instead by expanding the checks we do before attempting to call nvmlInit (we now also check the values of nvmlLibraryNotFound and nvmlWslInsufficientDriver). @pentschev, not sure if you have any insights here.

I'm now wondering if we should really do that. We're effectively forcing all WSL2 users to upgrade to 512.15 or higher, right? Perhaps instead we should allow old drivers to still work, but mark nvmlInitialized = False if the driver version is insufficient. I'm thinking we may have people using Dask with WSL2 who can't immediately upgrade, so we could print a warning telling them to upgrade to enable NVML monitoring.

@charlesbluca (Member, Author) commented on Apr 15, 2022:

> We're effectively forcing all WSL2 users to upgrade to 512.15 or higher, right?

Not quite - when a RuntimeError is raised due to insufficient WSL2 system requirements during initialization, it is caught here and NVML monitoring is disabled:

try:
    if nvml.device_get_count() < 1:
        raise RuntimeError
except (Exception, RuntimeError):
    pass

These errors only pop up directly if a user tries to grab the device handles by explicitly calling _pynvml_handles. This means that, for a user on WSL2 with outdated drivers, failures would only show up when running the NVML tests - which I think is what we want here (i.e. some kind of test failure to indicate that a WSL2 setup is not what we're expecting)?
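
As a hypothetical illustration of where such an error could surface, _pynvml_handles might look roughly like this, reusing the flags and helpers sketched above; the error message and structure are assumptions, not the PR's exact code:

    def _pynvml_handles():
        init_once()
        if nvmlWslInsufficientDriver:
            # Surfaces only to callers that explicitly request device handles
            raise RuntimeError(
                "NVML monitoring on WSL2 requires NVIDIA driver 512.15 or newer"
            )
        if pynvml.nvmlDeviceGetCount() < 1:
            raise RuntimeError("No GPUs available")
        return pynvml.nvmlDeviceGetHandleByIndex(0)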

@pentschev (Member):

> These errors only pop up directly if a user tries to grab the device handles by explicitly calling _pynvml_handles. This means that, for a user on WSL2 with outdated drivers, failures would only show up when running the NVML tests - which I think is what we want here (i.e. some kind of test failure to indicate that a WSL2 setup is not what we're expecting)?

Got it, thanks for reminding me. Yes, I agree then - changing the test seems like the sensible approach here. Please feel free to do so when you have the chance.

@charlesbluca (Member, Author):

Think things should be good now testing-wise:

  • test_enable_disable_nvml now generally checks that exactly one of the NVML init flags has been set to True after attempting initialization (a sketch of this check is shown below)
  • test_wsl_monitoring_enabled now just checks that nvmlWslInsufficientDriver is False, so that we have a failing test if a GPU-enabled WSL setup has outdated drivers.
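
As a hedged sketch of what the first check could look like (not necessarily the exact test that landed):

    import dask
    from distributed.diagnostics import nvml

    def test_enable_disable_nvml():
        with dask.config.set({"distributed.diagnostics.nvml": True}):
            nvml.init_once()
            # Exactly one outcome flag should be set after an initialization attempt
            flags = [
                nvml.nvmlInitialized,
                nvml.nvmlLibraryNotFound,
                nvml.nvmlWslInsufficientDriver,
            ]
            assert sum(flags) == 1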

@pentschev (Member) left a review:

LGTM, thanks @charlesbluca.

@jakirkham I think we're good to merge if no other concerns exist.

@jakirkham jakirkham merged commit baf05c0 into dask:main May 4, 2022

@jakirkham (Member):

Thanks all! 😄

@charlesbluca deleted the enable-wsl-monitoring branch on July 20, 2022.