Re-enable NVML monitoring for WSL #6119

Merged · 11 commits merged into dask:main on May 4, 2022

Conversation

@charlesbluca (Member) opened this pull request:

After trying out NVML monitoring on WSL2 with the latest NVIDIA drivers, it looks like many of the issues encountered before no longer occur (a short snippet exercising these calls is shown after the list):

  • we are now able to query active processes using nvmlDeviceGetComputeRunningProcesses
  • we are now able to query GPU utilization using nvmlDeviceGetUtilizationRates
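
For illustration, here is a minimal pynvml snippet exercising both of the calls above; this is a hedged sketch of the underlying API usage, not code from this PR:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Both of these previously failed on WSL2; with recent drivers they succeed
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(len(procs), util.gpu, util.memory)

    pynvml.nvmlShutdown()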

This PR re-enables NVML monitoring on WSL2, with the caveat that the NVIDIA driver version must be at or above the latest version as of writing (512.15); if it isn't, monitoring is disabled as it was before. This should hopefully reduce the number of issues opened around WSL2 that boil down to outdated drivers.
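
For a concrete picture, here is a minimal sketch of how such a driver-version gate could look; the constant MINIMUM_WSL_VERSION and the helper name are assumptions for illustration, not necessarily the code in this PR:

    import pynvml

    # Assumed minimum driver version for NVML monitoring on WSL2 (from the PR text)
    MINIMUM_WSL_VERSION = (512, 15)

    def _wsl_driver_sufficient():
        # Requires pynvml.nvmlInit() to have been called already
        version = pynvml.nvmlSystemGetDriverVersion()
        if isinstance(version, bytes):  # older pynvml releases return bytes
            version = version.decode()
        # Compare version components numerically, e.g. "512.15" -> (512, 15)
        return tuple(int(p) for p in version.split(".")) >= MINIMUM_WSL_VERSION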

cc @pentschev

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

@pentschev (Member) left a review:

Looks good overall @charlesbluca; I've added a few suggestions.

Review suggestions on distributed/diagnostics/nvml.py (4 threads, all resolved).
charlesbluca and others added 2 commits April 13, 2022 13:14
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>

@pentschev (Member) left a review:

LGTM, thanks @charlesbluca!

@pentschev (Member):

@quasiben @jakirkham could one of you take a look/merge?

@quasiben (Member):

Looks good. I'll plan to merge once CI finishes.

@github-actions bot (Contributor) commented on Apr 14, 2022:

Unit Test Results

    16 files ±0    16 suites ±0    7h 37m 35s ⏱️ (-7m 48s)
    2,744 tests +3:   2,662 passed +2,   80 skipped ±0,   2 failed +1
    21,832 runs +19:  20,795 passed +16,  1,035 skipped +2,  2 failed +1

For more details on these failures, see this check.

Results for commit 5c4e74a. Comparison against base commit cdbb426.

♻️ This comment has been updated with latest results.

@jakirkham (Member):

Should we add a test somewhere? I know we don't have WSL in CI, but maybe it would still be useful for debugging issues later.

@charlesbluca (Member, Author):

Sure! How does a gpuCI test checking the value of nvmlWslInsufficientDriver sound?

@quasiben (Member):

@charlesbluca I think that would be great!

Comment on lines 156 to 158
@gen_cluster()
async def test_wsl_monitoring_enabled(s, a, b):
    assert nvml.nvmlInitialized is True
    assert nvml.nvmlWslInsufficientDriver is False

@charlesbluca (Member, Author) commented:

We could limit this test to only run on WSL2, but I figure it's probably good to run it everywhere to ensure that _in_wsl is working as expected.
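
For context, a helper like _in_wsl is commonly implemented by inspecting the kernel release string, which contains "microsoft" under WSL; this is a minimal sketch, and the PR's actual implementation may differ:

    import platform

    def _in_wsl():
        # WSL kernels report "microsoft" in their release string,
        # e.g. "5.10.102.1-microsoft-standard-WSL2"
        return "microsoft" in platform.uname().release.lower()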

A reviewer (Member) replied:

Agreed. Also, having a test that can run everywhere (including without WSL) is more valuable.

@jakirkham (Member):

Seeing the following error on CI:

___________________________ test_enable_disable_nvml ___________________________

    def test_enable_disable_nvml():
        try:
            pynvml.nvmlShutdown()
        except pynvml.NVMLError_Uninitialized:
            pass
        else:
            nvml.nvmlInitialized = False
    
        with dask.config.set({"distributed.diagnostics.nvml": False}):
            nvml.init_once()
            assert nvml.nvmlInitialized is False
    
        with dask.config.set({"distributed.diagnostics.nvml": True}):
            nvml.init_once()
>           assert nvml.nvmlInitialized is True
E           assert False is True
E            +  where False = nvml.nvmlInitialized

distributed/diagnostics/tests/test_nvml.py:41: AssertionError
_________________________ test_wsl_monitoring_enabled __________________________

s = <Scheduler 'tcp://127.0.0.1:41477', workers: 0, cores: 0, tasks: 0>
a = <Worker 'tcp://127.0.0.1:45201', name: 0, status: closed, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>
b = <Worker 'tcp://127.0.0.1:45247', name: 1, status: closed, stored: 0, running: 0/2, ready: 0, comm: 0, waiting: 0>

    @gen_cluster()
    async def test_wsl_monitoring_enabled(s, a, b):
>       assert nvml.nvmlInitialized is True
E       assert False is True
E        +  where False = nvml.nvmlInitialized

distributed/diagnostics/tests/test_nvml.py:157: AssertionError

Maybe we need to generalize the test a bit?

@charlesbluca (Member, Author):

Ah, I understand the issue here - this is a consequence of moving the line that sets nvmlInitialized to True to after we actually call pynvml.nvmlInit(). That call fails on the GPU runners since they don't have NVML libraries installed, which cancels the initialization process.

Looking through the code, this seemed like a sensible change: it didn't really make sense to me to set nvmlInitialized to True if the actual nvmlInit call failed, and it looked like we were mostly doing that to prevent init_once from running in its entirety on subsequent calls. I ended up handling that instead by expanding the checks we do before attempting to call nvmlInit (we now also check the values of nvmlLibraryNotFound and nvmlWslInsufficientDriver). @pentschev, not sure if you have any insights here.

I'll also note that this makes it a breaking change for any downstream libraries that were depending on nvmlInitialized (though I imagine there are few, if any).

If we opt to keep this new behavior for nvmlInitialized, I think the best course of action is to modify the impacted tests to handle NVML failure conditions more robustly - for example, test_enable_disable_nvml should check that, when NVML monitoring is enabled, exactly one of nvmlInitialized, nvmlLibraryNotFound, and nvmlWslInsufficientDriver is True.
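
To make the control flow concrete, here is a rough sketch of the init_once logic described above; the flag names come from this discussion, while the helpers _in_wsl and _wsl_driver_sufficient (sketched earlier in this thread) are assumptions rather than the PR's exact code:

    import dask
    import pynvml

    nvmlInitialized = False
    nvmlLibraryNotFound = False
    nvmlWslInsufficientDriver = False

    def init_once():
        global nvmlInitialized, nvmlLibraryNotFound, nvmlWslInsufficientDriver

        # Skip if we already initialized, or if a previous attempt showed that
        # initialization cannot succeed on this system.
        if nvmlInitialized or nvmlLibraryNotFound or nvmlWslInsufficientDriver:
            return
        if dask.config.get("distributed.diagnostics.nvml") is False:
            return

        try:
            pynvml.nvmlInit()
        except pynvml.NVMLError_LibraryNotFound:
            nvmlLibraryNotFound = True
            return

        # On WSL, require a recent enough driver before enabling monitoring
        if _in_wsl() and not _wsl_driver_sufficient():
            nvmlWslInsufficientDriver = True
            return

        # Only mark NVML as initialized once nvmlInit() actually succeeded
        nvmlInitialized = True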

@pentschev (Member):

> Looking through the code, this seemed like a sensible change: it didn't really make sense to me to set nvmlInitialized to True if the actual nvmlInit call failed, and it looked like we were mostly doing that to prevent init_once from running in its entirety on subsequent calls. I ended up handling that instead by expanding the checks we do before attempting to call nvmlInit (we now also check the values of nvmlLibraryNotFound and nvmlWslInsufficientDriver). @pentschev, not sure if you have any insights here.

I'm now wondering if we should really do that. We're effectively forcing all WSL2 users to upgrade to 512.15 or higher, right? Perhaps instead we should allow old drivers to still work, but mark nvmlInitialized = False if the driver version is insufficient. I'm thinking we may have people using Dask with WSL2 who can't immediately upgrade, so we could print a warning telling them to upgrade to enable NVML monitoring.

@charlesbluca (Member, Author) commented on Apr 15, 2022:

> We're effectively forcing all WSL2 users to upgrade to 512.15 or higher, right?

Not quite - when a RuntimeError is raised due to insufficient WSL2 system requirements during initialization, it is caught here and NVML monitoring is disabled:

try:
    if nvml.device_get_count() < 1:
        raise RuntimeError
except (Exception, RuntimeError):
    pass

These errors only pop up directly if a user tries to grab the device handles by explicitly calling _pynvml_handles. This means that, for a user on WSL2 with outdated drivers, failures would only show up when running the NVML tests - which I think is what we want here (i.e. some kind of test failure to indicate that a WSL2 setup is not what we're expecting)?
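
As a hypothetical illustration of where such an error could surface, _pynvml_handles might look roughly like this, reusing the flags and helpers sketched above; the error message and structure are assumptions, not the PR's exact code:

    def _pynvml_handles():
        init_once()
        if nvmlWslInsufficientDriver:
            # Surfaces only to callers that explicitly request device handles
            raise RuntimeError(
                "NVML monitoring on WSL2 requires NVIDIA driver 512.15 or newer"
            )
        if pynvml.nvmlDeviceGetCount() < 1:
            raise RuntimeError("No GPUs available")
        return pynvml.nvmlDeviceGetHandleByIndex(0)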

@pentschev (Member):

> These errors only pop up directly if a user tries to grab the device handles by explicitly calling _pynvml_handles. This means that, for a user on WSL2 with outdated drivers, failures would only show up when running the NVML tests - which I think is what we want here (i.e. some kind of test failure to indicate that a WSL2 setup is not what we're expecting)?

Got it, thanks for reminding me. Yes, I agree then - changing the test seems like the sensible approach here. Please feel free to do so when you have the chance.

@charlesbluca (Member, Author):

Think things should be good now testing-wise:

  • test_enable_disable_nvml now generally checks that exactly one of the NVML init flags has been set to True after attempting initialization (a sketch of this check is shown below)
  • test_wsl_monitoring_enabled now just checks that nvmlWslInsufficientDriver is False, so that we have a failing test if a GPU-enabled WSL setup has outdated drivers.
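
As a hedged sketch of what the first check could look like (not necessarily the exact test that landed):

    import dask
    from distributed.diagnostics import nvml

    def test_enable_disable_nvml():
        with dask.config.set({"distributed.diagnostics.nvml": True}):
            nvml.init_once()
            # Exactly one outcome flag should be set after an initialization attempt
            flags = [
                nvml.nvmlInitialized,
                nvml.nvmlLibraryNotFound,
                nvml.nvmlWslInsufficientDriver,
            ]
            assert sum(flags) == 1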

@pentschev (Member) left a review:

LGTM, thanks @charlesbluca.

@jakirkham I think we're good to merge if no other concerns exist.

@jakirkham jakirkham merged commit baf05c0 into dask:main May 4, 2022

@jakirkham (Member):

Thanks all! 😄

@charlesbluca deleted the enable-wsl-monitoring branch on July 20, 2022.