
[BUG] Consider disabling managed memory in cudf.pandas on WSL2 #16551

Closed
vyasr opened this issue Aug 13, 2024 · 4 comments · Fixed by #16552
Labels
bug Something isn't working

Comments


vyasr commented Aug 13, 2024

Describe the bug
cudf.pandas turns on a managed pool allocator by default to support larger-than-memory workloads. However, this does not work on WSL2 because UVM on Windows does not actually allow oversubscription. Moreover, using UVM could result in far worse slowdowns on WSL2 than observed on Windows due to how it is implemented on that platform.

Expected behavior
We should consider changing cudf.pandas to only enable managed memory by default when oversubscription is properly supported. This can be done by querying the CUDA driver for the appropriate device attribute. In addition, we should run some benchmarks to evaluate the relative performance impact of using managed memory on WSL2 in undersubscribed situations.
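
For illustration, a minimal sketch of such a check using the cuda-python runtime bindings (the actual query in cudf/rmm may be implemented differently; concurrent_managed_access_supported is a hypothetical helper name):

from cuda import cudart

def concurrent_managed_access_supported(device: int = 0) -> bool:
    # cudaDevAttrConcurrentManagedAccess reports whether the CPU and GPU can
    # access managed memory concurrently; it is 0 on Windows and WSL2, where
    # UVM oversubscription is not available.
    err, value = cudart.cudaDeviceGetAttribute(
        cudart.cudaDeviceAttr.cudaDevAttrConcurrentManagedAccess, device
    )
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaDeviceGetAttribute failed: {err}")
    return bool(value)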

vyasr added the bug label Aug 13, 2024

bdice commented Aug 14, 2024

I can confirm that cudf.pandas currently fails on WSL2:

import cudf.pandas
cudf.pandas.install()  # Enables managed memory and prefetching
cudf.Series([1, 2, 3])  # Fails!

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/coder/cudf/python/cudf/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/coder/cudf/python/cudf/cudf/core/series.py", line 656, in __init__
    column = as_column(
             ^^^^^^^^^^
  File "/home/coder/cudf/python/cudf/cudf/core/column/column.py", line 2241, in as_column
    return as_column(arbitrary, nan_as_null=nan_as_null, dtype=dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/coder/cudf/python/cudf/cudf/core/column/column.py", line 1868, in as_column
    col = ColumnBase.from_arrow(arbitrary)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/coder/cudf/python/cudf/cudf/core/column/column.py", line 364, in from_arrow
    result = libcudf.interop.from_arrow(data)[0]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/coder/.conda/envs/rapids/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "interop.pyx", line 162, in cudf._lib.interop.from_arrow
  File "/home/coder/.conda/envs/rapids/lib/python3.11/functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "interop.pyx", line 142, in cudf._lib.pylibcudf.interop._from_arrow_table
RuntimeError: CUDA error at: /home/coder/.conda/envs/rapids/include/rmm/prefetch.hpp:53: cudaErrorInvalidDevice invalid device ordinal

This means that all cudf.pandas calls will fall back to CPU, and cudf.pandas is effectively just pandas on WSL2. This affects the 24.08 release, too.

In #16552, I have a fix. It restores the behavior of prior releases: it uses a normal pool resource rather than a managed pool and does not enable prefetching on WSL2 (it detects whether concurrent managed access between the CPU and GPU is supported).
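
For context, roughly how that selection could look with RMM's Python API (illustrative only; the actual change is in #16552, and make_default_resource is a hypothetical helper):

import rmm

def make_default_resource(managed_ok: bool) -> rmm.mr.PoolMemoryResource:
    # Build a managed (UVM) pool only when oversubscription is actually
    # supported; otherwise fall back to an ordinary device-memory pool,
    # matching the behavior of prior releases.
    upstream = rmm.mr.ManagedMemoryResource() if managed_ok else rmm.mr.CudaMemoryResource()
    return rmm.mr.PoolMemoryResource(upstream)

# e.g. on WSL2 (no concurrent managed access), use the plain pool:
rmm.mr.set_current_device_resource(make_default_resource(managed_ok=False))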


bdice commented Aug 14, 2024

@vyasr Given that cudf.pandas is broken (CPU only) on WSL2 without #16552, should we consider a hotfix for 24.08?

bdice self-assigned this Aug 14, 2024

bdice commented Aug 14, 2024

The CUDA docs state:

If dstDevice is a GPU, then the device attribute cudaDevAttrConcurrentManagedAccess must be non-zero.

I suspect that's why we get a cudaErrorInvalidDevice here. In RMM, we already handle the case where attempting to prefetch non-managed memory returns cudaErrorInvalidValue. Should we add similar logic to ignore errors from prefetching on devices that do not support managed memory? That would make the RMM API always succeed, behaving as "try to prefetch if possible." Or should we instead require developers to skip all prefetching code when managed memory is not supported? (That is what I implemented in #16552: I do not enable the experimental prefetching options if managed memory is not supported.)
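
As an illustration of the "try to prefetch if possible" option (this is not RMM's actual API; try_prefetch is a hypothetical helper sketched with the cuda-python runtime bindings):

from cuda import cudart

def try_prefetch(ptr: int, size: int, device: int = 0, stream: int = 0) -> None:
    # Best-effort prefetch: silently skip when the device or the allocation
    # does not support it, instead of raising.
    (err,) = cudart.cudaMemPrefetchAsync(ptr, size, device, stream)
    if err in (
        cudart.cudaError_t.cudaErrorInvalidValue,   # non-managed memory
        cudart.cudaError_t.cudaErrorInvalidDevice,  # no concurrent managed access (e.g. WSL2)
    ):
        return
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaMemPrefetchAsync failed: {err}")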


vyasr commented Aug 14, 2024

We had more discussion offline, so summarizing here:

  • Yes, we will be hotfixing 24.08.
  • We're not going to make any changes to rmm/prefetching internals in the hotfix, just disable managed memory whenever we're on a system where it's not supported.
  • We'll consider more updates to improve testing of this kind of issue in 24.10.
