CUDA support in PyTorch broken after update to 1.5.9 stable #1376

Closed
sbrunk opened this issue Jun 23, 2023 · 6 comments

sbrunk commented Jun 23, 2023

I'm using cuda-platform-redist to avoid depending on a system-installed CUDA, which was working great with the 1.5.9-SNAPSHOT versions and CUDA 11.8-8.6. Now, after upgrading to 1.5.9 stable in conjunction with the CUDA update to 12.1-8.9, something seems to be missing:

[W interface.cpp:47] Warning: Loading nvfuser library failed with: Error in dlopen: libcusolver.so.11: cannot open shared object file: No such file or directory (function LoadingNvfuserLibrary)
...
pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libjnitorch.so: libcusolver.so.11: cannot open shared object file: No such file or directory

I've tried going back to cuda 11.8-8.6-1.5.8 (only the native libs via the classifier, to avoid javacpp-1.5.8 being pulled in), but I got other linking errors due to missing CUDA 12 libs, which suggests libtorch is now built against CUDA 12:

pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libjnitorch.so: libcudart.so.12: cannot open shared object file: No such file or directory

Any ideas?

sbrunk commented Jun 23, 2023

OK, this is interesting: libcusolver.so.11 is part of the cuda jar, but it isn't copied into the JavaCPP cache dir. It seems to work when I copy the lib manually into the PyTorch native libs directory inside the cache.

Still waiting for the CUDA kernels to compile, but previously it errored out immediately...

Any idea why it is not extracted like the other files?

saudet commented Jun 24, 2023

Looks like I forgot to add a line for cusolver@.11 in the presets.
@HGuillemet Please apply something like this in your pull request.

--- a/pytorch/src/main/java/org/bytedeco/pytorch/presets/torch.java
+++ b/pytorch/src/main/java/org/bytedeco/pytorch/presets/torch.java
@@ -1816,6 +1816,7 @@ public class torch implements LoadEnabled, InfoMapper {
                      : lib.equals("nvinfer") ? "@.8"
                      : lib.equals("cufft") ? "@.11"
                      : lib.equals("curand") ? "@.10"
+                     : lib.equals("cusolver") ? "@.11"
                      : lib.equals("cudart") ? "@.12"
                      : lib.equals("nvrtc") ? "@.12"
                      : lib.equals("nvJitLink") ? "@.12"
@@ -1827,6 +1828,7 @@ public class torch implements LoadEnabled, InfoMapper {
                      : lib.equals("nvinfer") ? "64_8"
                      : lib.equals("cufft") ? "64_11"
                      : lib.equals("curand") ? "64_10"
+                     : lib.equals("cusolver") ? "64_11"
                      : lib.equals("cudart") ? "64_12"
                      : lib.equals("nvrtc") ? "64_120_0"
                      : lib.equals("nvJitLink") ? "64_120_0"

@sbrunk In the meantime, we can work around that by calling Loader.load(cusolver.class).
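
A minimal sketch of that workaround (not from the thread itself): loading cusolver via reflection, so the same code also compiles and runs on CPU-only builds where the org.bytedeco.cuda artifact is absent. The class and method names assume the standard JavaCPP API; the reflective indirection is my own addition.

```java
import java.lang.reflect.Method;

public class CusolverWorkaround {
    /** Returns true if the cusolver bindings were found and loaded. */
    public static boolean loadCusolverIfPresent() {
        try {
            Class<?> cusolver = Class.forName("org.bytedeco.cuda.global.cusolver");
            // Equivalent to Loader.load(cusolver.class), invoked reflectively
            // so this class compiles without a compile-time CUDA dependency.
            Class<?> loader = Class.forName("org.bytedeco.javacpp.Loader");
            Method load = loader.getMethod("load", Class.class);
            load.invoke(null, cusolver);
            return true;
        } catch (ReflectiveOperationException e) {
            return false; // CPU-only build: cusolver not on the classpath
        }
    }

    public static void main(String[] args) {
        System.out.println("cusolver loaded: " + loadCusolverIfPresent());
    }
}
```

With the cuda artifacts on the classpath this triggers extraction of libcusolver.so.11 into the cache before any tensor operation; without them it is a no-op.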

sbrunk commented Jun 26, 2023

Thanks @saudet for looking into it and for the workaround.

For reference, the workaround needs cuda-platform/cuda in addition to cuda-redist to have cusolver on the classpath. It also wants to load native bindings like libjnicudart.

Also, since I have to support CPU-only builds as well, cusolver might not be available in my case, so I'm checking whether it's on the classpath:

  import org.bytedeco.javacpp.Loader

  try {
    // Load cusolver explicitly; on CPU-only builds the class isn't on the classpath.
    val cusolver = Class.forName("org.bytedeco.cuda.global.cusolver")
    Loader.load(cusolver)
  } catch {
    case e: ClassNotFoundException => // ignore to avoid breaking CPU-only builds
  }

The only downside is that I need to ensure this runs before any tensor operations. I guess the only way to avoid that is to have a patched presets/torch.java that reliably triggers JavaCPP's loading magic, right?

saudet commented Jun 26, 2023

You don't have a "common/utils/whatever" class in which you can put stuff like that in a static { } block...?

sbrunk commented Jun 26, 2023

I do, but I wasn't able to find a way to always trigger loading of that utils class containing the static block before making native calls, due to a combination of my use of Scala top-level methods and Scala's poor interop with static methods.

I've refactored my code now in a way that should trigger the cusolver loading reliably before calling any native code.

sbrunk commented Jul 26, 2023

This is now fixed via #1360

sbrunk closed this as completed Jul 26, 2023