CUDA support in PyTorch broken after update to 1.5.9 stable #1376

Closed
sbrunk opened this issue Jun 23, 2023 · 6 comments

sbrunk commented Jun 23, 2023

I'm using cuda-platform-redist to avoid depending on a system-installed CUDA, which was working great with the 1.5.9-SNAPSHOT versions and CUDA 11.8-8.6. Now, after upgrading to 1.5.9 stable in conjunction with the CUDA update to 12.1-8.9, something seems to be missing:

[W interface.cpp:47] Warning: Loading nvfuser library failed with: Error in dlopen: libcusolver.so.11: cannot open shared object file: No such file or directory (function LoadingNvfuserLibrary)
...
pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libjnitorch.so: libcusolver.so.11: cannot open shared object file: No such file or directory

I've tried going back to cuda 11.8-8.6-1.5.8 (only the native libs via the classifier, to avoid javacpp-1.5.8 being pulled in), but I got other linking errors due to missing CUDA 12 libs, which suggests libtorch is now built against CUDA 12:

pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libjnitorch.so: libcudart.so.12: cannot open shared object file: No such file or directory

Any ideas?

sbrunk commented Jun 23, 2023

OK, this is interesting: libcusolver.so.11 is part of the cuda jar, but it isn't copied into the JavaCPP cache dir. It seems to work when I copy the lib manually into the PyTorch native libs directory inside the cache.

Still waiting for the CUDA kernels to compile, but previously it errored out immediately...

Any idea why it is not extracted like the other files?

saudet commented Jun 24, 2023

Looks like I forgot to add a line for cusolver@.11 in the presets.
@HGuillemet Please apply something like this in your pull request.

--- a/pytorch/src/main/java/org/bytedeco/pytorch/presets/torch.java
+++ b/pytorch/src/main/java/org/bytedeco/pytorch/presets/torch.java
@@ -1816,6 +1816,7 @@ public class torch implements LoadEnabled, InfoMapper {
                      : lib.equals("nvinfer") ? "@.8"
                      : lib.equals("cufft") ? "@.11"
                      : lib.equals("curand") ? "@.10"
+                     : lib.equals("cusolver") ? "@.11"
                      : lib.equals("cudart") ? "@.12"
                      : lib.equals("nvrtc") ? "@.12"
                      : lib.equals("nvJitLink") ? "@.12"
@@ -1827,6 +1828,7 @@ public class torch implements LoadEnabled, InfoMapper {
                      : lib.equals("nvinfer") ? "64_8"
                      : lib.equals("cufft") ? "64_11"
                      : lib.equals("curand") ? "64_10"
+                     : lib.equals("cusolver") ? "64_11"
                      : lib.equals("cudart") ? "64_12"
                      : lib.equals("nvrtc") ? "64_120_0"
                      : lib.equals("nvJitLink") ? "64_120_0"

@sbrunk In the meantime, we can work around that by calling Loader.load(cusolver.class).
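
A minimal sketch of that workaround (not from the thread itself): loading cusolver via reflection, so the same code also compiles and runs on CPU-only builds where the org.bytedeco.cuda artifact is absent. The class and method names assume the standard JavaCPP API; the reflective indirection is my own addition.

```java
import java.lang.reflect.Method;

public class CusolverWorkaround {
    /** Returns true if the cusolver bindings were found and loaded. */
    public static boolean loadCusolverIfPresent() {
        try {
            Class<?> cusolver = Class.forName("org.bytedeco.cuda.global.cusolver");
            // Equivalent to Loader.load(cusolver.class), invoked reflectively
            // so this class compiles without a compile-time CUDA dependency.
            Class<?> loader = Class.forName("org.bytedeco.javacpp.Loader");
            Method load = loader.getMethod("load", Class.class);
            load.invoke(null, cusolver);
            return true;
        } catch (ReflectiveOperationException e) {
            return false; // CPU-only build: cusolver not on the classpath
        }
    }

    public static void main(String[] args) {
        System.out.println("cusolver loaded: " + loadCusolverIfPresent());
    }
}
```

With the cuda artifacts on the classpath this triggers extraction of libcusolver.so.11 into the cache before any tensor operation; without them it is a no-op.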

sbrunk commented Jun 26, 2023

Thanks @saudet for looking into it and for the workaround.

For reference, the workaround needs cuda-platform/cuda in addition to cuda-redist to have cusolver on the classpath. It also wants to load native bindings like libjnicudart.

Also, since I have to support CPU-only builds as well, cusolver might not be available in my case, so I'm checking whether it's on the classpath:

  import org.bytedeco.javacpp.Loader

  try {
    // Load cusolver explicitly; on CPU-only builds the class isn't on the classpath.
    val cusolver = Class.forName("org.bytedeco.cuda.global.cusolver")
    Loader.load(cusolver)
  } catch {
    case e: ClassNotFoundException => // ignore to avoid breaking CPU-only builds
  }

The only downside is that I need to ensure this runs before any tensor operations. I guess the only way to avoid that is to have a patched presets/torch.java that reliably triggers JavaCPP's loading magic, right?

saudet commented Jun 26, 2023

You don't have a "common/utils/whatever" class in which you can put stuff like that in a static { } block...?

sbrunk commented Jun 26, 2023

I do, but I wasn't able to find a way to always trigger loading of that utils class containing the static block before making native calls, due to a combination of my use of Scala top-level methods and Scala's poor interop with static methods.

I've refactored my code now in a way that should trigger the cusolver loading reliably before calling any native code.

sbrunk commented Jul 26, 2023

This is now fixed via #1360

sbrunk closed this as completed Jul 26, 2023