tensorflow-1.0.0-1.3 requires CUDA #396

Closed
thunterdb opened this issue Mar 14, 2017 · 37 comments

@thunterdb

It seems that the official version of JavaCPP's TensorFlow in Maven Central has a linking dependency on libcudart. This is problematic for upstream packages that may need to run in a CPU-only environment. Do you have some plans to publish a CPU-only version?

Also, I am not a legal expert, but if libcudart were to be statically linked, I wonder whether the NVIDIA license would allow publishing the final artifact on a public repository like Maven Central.

Cause: java.lang.UnsatisfiedLinkError: /home/travis/.javacpp/cache/tensorflow-1.0.0-1.3-linux-x86_64.jar/org/bytedeco/javacpp/linux-x86_64/libjnitensorflow.so: libcudart.so.8.0: cannot open shared object file: No such file or directory
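
When a native library fails to load like this, the name of the missing dependency is embedded in the exception message. The following is a minimal sketch, not part of JavaCPP, that pulls that name out so a build or test harness can report it clearly. The class and method names are hypothetical, and the pattern assumes the Linux dynamic loader wording shown in the error above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MissingLibraryDiagnostic {
    // Assumes the Linux loader wording seen in the error above:
    // "<name>: cannot open shared object file: ..."
    private static final Pattern MISSING =
            Pattern.compile("([^\\s:/]+): cannot open shared object file");

    /** Returns the shared library the loader could not find, if the message names one. */
    static Optional<String> missingLibrary(String message) {
        if (message == null) {
            return Optional.empty();
        }
        Matcher m = MISSING.matcher(message);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Message copied from the report above.
        String message = "/home/travis/.javacpp/cache/tensorflow-1.0.0-1.3-linux-x86_64.jar/"
                + "org/bytedeco/javacpp/linux-x86_64/libjnitensorflow.so: libcudart.so.8.0: "
                + "cannot open shared object file: No such file or directory";
        System.out.println(missingLibrary(message).orElse("(no missing library recognised)"));
    }
}
```

In practice the message would come from catching `UnsatisfiedLinkError` around whatever call triggers the native load and passing `e.getMessage()` to `missingLibrary`.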

Thank you in advance. This will unblock the next release of TensorFrames:
databricks/tensorframes#74

cc @saudet

@saudet
Member

saudet commented Mar 14, 2017

Thanks for the feedback! The previous build on Maven Central also links with CUDA, this is not new. Is this a new requirement for TensorFrames?

@saudet
Member

saudet commented Mar 15, 2017

I think linking statically with cudart is reasonable. I'm not sure if we're technically allowed, but we've been doing it as part of ND4J (DL4J) because it turned out that the ABIs of different CUDA patch versions were sometimes incompatible with each other.

Anyway, here are SNAPSHOT binaries for linux-x86_64:
https://oss.sonatype.org/content/repositories/snapshots/org/bytedeco/javacpp-presets/tensorflow/1.0.1-1.3.3-SNAPSHOT/
Could you give those a try and let me know if there is anything else that would need to be fixed? Thanks!

saudet added a commit that referenced this issue Mar 15, 2017
Link TensorFlow statically with `cudart` to avoid dependency on CUDA (issue #396)

@thunterdb
Author

Thank you Samuel. The previous version of TensorFrames was using 0.8.0, which was not compiled with GPU acceleration.

Speaking about ancient versions, we are getting there. I am just hitting an ABI compatibility issue with travis [1]:

Cause: java.lang.UnsatisfiedLinkError: /home/travis/.javacpp/cache/tensorflow-1.0.1-1.3.3-SNAPSHOT-linux-x86_64.jar/org/bytedeco/javacpp/linux-x86_64/libjnitensorflow.so: 
/usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /home/travis/.javacpp/cache/tensorflow-1.0.1-1.3.3-SNAPSHOT-linux-x86_64.jar/org/bytedeco/javacpp/linux-x86_64/libjnitensorflow.so)

Would you mind trying to compile tensorflow with GCC <= 4.9 and libstdc++.so.6.0.19 or older? From a quick look at the java tensorflow experimental bindings, this is the compatibility level they are also targeting.
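
The ABI failure above comes down to a version comparison: the binary requires a symbol version tag that the host's libstdc++ does not export. Below is a minimal sketch of that check, assuming the error wording shown above. The class and method names are hypothetical, and the "newest available" tag is assumed to be supplied by the caller (for example, read off by inspecting the host library by hand).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AbiVersionCheck {
    // Matches the loader wording seen above: version `CXXABI_1.3.8' not found
    private static final Pattern REQUIRED = Pattern.compile("version `([^']+)' not found");

    /** Extracts the required symbol version tag, e.g. "CXXABI_1.3.8". */
    static Optional<String> requiredTag(String message) {
        Matcher m = REQUIRED.matcher(message);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Compares dotted version numbers numerically, so that 3.4.20 > 3.4.9. */
    static int compareVersions(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** True if the newest tag the host exports is at least the required one. */
    static boolean isSatisfiedBy(String required, String newestAvailable) {
        int r = required.lastIndexOf('_');
        int n = newestAvailable.lastIndexOf('_');
        // Tags from different families (CXXABI vs GLIBCXX) are not comparable.
        if (!required.substring(0, r).equals(newestAvailable.substring(0, n))) {
            return false;
        }
        return compareVersions(required.substring(r + 1), newestAvailable.substring(n + 1)) <= 0;
    }

    public static void main(String[] args) {
        String message = "/usr/lib/x86_64-linux-gnu/libstdc++.so.6: "
                + "version `CXXABI_1.3.8' not found (required by libjnitensorflow.so)";
        String required = requiredTag(message).orElseThrow(IllegalStateException::new);
        // "CXXABI_1.3.7" is an illustrative value for an older host, not a measured one.
        System.out.println(required + " satisfied: " + isSatisfiedBy(required, "CXXABI_1.3.7"));
    }
}
```

The numeric comparison matters because a plain string comparison would order `3.4.9` after `3.4.20`.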

Thanks!

@saudet
Member

saudet commented Mar 15, 2017

Yes, I'll do the release with CentOS 7, so that won't be a problem. Anything else?

@thunterdb
Author

This is the only issue I can think of (but then I was not expecting the ABI compatibility issue either...)

@saudet
Member

saudet commented Mar 16, 2017

Ok, just to make sure that everything is alright, I've redeployed the SNAPSHOT binaries from CentOS 7. Let me know how those do! Thanks

@thunterdb
Author

@saudet I confirm that the latest artifact works as expected in a CPU-only environment. You can close the ticket, I am looking forward to the next release.

On a personal note, if by any chance you have the time to build the macosx binaries as well, I will be very grateful.

@saudet
Member

saudet commented Mar 17, 2017

Sure thing, I've deployed SNAPSHOT binaries for Mac as well. Let's make sure those work properly before a release! Thanks

@thunterdb
Author

@saudet It looks like the mac build requires cuda, which I have not tried to install on my machine. It currently crashes the JVM. Here is the relevant error:

I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcublas.8.0.dylib. LD_LIBRARY_PATH: /usr/local/mysql/lib/:
I tensorflow/stream_executor/cuda/cuda_blas.cc:2294] Unable to load cuBLAS DSO.
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcudnn.5.dylib. LD_LIBRARY_PATH: /usr/local/mysql/lib/:
I tensorflow/stream_executor/cuda/cuda_dnn.cc:3517] Unable to load cuDNN DSO
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcufft.8.0.dylib. LD_LIBRARY_PATH: /usr/local/mysql/lib/:
I tensorflow/stream_executor/cuda/cuda_fft.cc:338] Unable to load cuFFT DSO.
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcuda.1.dylib. LD_LIBRARY_PATH: /usr/local/mysql/lib/:
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcuda.dylib. LD_LIBRARY_PATH: /usr/local/mysql/lib/:
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: Timothys-MacBook-Pro-2.local
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fff99444132, pid=7405, tid=25879
#
# JRE version: Java(TM) SE Runtime Environment (8.0_65-b17) (build 1.8.0_65-b17)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.65-b01 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libsystem_c.dylib+0x1132]  strlen+0x12

@saudet
Member

saudet commented Mar 19, 2017

Interesting. I thought I had some problem with my old machine, so I tried on a newer one with 10.12, but I get exactly the same thing. Seems to be a known regression: tensorflow/tensorflow#2980 (comment)

So, what do you think we should do?

@saudet
Member

saudet commented Mar 19, 2017

Actually, the workaround mentioned in the issue above appears to work for me. If I set the LD_LIBRARY_PATH environment variable to something like "/usr/lib", then it magically runs! Would that be satisfactory?

@thunterdb
Author

@saudet thanks for looking into it, and for the pointers. It looks like it is a known issue with recent versions of TensorFlow, and it depends on particular combinations of TensorFlow + Java + hardware. I still experience the issue on macOS, but I can use Docker to run the tests locally. Not being able to run the macOS build is a slight inconvenience, but unless other people are experiencing the same issue, I am happy with the current artifacts. Feel free to close the ticket.

By the way, it looks like you have published a new pom file on Saturday (tensorflow-1.0.1-1.3.3-20170318.232613-4.pom) but did not push the jar files.

@akdeoras

@saudet many thanks for releasing the new binaries. I agree with @thunterdb that not being able to run on a Mac is a little inconvenient. I would therefore like your opinion on releasing CUDA-free versions for Mac and Linux, similar to what Google may be planning: TFlow-JAVA-Readme

@saudet
Member

saudet commented Mar 23, 2017

@akdeoras Yes that's fine, would you be willing to make a contribution?

@akdeoras

@saudet let me give it a shot. I did build locally with CUDA and all worked as expected on mac and linux.

@saudet
Member

saudet commented Mar 24, 2017 via email

@akdeoras

Yes, that's what I mean too :)

@saudet
Member

saudet commented Mar 25, 2017

Awesome! I'll be waiting for your pull request.

In the meantime, I've released new binaries:
http://repo1.maven.org/maven2/org/bytedeco/javacpp-presets/tensorflow/1.0.1-1.3/
We can use them without CUDA by executing, for example, the following before using them:

export LD_LIBRARY_PATH=/usr/lib
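
A running JVM has no standard way to change its own environment, so the variable has to be in place before the process starts. When exporting it in a shell is not convenient (for example, from a test harness), one option is to start a child JVM with the variable set. The sketch below illustrates this with `ProcessBuilder`; the class name and the choice of main class are hypothetical, and `/usr/lib` is simply the value suggested above.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LaunchWithLibraryPath {
    /** Prepares a child JVM whose environment contains LD_LIBRARY_PATH. */
    static ProcessBuilder childJvm(String libraryPath, String mainClass) {
        // Reuse the java binary and class path of the current JVM.
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> command = new ArrayList<>();
        command.add(javaBin);
        command.add("-cp");
        command.add(System.getProperty("java.class.path"));
        command.add(mainClass);

        ProcessBuilder builder = new ProcessBuilder(command);
        // The child inherits the parent's environment plus this entry.
        builder.environment().put("LD_LIBRARY_PATH", libraryPath);
        return builder;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 1) {
            System.err.println("usage: LaunchWithLibraryPath <main-class>");
            return;
        }
        // Forward the child's output to this console and wait for it to finish.
        int exitCode = childJvm("/usr/lib", args[0]).inheritIO().start().waitFor();
        System.exit(exitCode);
    }
}
```

This only automates the `export` step; it does not change whether the workaround helps on a given machine.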

saudet closed this as completed Mar 25, 2017

@akdeoras

@saudet I tried the release above, but it did not work for me (I think @thunterdb hit similar exceptions). As for the code changes, I realized that it won't suffice to change only the tensorflow project to produce separate CUDA and no-CUDA releases. You have a bunch of C++ projects in there, and they all share a very consistent build and release process. Releasing CUDA and no-CUDA jars for tensorflow would mean we would have to update the other projects too. That seems like a big change to me.

Do you have any advice on how to go about it?

@saudet
Member

saudet commented Mar 26, 2017 via email

@akdeoras

akdeoras commented Mar 27, 2017

You are right. What I meant is that if we change the TensorFlow preset to publish two jars, one with CUDA and one without, then we will have to change the jar names to something like:
tensorflow-1.0.1-1.3-macosx-x86_64_noCUDA.jar
Since the arch name is embedded consistently in the jar file names of all the other preset projects, changing it from 'macosx-x86_64' to 'macosx-x86_64_noCUDA' would make the tensorflow preset project different from the others. Is that OK?

@saudet
Member

saudet commented Mar 27, 2017

It sounds reasonable, yes, but not very useful if someone wants to use CUDA when CUDA is available...

@saudet
Member

saudet commented Mar 27, 2017

So, instead of having different architecture names, the names of the libraries should be different. A bit like what we have to do with FFTW to get it working for both float and double data:
https://github.com/bytedeco/javacpp-presets/tree/master/fftw

@akdeoras

Thanks @saudet. I looked at the fftw project and understood what you are suggesting. So if I understand correctly, the process will be to build two separate .so libraries of TensorFlow: one with CUDA and one without. What we are not sure about is how to specify which native library to load at runtime in our Java application. Can you point to an example if you have one?

@saudet
Member

saudet commented Mar 28, 2017

Yes, two .so files. The user will decide which one to use, that's the point, no?

@saudet
Member

saudet commented Mar 28, 2017

How to do this? We can simply call Loader.load() manually with the library that we want to load, which will bind the JNI functions, and that's it.

@saudet
Member

saudet commented Mar 29, 2017

@akdeoras One more thing, to make sure that JavaCPP doesn't try to load libraries on its own, we'll need to suffix them with "#" in the class properties. This in effect marks that library as "system" or "provided". Specifically, something like this:

@Platform(... , link = "tensorflow_cc#", library = "tensorflow#")
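
The selection logic on the application side could look roughly like the sketch below. It is deliberately free of any JavaCPP dependency: the actual load call is passed in as a function, where a real application would use the manual `Loader.load()` call mentioned above (or `System::loadLibrary`). The library names `jnitensorflow` and `jnitensorflow-gpu` and the fall-back-to-CPU policy are assumptions made for illustration, not something the presets define.

```java
import java.util.function.Consumer;

public class VariantLoader {
    // Hypothetical names for the two builds of the native library.
    static final String CPU_LIBRARY = "jnitensorflow";
    static final String GPU_LIBRARY = "jnitensorflow-gpu";

    /**
     * Loads the GPU build when requested and loadable, otherwise the CPU build.
     * The loader argument performs the actual native load and is expected to
     * throw UnsatisfiedLinkError on failure, as System.loadLibrary does.
     *
     * @return the name of the library that was loaded
     */
    static String load(boolean preferGpu, Consumer<String> loader) {
        if (preferGpu) {
            try {
                loader.accept(GPU_LIBRARY);
                return GPU_LIBRARY;
            } catch (UnsatisfiedLinkError e) {
                // CUDA (or the GPU build itself) is not available; fall back.
            }
        }
        loader.accept(CPU_LIBRARY);
        return CPU_LIBRARY;
    }

    public static void main(String[] args) {
        // Stub loader that simulates a machine without CUDA, so that the
        // sketch runs without any native libraries installed.
        Consumer<String> noCudaMachine = name -> {
            if (name.equals(GPU_LIBRARY)) {
                throw new UnsatisfiedLinkError(name);
            }
        };
        System.out.println("loaded: " + load(true, noCudaMachine));
    }
}
```

Keeping the decision in application code matches the point made above that the user, not the loader, chooses which binary to use.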

@saudet
Member

saudet commented Jun 3, 2017

@akdeoras Hi, have you made any progress on this? It looks like we're going to have the same problem with OpenCV (pull #416), so let's see what we can do together about this. There isn't any complete example for something like this. There are bits and pieces that need to be put together, and I will help. But please let me know where you stumble and I will help with those places in priority so we can even out the effort. Thanks for your interest!

/cc @SamCarlberg

@spi-x-i

spi-x-i commented Sep 29, 2017

Hi all. Thank you for your great work :)
I'm struggling to run the javacpp-presets exampletrainer.
I defined the following POM file:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.bytedeco.javacpp-presets.tensorflow</groupId>
    <artifactId>exampletrainer</artifactId>
    <version>1.3</version>
    <properties>
        <exec.mainClass>ExampleTrainer</exec.mainClass>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.bytedeco.javacpp-presets</groupId>
            <artifactId>tensorflow-platform</artifactId>
            <version>1.0.1-1.3</version>
        </dependency>
    </dependencies>
</project>

It gets through all the libraries, but then I am stuck on this error:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fffcbd97b52, pid=6226, tid=0x0000000000004f03
#
# JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libsystem_c.dylib+0x1b52]  strlen+0x12
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

I also set the environment variable with export LD_LIBRARY_PATH=/usr/lib, but I still get the SIGSEGV.

I am running the example on Mac OS X 10.12.6. I saw a very similar error above, so I thought I would post my issue here. Thanks all,

Andrea

@saudet
Member

saudet commented Oct 1, 2017

@spi-x-i Please try again with 1.3.0-1.3.4-SNAPSHOT and let me know if that doesn't work, thanks!
https://github.com/bytedeco/javacpp-presets/tree/master/tensorflow#the-pomxml-build-file

@bigfanofcpp

<dependencies>
    <dependency>
        <groupId>org.bytedeco.javacpp-presets</groupId>
        <artifactId>tensorflow-platform</artifactId>
        <version>1.4.0-1.3.4-SNAPSHOT</version>
    </dependency>
</dependencies>

This version gives an error: the artifact is not found. Please help me.

@saudet
Member

saudet commented Dec 8, 2017

Make sure to follow the instructions here: http://bytedeco.org/builds/

@andyguest

There don't seem to be any newer snapshots for tensorflow-platform at sonatype, despite attempting to follow your instructions above.

[ERROR] Failed to execute goal on project exampletrainer: Could not resolve dependencies for project org.bytedeco.javacpp-presets.tensorflow:exampletrainer:jar:1.3.4: Could not find artifact org.bytedeco.javacpp-presets:tensorflow-platform:jar:1.4.0-1.3-SNAPSHOT in sonatype-nexus-snapshots (https://oss.sonatype.org/content/repositories/snapshots)

@saudet
Member

saudet commented Dec 13, 2017

@andyguest The version is 1.4.0-1.3.4-SNAPSHOT, you can check the list to make sure:
https://oss.sonatype.org/content/repositories/snapshots/org/bytedeco/javacpp-presets/tensorflow-platform/

@andyguest

Fantastic. That works perfectly now! :-)

@saudet
Member

saudet commented Dec 14, 2017

@saudet
Member

saudet commented Dec 20, 2017

@akdeoras FYI, separate CPU-only and GPU-enabled builds are now a reality! In addition to tensorflow-platform, adding something like the following will download and use the binaries for CUDA:

        <dependency>
            <groupId>org.bytedeco.javacpp-presets</groupId>
            <artifactId>tensorflow</artifactId>
            <version>1.4.0-1.3.4-SNAPSHOT</version>
            <classifier>linux-x86_64-gpu</classifier>
        </dependency>
