Strategies for getting Tensorflow-Java on Apple Silicon? #394

Closed
kgoderis opened this issue Nov 8, 2021 · 72 comments

Comments

@kgoderis

kgoderis commented Nov 8, 2021

Like some others, I need to get TensorFlow-Java running on an M1-based machine, especially now that Apple has released a TensorFlow distribution for M1.

[I know there is https://github.com//issues/252 but I want to revive the discussion after Apple's recent efforts]

Before even attempting this, I was wondering whether any of the following strategies make sense or, alternatively, actually work:

  1. Compile from source, target the arm64 arch, using arm64 tools (e.g. Bazel), and run java.jar using an arm64 JVM like Zulu 8.58.0.13-CA-macos-aarch64

[This one fails based on the current HEAD. (java.lang.NoSuchMethodError: 'java.lang.Iterable com.sun.tools.javac.code.Scope$WriteableScope.getSymbolsByName(com.sun.tools.javac.util.Name, com.sun.tools.javac.util.Filter)'). It does not even get to the TF native build phase]

  2. Compile from source using x86 tools (e.g. in an "arch -x86_64 zsh" shell), taking into account specific guidelines, e.g. removing usage of specific instruction sets. Consequently, run the java.jar using an x86 JVM, i.e. under Rosetta

  3. Any other angle to look at the problem?

[For that matter, how to leverage other ML frameworks on M1, e.g. deeplearning4j?]

@kgoderis
Author

kgoderis commented Nov 8, 2021

Note to self: it seems that the above error is due to my building against a 1.8 JDK instead of something more recent.

@Craigacp
Collaborator

Craigacp commented Nov 8, 2021

Don't try to compile TF-Java under Rosetta: you'll pull in a TF binary that contains AVX instructions, which will cause a SIGILL and take down the JVM.

I've not tried to compile it on an M1 since we bumped to TF 2.6.0 and made some build changes, but I can take a look at doing that. Theoretically you should be able to run mvn package and have it build everything, but I think you'll need to be in a venv that has a version of numpy installed, and to be running Bazel natively rather than via Rosetta. After that it comes down to a bunch of weird configuration things in Bazel which we might not be patching appropriately.

As for other ML frameworks, I've personally got XGBoost and ONNX Runtime working in Java on an M1 Mac and contributed any fixes back upstream. We had ONNX Runtime working a month or two after the M1 came out. Anything that's in pure Java will work just fine on an M1, but I've not looked at dl4j or djl which both have large native libraries inside.
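
On the native-versus-Rosetta point above, a quick way to see which kind of JVM is actually in use is to print its os.arch property. This is only a minimal sketch: the class name JvmArchCheck is made up for illustration, and it assumes that an arm64 JVM reports "aarch64" (the value macOS arm64 builds of OpenJDK normally give), with "arm64" accepted as a fallback.

```java
// Sketch: report which CPU architecture the running JVM was built for.
// JvmArchCheck is a hypothetical helper, not part of TF-Java.
final class JvmArchCheck {

    /** Returns true when an os.arch value denotes a 64-bit ARM JVM. */
    static boolean isArm64(String osArch) {
        return "aarch64".equals(osArch) || "arm64".equals(osArch);
    }

    public static void main(String[] args) {
        String arch = System.getProperty("os.arch");
        System.out.println("os.arch      = " + arch);
        System.out.println("java.version = " + System.getProperty("java.version"));
        if (isArm64(arch)) {
            System.out.println("Native arm64 JVM.");
        } else {
            System.out.println("Not an arm64 JVM; on an M1 this probably means it runs under Rosetta.");
        }
    }
}
```

Running it with the same java binary that Maven and Bazel pick up should show whether the build is going through Rosetta without anyone noticing.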

@kgoderis
Author

kgoderis commented Nov 8, 2021

@Craigacp Building under Rosetta but using a TF build config file without the AVX instructions is not an option then? I was not aware that it pulls a TF binary; I was under the impression that it pulls the TF repo and compiles TF as part of the TF-J build process.

@kgoderis
Author

kgoderis commented Nov 8, 2021

@Craigacp Any pointers on how to get ONNX going, because this is what I get on the home page? LOL
[image]

[Edit: I presume you did some cross-compiling to get it to work. Going through the docs right now...]

@Craigacp
Collaborator

Craigacp commented Nov 8, 2021

@Craigacp Building under Rosetta but using a TF build config file without the AVX instructions is not an option then? I was not aware that it pulls a TF binary; I was under the impression that it pulls the TF repo and compiles TF as part of the TF-J build process.

Java is slow under Rosetta as it messes with the JIT. You could compile TF without AVX support under Rosetta, but it would probably be fairly slow, and at that point I'm not sure what the utility of it is.

@Craigacp Any pointers on how to get ONNX going, because this is what I get on the home page? LOL [image]

[Edit: I presume you did some cross-compiling to get it to work. Going through the docs right now...]

I've not tried cross-compiling. Check out the ONNX Runtime repo on an M1 Mac and then compile it as normal for Java: ./build.sh --update --build --config Release --parallel --build_java --test.

@kgoderis
Author

kgoderis commented Nov 8, 2021

Java is slow under Rosetta as it messes with the JIT. You could compile TF without AVX support under Rosetta, but it would probably be fairly slow, and at that point I'm not sure what the utility of it is.

Well, my main dev machine is now an M1, obviously. So the utility lies in developing TF models locally and then training them on a TPU/x86 cloud-based machine. I just want to avoid any pain in my development process.

To compile it under Rosetta, I presume I need an x86 JVM installed alongside other x86 tools like Bazel, right?

@Craigacp
Collaborator

Craigacp commented Nov 8, 2021

Yes, you'll need a full x86 development stack, including Python, probably including compilers as well, and then you might need to change how it finds the compilers to make sure it picks the x86 ones.

@saudet
Contributor

saudet commented Nov 8, 2021

Some people seem to be able to get arm64 binaries for TF 2.6.0, for example, see tensorflow/tensorflow#52160 (comment).
If that is true, running the build for TF Java with a command line like this should work:

BUILD_FLAGS="--cpu=darwin_arm64 --host_cpu=darwin_arm64" mvn clean install

@kgoderis
Author

@saudet That did not work unfortunately.

I am able to compile the TensorFlow repo (tensorflow/tensorflow#52160 (comment)), but then TF-J fails with:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile (default-compile) on project tensorflow-core-generator: Compilation failure
[ERROR] /Users/kgoderis/Development/tensorflow-java/tensorflow-core/tensorflow-core-generator/src/main/java/org/tensorflow/proto/framework/OpListOrBuilder.java:[23,7] error: An unhandled exception was thrown by the Error Prone static analysis plugin.
[ERROR]      Please report this at https://github.com/google/error-prone/issues/new and include the following:
[ERROR]   
[ERROR]      error-prone version: 2.6.0
[ERROR]      BugPattern: JavaLangClash
[ERROR]      Stack Trace:
[ERROR]      java.lang.NoSuchMethodError: 'java.lang.Iterable com.sun.tools.javac.code.Scope$WriteableScope.getSymbolsByName(com.sun.tools.javac.util.Name, com.sun.tools.javac.util.Filter)'
[ERROR]   	at com.google.errorprone.bugpatterns.JavaLangClash.check(JavaLangClash.java:66)
[ERROR]   	at com.google.errorprone.bugpatterns.JavaLangClash.matchClass(JavaLangClash.java:53)
[ERROR]   	at com.google.errorprone.scanner.ErrorProneScanner.processMatchers(ErrorProneScanner.java:450)
[ERROR]   	at com.google.errorprone.scanner.ErrorProneScanner.visitClass(ErrorProneScanner.java:548)
[ERROR]   	at com.google.errorprone.scanner.ErrorProneScanner.visitClass(ErrorProneScanner.java:151)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.tree.JCTree$JCClassDecl.accept(JCTree.java:860)
[ERROR]   	at jdk.compiler/com.sun.source.util.TreePathScanner.scan(TreePathScanner.java:86)
[ERROR]   	at com.google.errorprone.scanner.Scanner.scan(Scanner.java:74)
[ERROR]   	at com.google.errorprone.scanner.Scanner.scan(Scanner.java:48)
[ERROR]   	at jdk.compiler/com.sun.source.util.TreeScanner.scan(TreeScanner.java:111)
[ERROR]   	at jdk.compiler/com.sun.source.util.TreeScanner.scanAndReduce(TreeScanner.java:119)
[ERROR]   	at jdk.compiler/com.sun.source.util.TreeScanner.visitCompilationUnit(TreeScanner.java:152)
[ERROR]   	at com.google.errorprone.scanner.ErrorProneScanner.visitCompilationUnit(ErrorProneScanner.java:561)
[ERROR]   	at com.google.errorprone.scanner.ErrorProneScanner.visitCompilationUnit(ErrorProneScanner.java:151)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.tree.JCTree$JCCompilationUnit.accept(JCTree.java:614)
[ERROR]   	at jdk.compiler/com.sun.source.util.TreePathScanner.scan(TreePathScanner.java:60)
[ERROR]   	at com.google.errorprone.scanner.Scanner.scan(Scanner.java:58)
[ERROR]   	at com.google.errorprone.scanner.ErrorProneScannerTransformer.apply(ErrorProneScannerTransformer.java:43)
[ERROR]   	at com.google.errorprone.ErrorProneAnalyzer.finished(ErrorProneAnalyzer.java:152)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.api.MultiTaskListener.finished(MultiTaskListener.java:132)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.main.JavaCompiler.flow(JavaCompiler.java:1394)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.main.JavaCompiler.flow(JavaCompiler.java:1341)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.main.JavaCompiler.compile(JavaCompiler.java:933)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.main.Main.compile(Main.java:317)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.main.Main.compile(Main.java:176)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.Main.compile(Main.java:64)
[ERROR]   	at jdk.compiler/com.sun.tools.javac.Main.main(Main.java:50)
[ERROR] 

On the other hand, diving into ./tensorflow-core/tensorflow-core-api, where the TF core should be built, I was able to start the compilation (sudo bazel build --config opt --cpu=darwin_arm64 --host_cpu=darwin_arm64 --incompatible_restrict_string_escapes=false --experimental_repo_remote_exec --define=ABSOLUTE_JAVABASE=/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home --host_javabase=@bazel_tools//tools/jdk:absolute_javabase //:all), but it soon exited with errors like these:

external/org_tensorflow/tensorflow/core/platform/default/port.cc:360:14: error: no matching constructor for initialization of 'tensorflow::port::MemoryInfo'
  MemoryInfo mem_info = {INT64_MAX, INT64_MAX};
             ^          ~~~~~~~~~~~~~~~~~~~~~~
external/org_tensorflow/tensorflow/core/platform/mem.h:62:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 2 were provided
struct MemoryInfo {
       ^
external/org_tensorflow/tensorflow/core/platform/mem.h:62:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 2 were provided
external/org_tensorflow/tensorflow/core/platform/mem.h:62:8: note: candidate constructor (the implicit default constructor) not viable: requires 0 arguments, but 2 were provided
external/org_tensorflow/tensorflow/core/platform/default/port.cc:373:23: error: no matching constructor for initialization of 'tensorflow::port::MemoryBandwidthInfo'
  MemoryBandwidthInfo membw_info = {INT64_MAX};
                      ^            ~~~~~~~~~~~
external/org_tensorflow/tensorflow/core/platform/mem.h:67:8: note: candidate constructor (the implicit copy constructor) not viable: no known conversion from 'long long' to 'const tensorflow::port::MemoryBandwidthInfo' for 1st argument
struct MemoryBandwidthInfo {
       ^
external/org_tensorflow/tensorflow/core/platform/mem.h:67:8: note: candidate constructor (the implicit move constructor) not viable: no known conversion from 'long long' to 'tensorflow::port::MemoryBandwidthInfo' for 1st argument
external/org_tensorflow/tensorflow/core/platform/mem.h:67:8: note: candidate constructor (the implicit default constructor) not viable: requires 0 arguments, but 1 was provided

@kgoderis
Author

Update: bumping Google's Error Prone to <errorprone.version>2.10.0</errorprone.version> fixes this error.

@kgoderis
Author

kgoderis commented Nov 14, 2021

Update: changing .bazelrc in tensorflow-core/tensorflow-core-api to the following gets the compilation of that Maven module going:

build --remote_cache=https://storage.googleapis.com/tensorflow-sigs-jvm
build --remote_upload_local_results=false
build --action_env PYTHON_BIN_PATH="/Users/kgoderis/miniforge3/bin/python3"
build --action_env PYTHON_LIB_PATH="/Users/kgoderis/miniforge3/lib/python3.9/site-packages"
build --python_path="/Users/kgoderis/miniforge3/bin/python3"
build:opt --copt=-Wno-sign-compare
build:opt --host_copt=-Wno-sign-compare
test --flaky_test_attempts=3
test --test_size_filters=small,medium
test:v1 --test_tag_filters=-benchmark-test,-no_oss,-gpu,-nomac,-no_mac,-oss_serial
test:v1 --build_tag_filters=-benchmark-test,-no_oss,-gpu,-nomac,-no_mac
test:v2 --test_tag_filters=-benchmark-test,-no_oss,-gpu,-nomac,-no_mac,-oss_serial,-v1only
test:v2 --build_tag_filters=-benchmark-test,-no_oss,-gpu,-nomac,-no_mac,-v1only
build --incompatible_restrict_string_escapes=false
build --experimental_repo_remote_exec
build --define=ABSOLUTE_JAVABASE=/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home
build --host_javabase=@bazel_tools//tools/jdk:absolute_javabase

In addition, I bumped .bazelversion to 4.2.1 and changed build.sh to make Bazel run under sudo.

The configuration still contains references to the setup on my dev machine, but we are advancing ;-)

@kgoderis
Author

kgoderis commented Nov 14, 2021

Some of you will be happy. I got the whole thing compiled; however, I had to skip the tests because they were failing, and there were some warnings about TARGET_OS_IPHONE. Apart from that, it kinda looks good:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for TensorFlow Java Parent 0.4.0-SNAPSHOT:
[INFO] 
[INFO] TensorFlow Java Parent ............................. SUCCESS [  0.476 s]
[INFO] TensorFlow Core Parent ............................. SUCCESS [  0.009 s]
[INFO] TensorFlow Core Generators ......................... SUCCESS [  0.277 s]
[INFO] TensorFlow Core API Library ........................ SUCCESS [ 48.684 s]
[INFO] TensorFlow Core API Library Platform ............... SUCCESS [  0.020 s]
[INFO] TensorFlow Framework Library ....................... SUCCESS [  0.069 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  49.599 s
[INFO] Finished at: 2021-11-14T13:39:12+01:00
[INFO] ------------------------------------------------------------------------
(base) kgoderis@Karels-M1-MacBook-Pro target % pwd
/Users/kgoderis/Development/tensorflow-java/tensorflow-core/tensorflow-core-api/target
(base) kgoderis@Karels-M1-MacBook-Pro target % ls -la 
total 180432
drwxr-xr-x  12 root      staff       384 Nov 14 13:19 .
drwxr-xr-x  17 kgoderis  staff       544 Nov 14 13:16 ..
drwxr-xr-x   3 root      staff        96 Nov 14 13:16 classes
drwxr-xr-x   3 root      staff        96 Nov 14 13:16 generated-sources
drwxr-xr-x   3 root      staff        96 Nov 14 13:16 generated-test-sources
drwxr-xr-x   3 root      staff        96 Nov 14 13:17 maven-archiver
drwxr-xr-x   3 root      staff        96 Nov 14 13:16 maven-status
drwxr-xr-x   3 root      staff        96 Nov 14 13:16 native
drwxr-xr-x  82 root      staff      2624 Nov 14 13:19 surefire-reports
-rw-r--r--   1 root      staff  78551487 Nov 14 13:32 tensorflow-core-api-0.4.0-SNAPSHOT-macosx-arm64.jar
-rw-r--r--   1 root      staff   8245523 Nov 14 13:32 tensorflow-core-api-0.4.0-SNAPSHOT.jar
drwxr-xr-x   5 root      staff       160 Nov 14 13:17 test-classes
(base) kgoderis@Karels-M1-MacBook-Pro tensorflow % pwd
/Users/kgoderis/Development/tensorflow-java/tensorflow-core/tensorflow-core-api/bazel-bin/external/org_tensorflow/tensorflow
(base) kgoderis@Karels-M1-MacBook-Pro tensorflow % file libtensorflow_framework.2.6.0.dylib
libtensorflow_framework.2.6.0.dylib: Mach-O 64-bit dynamically linked shared library arm64

@kgoderis
Author

kgoderis commented Nov 15, 2021

The build fails with Java 8 (arm64):

ERROR: /private/var/tmp/_bazel_root/6712ec151cb8fc337cc5082ff0f496e3/external/bazel_tools/tools/jdk/BUILD:346:14: Action external/bazel_tools/tools/jdk/platformclasspath.jar failed: (Exit 1): java failed: error executing command 
  (cd /private/var/tmp/_bazel_root/6712ec151cb8fc337cc5082ff0f496e3/execroot/tensorflow_core_api && \
  exec env - \
  /Library/Java/JavaVirtualMachines/zulu-8-arm64.jdk/Contents/Home/bin/java -XX:+IgnoreUnrecognizedVMOptions '--add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.platform=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED' -cp bazel-out/darwin_arm64-opt/bin/external/bazel_tools/tools/jdk/platformclasspath_classes:/Library/Java/JavaVirtualMachines/zulu-8-arm64.jdk/Contents/Home/lib/tools.jar DumpPlatformClassPath bazel-out/darwin_arm64-opt/bin/external/bazel_tools/tools/jdk/platformclasspath.jar external/local_jdk)
Execution platform: @local_execution_config_platform//:platform
Exception in thread "main" java.lang.AssertionError: 
Could not find java.lang.Object on bootclasspath; something has gone terribly wrong.
Please file a bug: https://github.com/bazelbuild/bazel/issues
	at DumpPlatformClassPath.writeEntries(DumpPlatformClassPath.java:136)
	at DumpPlatformClassPath.writeClassPathJars(DumpPlatformClassPath.java:174)
	at DumpPlatformClassPath.dumpJDK8BootClassPath(DumpPlatformClassPath.java:77)
	at DumpPlatformClassPath.main(DumpPlatformClassPath.java:65)
ERROR: /private/var/tmp/_bazel_root/6712ec151cb8fc337cc5082ff0f496e3/external/com_google_protobuf/BUILD:290:15 Building external/com_google_protobuf/libany_proto-speed.jar (1 source jar) failed: (Exit 1): java failed: error executing command 

[Update]
It seems one has to be explicit about the toolchain in .bazelrc. I added

build --host_javabase=@bazel_tools//tools/jdk:absolute_javabase
build --javabase=@bazel_tools//tools/jdk:absolute_javabase
build --host_java_toolchain=@bazel_tools//tools/jdk:toolchain_hostjdk8
build --java_toolchain=@bazel_tools//tools/jdk:toolchain_hostjdk8

but unfortunately it fails because in pom.xml we use --add-exports flags for the JVM, which are not supported by the 1.8 JDK. I am not sure why we need these flags in the first place (I know what the flag is supposed to do), nor what a workaround could be. Anyone?
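
On the --add-exports point: that flag belongs to the module system that arrived with Java 9, so a 1.8 toolchain does not know it. A minimal sketch for checking which side of that line a given JDK falls on, assuming nothing beyond the standard java.specification.version system property (the class and method names are made up for illustration):

```java
// Sketch: decide whether the running JDK is new enough to understand --add-exports.
// JavaFeatureVersion is a hypothetical helper, not part of the TF-Java build.
final class JavaFeatureVersion {

    /** Maps a java.specification.version value ("1.8", "11", ...) to its feature number. */
    static int featureVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int first = Integer.parseInt(parts[0]);
        // Runtimes before Java 9 report "1.x"; later ones report the feature number directly.
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    /** --add-exports arrived with the module system in Java 9. */
    static boolean supportsAddExports(int featureVersion) {
        return featureVersion >= 9;
    }

    public static void main(String[] args) {
        int feature = featureVersion(System.getProperty("java.specification.version"));
        System.out.println("Java feature version: " + feature);
        System.out.println("--add-exports understood: " + supportsAddExports(feature));
    }
}
```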

@Craigacp
Collaborator

If it builds with 11 why do you need to build it with 8? It should produce Java 8 compatible jar files even when compiled on 11.

@kgoderis
Author

If it builds with 11 why do you need to build it with 8? It should produce Java 8 compatible jar files even when compiled on 11.

Because I want to integrate this in a project which uses Spark NLP, and that only runs on a Java 8 VM. As far as I understand, Java 11 compiled jars do not run on older JVMs.

@Craigacp
Collaborator

TF-Java is compiled on 11 but targets 8, and so will produce class files which are compatible with Java 8.
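
One way to check this for yourself is to read the version stamp at the start of a class file extracted from the jar: after the 0xCAFEBABE magic come a two-byte minor and a two-byte major version, and major 52 corresponds to Java 8 while 55 corresponds to Java 11. A minimal sketch, assuming a single .class file has already been unpacked from the jar (ClassFileVersion is a made-up name for illustration):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: print the class-file major version of one .class file.
// ClassFileVersion is a hypothetical helper, not part of TF-Java.
final class ClassFileVersion {

    /** Reads the major version from the first 8 bytes of a class file. */
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("too short to be a class file");
        }
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("missing 0xCAFEBABE magic");
        }
        buf.getShort();                 // minor version, ignored here
        return buf.getShort() & 0xFFFF; // major version: 52 = Java 8, 55 = Java 11
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ClassFileVersion <path/to/Some.class>");
            return;
        }
        int major = majorVersion(Files.readAllBytes(Paths.get(args[0])));
        System.out.println("major version " + major + " -> Java " + (major - 44));
    }
}
```

If the classes in the jar report 52, a Java 8 VM should accept them regardless of which JDK ran the build.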

@kgoderis
Author

kgoderis commented Nov 15, 2021

TF-Java is compiled on 11 but targets 8, and so will produce class files which are compatible with Java 8.

Ah... I was not aware of this. That means we are good to go. Will you pick up what we did and get the jars onto sonatype?

@Craigacp
Collaborator

Craigacp commented Nov 15, 2021

I'm trying to replicate what you have on my M1 Mac so I can figure out what the test failures are, but I'm getting issues compiling protobuf.

We can't easily deploy to Maven Central as our builds are done through GitHub Actions and they don't have any Apple Silicon runners.

@kgoderis
Author

kgoderis commented Nov 15, 2021

@Craigacp I think I solved that by altering .bazelrc, cf. #394 (comment)

or this

build --define=ABSOLUTE_JAVABASE=/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home
build --host_javabase=@bazel_tools//tools/jdk:absolute_javabase

Not sure, in fact; I did many things and documented only half of them.

@Craigacp
Collaborator

I had to make some modifications to the pom files to get it to build the appropriate jars, and having to run the build as the superuser is worrying to me. I also had to add my JVM as a bazel build flag too. However I didn't need to do anything else to my .bazelrc other than build --incompatible_restrict_string_escapes=false. My build does now pass all the tests with those modifications.

We're upgrading to TF 2.7.0 at the moment; I'll rerun the build after that has merged, as it might fix some of the issues.

@kgoderis
Author

If I remember correctly, the tests failed on a dimension mismatch in the input matrix of an NN layer. Mind you, I was trying to compile against Java 8 at the time (cf. my misunderstanding above).

@Craigacp
Collaborator

Could you try and build a clean checkout of this branch - https://github.com/Craigacp/tensorflow-java/tree/apple-silicon ? It'll require sudo, and I don't want that in an actual build, but it would be a useful check if someone else can build it.

@kgoderis
Author

@Craigacp Trying to do so. However, I need Google Error Prone bumped to 2.10.0. And what about Bazel: 3.7.2 or 4.2.1?

@kgoderis
Author

In fact, I remember I went for Bazel 4.2.1 because there are no pre-compiled Bazel 3.7.2 binaries for macOS arm64, and I wanted to avoid compiling Bazel from source.

@Craigacp
Collaborator

Craigacp commented Nov 15, 2021

Error Prone should work on Java 11. That build is hard-coded to expect an Azul Zulu 11 installed on the system. The Bazel version is set to 4.2.1, and the whole thing should build with mvn clean package without other modifications, the same way the x86 builds do.

@rnett
Contributor

rnett commented Nov 16, 2021

Are you still trying to use Bazel 4.2.1? Because that's what the error is complaining about. We require 3.7.2 because tensorflow does.

@Craigacp
Collaborator

I set the bazelversion to 4.2.1. Maybe it's not cleaned the build properly?

@kgoderis
Author

It does not work

  • Zulu 11 + Error prone 2.6.0 -> error
  • still needs sudo
  • Bazel 4.2.1 needed as no arm binaries for 3.7.2 are available

@kgoderis
Author

Are you still trying to use Bazel 4.2.1? Because that's what the error is complaining about. We require 3.7.2 because tensorflow does.

Yes, but I got the whole thing compiled with 4.2.1 last weekend

@saudet
Contributor

saudet commented Nov 16, 2021

I had to make some modifications to the pom files to get it to build the appropriate jars,

Ah, yes, we'll need to update the profiles in the pom.xml files a bit like pull bytedeco/javacpp-presets#1092 for this to work. Are you saying you've already done this? Or should I do it?

@DevinTDHa

@Craigacp

I was able to resolve it by downgrading command line tools and Xcode to version 13.1.6! Goes through like before.

@karllessard
Collaborator

@DevinTDHa, you just saved my life. I've been trying to build TF 2.10 on my M1 for a while and was blocked by this malformed trie error as well. Looks like downgrading Xcode from 14.x to 13.x did the job!

Other than that, building 2.10 is pretty straightforward, and I've also fixed how the op exporter links to TF. I would like to update the TF-Java repo so that the latest snapshot can easily be built on M1 machines. @saudet, I've also updated JavaCPP to 1.5.8. Still, I'm facing some new problems that don't seem related to M1 this time but rather to 2.10. I'll take a look later, but any advice from you would be more than welcome:

/Users/klessard/Documents/Projects/ML/Sources/TensorFlow/tensorflow-java/tensorflow-core/tensorflow-core-api/target/native/org/tensorflow/internal/c_api/macosx-arm64/jnitensorflow.cpp:3380:47: error: no matching constructor for initialization of 'SpanAdapter<tensorflow::SourceLocation>'
    SpanAdapter< tensorflow::SourceLocation > radapter(ptr->GetSourceLocations());
                                              ^        ~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/klessard/Documents/Projects/ML/Sources/TensorFlow/tensorflow-java/tensorflow-core/tensorflow-core-api/target/classes/org/tensorflow/internal/c_api/include/tensorflow_adapters.h:19:28: note: candidate constructor (the implicit copy constructor) not viable: no known conversion from 'absl::Span<const SourceLocation>' to 'const SpanAdapter<tensorflow::SourceLocation>' for 1st argument
template<typename T> class SpanAdapter {
                           ^
/Users/klessard/Documents/Projects/ML/Sources/TensorFlow/tensorflow-java/tensorflow-core/tensorflow-core-api/target/classes/org/tensorflow/internal/c_api/include/tensorflow_adapters.h:19:28: note: candidate constructor (the implicit move constructor) not viable: no known conversion from 'absl::Span<const SourceLocation>' to 'SpanAdapter<tensorflow::SourceLocation>' for 1st argument
/Users/klessard/Documents/Projects/ML/Sources/TensorFlow/tensorflow-java/tensorflow-core/tensorflow-core-api/target/classes/org/tensorflow/internal/c_api/include/tensorflow_adapters.h:23:5: note: candidate constructor not viable: no known conversion from 'Span<const tensorflow::SourceLocation>' to 'const Span<tensorflow::SourceLocation>' for 1st argument
    SpanAdapter(const Span<T>& arr) : ptr(0), size(0), owner(0), arr2(arr), arr(arr2) { }
    ^
/Users/klessard/Documents/Projects/ML/Sources/TensorFlow/tensorflow-java/tensorflow-core/tensorflow-core-api/target/classes/org/tensorflow/internal/c_api/include/tensorflow_adapters.h:24:5: note: candidate constructor not viable: no known conversion from 'absl::Span<const SourceLocation>' to 'Span<tensorflow::SourceLocation> &' for 1st argument
    SpanAdapter(      Span<T>& arr) : ptr(0), size(0), owner(0), arr(arr) { }
    ^
/Users/klessard/Documents/Projects/ML/Sources/TensorFlow/tensorflow-java/tensorflow-core/tensorflow-core-api/target/classes/org/tensorflow/internal/c_api/include/tensorflow_adapters.h:25:5: note: candidate constructor not viable: no known conversion from 'absl::Span<const SourceLocation>' to 'const Span<tensorflow::SourceLocation> *' for 1st argument
    SpanAdapter(const Span<T>* arr) : ptr(0), size(0), owner(0), arr(*(Span<T>*)arr) { }
    ^
/Users/klessard/Documents/Projects/ML/Sources/TensorFlow/tensorflow-java/tensorflow-core/tensorflow-core-api/target/classes/org/tensorflow/internal/c_api/include/tensorflow_adapters.h:21:5: note: candidate constructor not viable: requires 3 arguments, but 1 was provided
    SpanAdapter(T const * ptr, typename Span<T>::size_type size, void* owner) : ptr((T*)ptr), size(size), owner(owner),

@saudet
Contributor

saudet commented Nov 27, 2022

Yeah, we should do version upgrades for TF Core separately. Put that in a branch and I'll take a look at it.

@saudet
Contributor

saudet commented Nov 27, 2022

Based on that error message, something like this should fix that one though:

.put(new Info("absl::Span<const tensorflow::SourceLocation>").annotations("@Span")
                                                             .valueTypes("@Cast(\"const tensorflow::SourceLocation*\") SourceLocation")
                                                             .pointerTypes("SourceLocation"))

@karllessard
Collaborator

Based on that error message, something like this should fix that one though:

.put(new Info("absl::Span<const tensorflow::SourceLocation>").annotations("@Span")
                                                             .valueTypes("@Cast(\"const tensorflow::SourceLocation*\") SourceLocation")
                                                             .pointerTypes("SourceLocation"))

Yep, that did the trick, thanks! Still, I'm now hitting issues when JavaCPP tries to load the libjnitensorflow library. JavaCPP fails with the following debug trace:

Debug: Locking /Users/klessard/.javacpp/cache before extracting
Debug: Extracting jar:file:/Users/klessard/Documents/Projects/ML/Sources/TensorFlow/tensorflow-java/tensorflow-core/tensorflow-core-api/target/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar!/org/tensorflow/internal/c_api/macosx-arm64/libtensorflow_framework.2.dylib
Debug: Loading /Users/klessard/.javacpp/cache/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar/org/tensorflow/internal/c_api/macosx-arm64/libtensorflow_framework.2.dylib
Debug: Locking /Users/klessard/.javacpp/cache before extracting
Debug: Extracting jar:file:/Users/klessard/Documents/Projects/ML/Sources/TensorFlow/tensorflow-java/tensorflow-core/tensorflow-core-api/target/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar!/org/tensorflow/internal/c_api/macosx-arm64/libtensorflow_cc.2.dylib
Debug: Loading /Users/klessard/.javacpp/cache/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar/org/tensorflow/internal/c_api/macosx-arm64/libtensorflow_cc.2.dylib
Debug: Locking /Users/klessard/.javacpp/cache before extracting
Debug: Extracting jar:file:/Users/klessard/Documents/Projects/ML/Sources/TensorFlow/tensorflow-java/tensorflow-core/tensorflow-core-api/target/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar!/org/tensorflow/internal/c_api/macosx-arm64/libjnitensorflow.dylib
Debug: Loading /Users/klessard/.javacpp/cache/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar/org/tensorflow/internal/c_api/macosx-arm64/libjnitensorflow.dylib
Debug: Failed to load /Users/klessard/.javacpp/cache/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar/org/tensorflow/internal/c_api/macosx-arm64/libjnitensorflow.dylib: java.lang.UnsatisfiedLinkError: Can't load library: /Users/klessard/.javacpp/cache/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar/org/tensorflow/internal/c_api/macosx-arm64/libjnitensorflow.dylib
Debug: Loading library jnitensorflow
Debug: Failed to load for jnitensorflow: java.lang.UnsatisfiedLinkError: no jnitensorflow in java.library.path: [/Users/klessard/Library/Java/Extensions, /Library/Java/Extensions, /Network/Library/Java/Extensions, /System/Library/Java/Extensions, /usr/lib/java, .]
[ERROR] Tests run: 17, Failures: 0, Errors: 15, Skipped: 0, Time elapsed: 4.74 s <<< FAILURE! - in org.tensorflow.TensorTest
[ERROR] org.tensorflow.TensorTest.createFromBufferWithNonNativeByteOrder  Time elapsed: 4.698 s  <<< ERROR!
java.lang.UnsatisfiedLinkError: no jnitensorflow in java.library.path: [/Users/klessard/Library/Java/Extensions, /Library/Java/Extensions, /Network/Library/Java/Extensions, /System/Library/Java/Extensions, /usr/lib/java, .]
	at org.tensorflow.TensorTest.createFromBufferWithNonNativeByteOrder(TensorTest.java:134)
Caused by: java.lang.UnsatisfiedLinkError: Can't load library: /Users/klessard/.javacpp/cache/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar/org/tensorflow/internal/c_api/macosx-arm64/libjnitensorflow.dylib
	at org.tensorflow.TensorTest.createFromBufferWithNonNativeByteOrder(TensorTest.java:134)

The JNI library is there and looks ok, here's the output of its otool -L:

/Users/klessard/.javacpp/cache/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar/org/tensorflow/internal/c_api/macosx-arm64/libjnitensorflow.dylib:
	libjnitensorflow.dylib (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libtensorflow_cc.2.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1300.23.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.100.3)

I'll try to debug that, but I've also pushed a temporary branch with my current code, in case you are interested in giving it a try, @saudet. Don't forget that this apparently only works with Xcode CL Tools 13.x, which might require a downgrade.
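
For narrowing down an UnsatisfiedLinkError like the one above, it can help to take JavaCPP out of the picture and load the extracted libraries one by one with System.load, printing whatever message the JVM gives for each. This is only a sketch of that idea, not what JavaCPP itself does; NativeLoadProbe is a made-up name, and the absolute paths are expected as arguments in dependency order (framework, then cc, then jnitensorflow).

```java
import java.io.File;

// Sketch: try to load native libraries by absolute path and report each outcome.
// NativeLoadProbe is a hypothetical helper, not part of JavaCPP or TF-Java.
final class NativeLoadProbe {

    /** Tries to load one native library and describes what happened. */
    static String probe(String path) {
        File lib = new File(path);
        if (!lib.isFile()) {
            return "missing: " + path;
        }
        if (!lib.canRead()) {
            return "unreadable: " + path;
        }
        try {
            System.load(lib.getAbsolutePath()); // System.load requires an absolute path
            return "loaded: " + path;
        } catch (UnsatisfiedLinkError e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Pass the libraries in the order their dependencies require.
        for (String path : args) {
            System.out.println(probe(path));
        }
    }
}
```

If the last library reports "missing" or "failed" here as well, the problem probably lies with the file or its dependencies rather than with how JavaCPP resolves them.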

@saudet
Contributor

saudet commented Nov 28, 2022

I don't have access to a Mac like that to check it out, but make sure with, for example, otool -hv -arch all that the library actually gets compiled for the right architecture.
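
Where otool is not at hand, the same architecture check can be approximated by reading the start of the file directly: a thin 64-bit Mach-O file begins with the magic 0xFEEDFACF followed by the CPU type, both stored little-endian on these platforms. A minimal sketch under that assumption (it does not handle universal/fat binaries, and MachOArch is a made-up name for illustration):

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: read the CPU type from the header of a thin 64-bit Mach-O file.
// MachOArch is a hypothetical helper; fat binaries are not handled.
final class MachOArch {
    static final int MH_MAGIC_64 = 0xFEEDFACF;
    static final int CPU_TYPE_X86_64 = 0x01000007;
    static final int CPU_TYPE_ARM64 = 0x0100000C;

    /** Names the CPU type found in the first 8 bytes of a Mach-O header. */
    static String cpuType(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 header bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt() != MH_MAGIC_64) {
            throw new IllegalArgumentException("not a thin little-endian 64-bit Mach-O file");
        }
        int cpu = buf.getInt();
        switch (cpu) {
            case CPU_TYPE_ARM64:
                return "arm64";
            case CPU_TYPE_X86_64:
                return "x86_64";
            default:
                return "unknown (0x" + Integer.toHexString(cpu) + ")";
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: MachOArch <path/to/lib.dylib>");
            return;
        }
        byte[] header = new byte[8];
        try (DataInputStream in = new DataInputStream(Files.newInputStream(Paths.get(args[0])))) {
            in.readFully(header); // only the first 8 bytes are needed
        }
        System.out.println(args[0] + ": " + cpuType(header));
    }
}
```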

@karllessard
Collaborator

karllessard commented Dec 14, 2022

OK, coming back to this. It seems it is, yes:

/Users/klessard/.javacpp/cache/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar/org/tensorflow/internal/c_api/macosx-arm64/libjnitensorflow.dylib:
Mach header
      magic  cputype cpusubtype  caps    filetype ncmds sizeofcmds      flags
MH_MAGIC_64    ARM64        ALL  0x00       DYLIB    20       1840   NOUNDEFS DYLDLINK TWOLEVEL WEAK_DEFINES BINDS_TO_WEAK NO_REEXPORTED_DYLIBS

@karllessard
Collaborator

karllessard commented Dec 14, 2022

Also if I understand JavaCPP debug traces correctly, TF cc and framework libraries have been loaded correctly:

Debug: Loading /Users/klessard/.javacpp/cache/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar/org/tensorflow/internal/c_api/macosx-arm64/libtensorflow_framework.2.dylib
Debug: Loading /Users/klessard/.javacpp/cache/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar/org/tensorflow/internal/c_api/macosx-arm64/libtensorflow_cc.2.dylib
Debug: Loading /Users/klessard/.javacpp/cache/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar/org/tensorflow/internal/c_api/macosx-arm64/libjnitensorflow.dylib
Debug: Failed to load /Users/klessard/.javacpp/cache/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar/org/tensorflow/internal/c_api/macosx-arm64/libjnitensorflow.dylib: java.lang.UnsatisfiedLinkError: Can't load library: /Users/klessard/.javacpp/cache/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar/org/tensorflow/internal/c_api/macosx-arm64/libjnitensorflow.dylib
Debug: Loading library jnitensorflow
Debug: Failed to load for jnitensorflow: java.lang.UnsatisfiedLinkError: no jnitensorflow in java.library.path: [/Users/klessard/Library/Java/Extensions, /Library/Java/Extensions, /Network/Library/Java/Extensions, /System/Library/Java/Extensions, /usr/lib/java, .]

java.lang.UnsatisfiedLinkError: no jnitensorflow in java.library.path: [/Users/klessard/Library/Java/Extensions, /Library/Java/Extensions, /Network/Library/Java/Extensions, /System/Library/Java/Extensions, /usr/lib/java, .]

I don't know if that's relevant but these libraries have an extra flag MH_HAS_TLV_DESCRIPTORS on them using the previous otool command:

/Users/klessard/.javacpp/cache/tensorflow-core-api-0.5.0-SNAPSHOT-macosx-arm64.jar/org/tensorflow/internal/c_api/macosx-arm64/libtensorflow_framework.2.dylib:
Mach header
      magic  cputype cpusubtype  caps    filetype ncmds sizeofcmds      flags
MH_MAGIC_64    ARM64        ALL  0x00       DYLIB    22       2504   NOUNDEFS DYLDLINK TWOLEVEL WEAK_DEFINES BINDS_TO_WEAK NO_REEXPORTED_DYLIBS MH_HAS_TLV_DESCRIPTORS

@karllessard
Collaborator

@saudet, any more guidance you can provide on this? I can confirm that the binaries built with Bazel (libtensorflow_framework and libtensorflow_cc) load successfully when calling Java's System.load(), but the one compiled by JavaCPP (libjnitensorflow) does not... I don't know, am I the only one having this issue? Here are the compile options used by JavaCPP, if that helps:

clang++ -I/Users/klessard/Documents/Projects/ML/Sources/TensorFlow/tensorflow-java/tensorflow-core/tensorflow-core-api \
-I/private/var/tmp/_bazel_klessard/a8a13bbdeba26dffed02fff40a88f8c2/external/org_tensorflow \
-I/private/var/tmp/_bazel_klessard/a8a13bbdeba26dffed02fff40a88f8c2/execroot/tensorflow_core_api/bazel-out/darwin_arm64-opt/bin/external/org_tensorflow \
-I/private/var/tmp/_bazel_klessard/a8a13bbdeba26dffed02fff40a88f8c2/external/com_google_absl \
-I/private/var/tmp/_bazel_klessard/a8a13bbdeba26dffed02fff40a88f8c2/external/eigen_archive \
-I/private/var/tmp/_bazel_klessard/a8a13bbdeba26dffed02fff40a88f8c2/external/com_google_protobuf/src \
-I/Users/klessard/Documents/Projects/ML/Sources/TensorFlow/tensorflow-java/tensorflow-core/tensorflow-core-api/target/classes/org/tensorflow/internal/c_api/include \
-I/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/include/darwin \
-I/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/include \
/Users/klessard/Documents/Projects/ML/Sources/TensorFlow/tensorflow-java/tensorflow-core/tensorflow-core-api/target/native/org/tensorflow/internal/c_api/macosx-arm64/jnitensorflow.cpp \
/Users/klessard/Documents/Projects/ML/Sources/TensorFlow/tensorflow-java/tensorflow-core/tensorflow-core-api/target/native/org/tensorflow/internal/c_api/macosx-arm64/jnijavacpp.cpp \
-O3 -std=c++14 -arch arm64 -Wl,-rpath,@loader_path/. -Wall -fPIC -pthread -dynamiclib -undefined dynamic_lookup \
-o libjnitensorflow.dylib \
-L/private/var/tmp/_bazel_klessard/a8a13bbdeba26dffed02fff40a88f8c2/execroot/tensorflow_core_api/bazel-out/darwin_arm64-opt/bin/external/org_tensorflow/tensorflow \
-Wl,-rpath,/private/var/tmp/_bazel_klessard/a8a13bbdeba26dffed02fff40a88f8c2/execroot/tensorflow_core_api/bazel-out/darwin_arm64-opt/bin/external/org_tensorflow/tensorflow \
-ltensorflow_cc
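The System.load() check described above can be reproduced outside the test suite with a small stand-alone program. This is only a sketch: the class LoadEach and its tryLoad method are hypothetical names, and it assumes the libraries are passed as absolute paths in dependency order (framework first, then cc, then the JNI library), matching the order in the JavaCPP debug trace.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of TF Java or JavaCPP.
final class LoadEach {
    /**
     * Calls System.load on each absolute path in the given order.
     * The result maps each path to "loaded" or to the linker error message.
     */
    static Map<String, String> tryLoad(String... absolutePaths) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String path : absolutePaths) {
            try {
                System.load(path);
                results.put(path, "loaded");
            } catch (UnsatisfiedLinkError e) {
                // Keep going so that every library gets reported.
                results.put(path, String.valueOf(e.getMessage()));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> entry : tryLoad(args).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Loading the libraries one at a time shows which one the dynamic loader rejects, although the JVM's UnsatisfiedLinkError message does not always say why.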

@karllessard
Collaborator

Or could it be related to that change? #394 (comment)

@saudet
Contributor

saudet commented Dec 15, 2022

I don't know if that's relevant but these libraries have an extra flag MH_HAS_TLV_DESCRIPTORS on them using the previous otool command:

That seems to get added when there are thread local variables:
https://opensource.apple.com/source/dyld/dyld-239.3/src/threadLocalVariables.c.auto.html

Or could it be related to that change? #394 (comment)

No, I don't see how that's related?

Since I don't have such a Mac here, that's hard for me to debug, but many users are using other presets for Mac on ARM with no problems at all: bytedeco/javacpp-presets#1069

@saudet
Contributor

saudet commented Dec 15, 2022

Where does "-undefined dynamic_lookup" get added? You might want to remove that and see if any important looking symbols are missing.

@karllessard
Collaborator

Actually, I have absolutely no clue where that argument is coming from. The config of the JavaCPP task triggering this build is simply this:

            <id>javacpp-compiler</id>
            <phase>process-classes</phase>
            <goals>
              <goal>build</goal>
            </goals>
            <configuration>
              <outputDirectory>${project.build.directory}/native/org/tensorflow/internal/c_api/${native.classifier}/</outputDirectory>
              <skip>${javacpp.compiler.skip}</skip>
              <classOrPackageName>org.tensorflow.internal.c_api.**</classOrPackageName>
              <copyLibs>true</copyLibs>
              <copyResources>true</copyResources>
            </configuration>

@saudet
Contributor

saudet commented Dec 15, 2022

Ah, I was looking in the wrong place. JavaCPP adds it automatically so that it behaves more like Linux by default:
https://github.com/bytedeco/javacpp/blob/master/src/main/resources/org/bytedeco/javacpp/properties/macosx-arm64.properties
Try to manually rerun that clang++ ... command by copy/pasting it without -undefined dynamic_lookup and see if it complains.

@karllessard
Collaborator

karllessard commented Dec 15, 2022

OK... so without -undefined dynamic_lookup the build fails at link time because it cannot find a lot of the TF symbols; that can definitely explain it. Inspecting the TF binaries, I notice that the missing symbols are those located in the libtensorflow_framework library. It looks like they are not resolved transitively by just linking to libtensorflow_cc, even though it depends on it:

libtensorflow_cc.2.10.0.dylib:
	@rpath/libtensorflow_cc.2.dylib (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libtensorflow_framework.2.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.100.3)
	/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 1858.112.0)
	/System/Library/Frameworks/SystemConfiguration.framework/Versions/A/SystemConfiguration (compatibility version 1.0.0, current version 1163.100.19)
	/System/Library/Frameworks/Security.framework/Versions/A/Security (compatibility version 1.0.0, current version 60158.100.133)
	/System/Library/Frameworks/IOKit.framework/Versions/A/IOKit (compatibility version 1.0.0, current version 275.0.0)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1300.23.0)

If I add -ltensorflow_framework to the clang++ command, only one symbol remains unresolved. I'll need to investigate this one further:

Undefined symbols for architecture arm64:
  "tensorflow::Node::RunForwardTypeInference()", referenced from:
      _Java_org_tensorflow_internal_c_1api_Node_RunForwardTypeInference in jnitensorflow-a42ccb.o
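The referencing symbol in that linker message follows the JNI short-name convention, which is why the package c_api shows up as c_1api (the leading underscore is the usual Mach-O prefix for C symbols). The sketch below reproduces that mapping so the symbol can be traced back to its Java method. The class JniSymbol and its methods are illustrative names, not an existing API, and only short names are handled (no argument signature suffix for overloaded natives).

```java
// Hypothetical helper for illustration only; implements the short-name rules of the JNI specification.
final class JniSymbol {
    /** Escapes a class or method name according to the JNI name-mangling rules. */
    static String mangle(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '.' || c == '/') {
                sb.append('_');            // package separator
            } else if (c == '_') {
                sb.append("_1");           // literal underscore
            } else if (c == ';') {
                sb.append("_2");
            } else if (c == '[') {
                sb.append("_3");
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')) {
                sb.append(c);
            } else {
                sb.append(String.format("_0%04x", (int) c)); // any other character as lowercase hex
            }
        }
        return sb.toString();
    }

    /** Builds the short native symbol name for a method of the given fully qualified class. */
    static String shortName(String className, String methodName) {
        return "Java_" + mangle(className) + "_" + mangle(methodName);
    }

    public static void main(String[] args) {
        System.out.println(shortName("org.tensorflow.internal.c_api.Node", "RunForwardTypeInference"));
    }
}
```

Reading the symbol this way suggests the unresolved reference comes from the generated JNI wrapper for a single method on the Node class, which matches the observation that only one symbol is still missing.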

@saudet
Contributor

saudet commented Dec 15, 2022

Doesn't look too important. Try moving "tensorflow_framework@.2" from the "preload" list to "link" and see if that works:
https://github.com/tensorflow/java/blob/master/tensorflow-core/tensorflow-core-api/src/main/java/org/tensorflow/internal/c_api/presets/tensorflow.java#L70

@mattdornfeld

@kgoderis am I correct in understanding you got the build to work on M1 with Bazel? Would you be able to share that code?

@karllessard
Collaborator

Oh, good news everyone!!

So @saudet, effectively moving libtensorflow_framework from preload to link solved the linkage problem during compilation (btw, can I safely comment out what we currently have in the preload list since we don't support MKL anymore?)

Now, I also had to solve the remaining missing symbol "that didn't look too important" by skipping tensorflow::Node::RunForwardTypeInference in the JavaCPP preset, and all tests are passing now!

... and that being said, I'm planning to push these changes to make the codebase in the TF Java repo compilable on M1 machines, and that also includes an upgrade to.... well, I was at 2.10.0 when I started but I should probably now upgrade to 2.11.0 😄

Just a reminder that what @DevinTDHa said previously is still valid: we need to downgrade to Xcode 13.x to get this to work until Apple fixes (hopefully) the "malformed trie" issue.

That's it, thanks everyone! I'll let you all know when that PR gets merged.

@saudet
Contributor

saudet commented Dec 16, 2022

So @saudet, effectively moving libtensorflow_framework from preload to link solved the linkage problem during compilation (btw, can I safely comment out what we currently have in the preload list since we don't support MKL anymore?)

Yes, those libraries are long deprecated, I don't think anyone uses them anymore.

@karllessard
Collaborator

Ok, I'll do that in a separate PR... one day 😅

@mattomatic

Hey @karllessard, as per usual, thank you for your work on this project. I am wondering if there has been progress on Apple Silicon support for this project?

Also, I noticed https://github.com/tensorflow/java/actions/runs/7013329954 the other day, which gave me a glimmer of hope that maybe we'd get Apple Silicon support AND a bump to TF 2.15, a potent and exciting combination of wins. However, it's unclear whether that was just testing for another purpose or testing with the intention of making a new release of this library. Hope you are well.

@Craigacp
Collaborator

Craigacp commented Dec 8, 2023

We're working on reducing the build process complexity and have added support for building macOS arm64 jars locally without running Bazel (so it's much simpler and likely to work without user intervention). It's not finished yet, as we're hitting issues with Windows which mean we either still need to run a full Bazel build on Windows or need to wait for Intel to fix the libtensorflow builds on Windows. We might merge it into master before then, but a release can't happen until we've figured out the Windows story. The branch where we're doing this work is running TF 2.15, and the next release will be on either that or 2.16, depending on the nature of the Windows fix.

@karllessard
Collaborator

Thanks @mattomatic , +1 to everything that Adam mentioned! I believe we'll get that merged soon (with or without Windows)

@mattomatic

Yay, that's awesome, thanks for the update. I'm not in the camp that requires Windows support, so it's a relief to know that things seem to work aside from that.

@mattifrind

Hey, I've come across the same issue and wasn't able to downgrade Xcode because of the new macOS version (downgrading worked before). I don't have any experience with JNI, so I had no idea what to try next, but I had luck changing the JDK from an aarch64 build to an x64 one (Temurin 21 without the aarch64 tag), and now it works! I don't know why, but it does. Maybe that's interesting for someone with more knowledge, or for anyone with the same issue who needs a solution fast.

@Craigacp
Collaborator

Using an x86 build of TF-Java on Apple Silicon is a bad idea. Apple didn't implement support for AVX vector instructions in Rosetta, so it will crash the JVM whenever it tries to use them. This can be hard to predict as some codepaths may have a fallback to non-vector instructions, but when you hit one it'll cause a SIGILL and take the whole JVM down.
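To notice early that an application has ended up on an x86_64 JVM on a Mac, the standard os.name and os.arch system properties can be checked at startup. This is a rough sketch with made-up names (ArchCheck, isX86JvmOnMac). It assumes the usual property values ("Mac OS X" for os.name, "x86_64" or "aarch64" for os.arch), and it cannot tell Rosetta on Apple Silicon apart from a real Intel Mac, so a positive result is only a hint.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of TF Java.
final class ArchCheck {
    /** Returns true when the given os.name / os.arch values describe an x86_64 JVM on macOS. */
    static boolean isX86JvmOnMac(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);
        boolean mac = os.contains("mac") || os.contains("darwin");
        boolean x86 = arch.equals("x86_64") || arch.equals("amd64");
        return mac && x86;
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        String osArch = System.getProperty("os.arch");
        if (isX86JvmOnMac(osName, osArch)) {
            // Could be an Intel Mac or an x86_64 JVM translated by Rosetta; the properties do not say which.
            System.out.println("x86_64 JVM on macOS: on Apple Silicon this would run under Rosetta.");
        } else {
            System.out.println("JVM reports " + osName + " / " + osArch);
        }
    }
}
```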

@Craigacp
Collaborator

TF-Java 1.0.0-rc.1 has Apple Silicon binaries - https://github.com/tensorflow/java/releases/tag/v1.0.0-rc.1
