Ensure the alternative ND manager can use GPUs #3138

david-sitsky · 2024-04-29T04:20:36Z

This makes a massive performance difference when using OnnxRuntime, which delegates to PyTorch for some NDArray operations. With this change, OnnxRuntime runs almost as fast as PyTorch. Without it, OnnxRuntime was over 3 times slower.

frankfliu · 2024-04-29T16:15:46Z

api/src/main/java/ai/djl/ndarray/BaseNDManager.java

+                alternativeManager = engine.newBaseManager(device);
+            } catch (RuntimeException | UnsatisfiedLinkError ignore) {
+                // Use the default device instead.
+                alternativeManager = engine.newBaseManager();


It's arguable whether using the same device is best. For simplicity, I prefer to use the default device directly.
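The fallback shown in the diff above can be illustrated with a minimal, self-contained sketch. It does not use DJL itself: `ManagerFallbackSketch`, `createOnDeviceOrDefault`, and the string "managers" are hypothetical stand-ins, and only the try/catch shape (attempt the requested device, fall back to the default on `RuntimeException` or `UnsatisfiedLinkError`) is taken from the diff.

```java
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for illustration only; this is not DJL's API.
final class ManagerFallbackSketch {

    // Tries to create a manager on the requested device; if creation fails,
    // falls back to whatever the default factory produces.
    static <M> M createOnDeviceOrDefault(
            String device, Function<String, M> onDevice, Supplier<M> onDefault) {
        try {
            return onDevice.apply(device);
        } catch (RuntimeException | UnsatisfiedLinkError ignore) {
            // Use the default device instead.
            return onDefault.get();
        }
    }

    public static void main(String[] args) {
        // Device creation succeeds: the manager stays on the requested device.
        String kept = ManagerFallbackSketch.<String>createOnDeviceOrDefault(
                "gpu(1)", d -> "manager@" + d, () -> "manager@default");
        System.out.println(kept);

        // Device creation fails: the default manager is used instead.
        String fallback = ManagerFallbackSketch.<String>createOnDeviceOrDefault(
                "gpu(1)",
                d -> { throw new IllegalStateException("device unavailable"); },
                () -> "manager@default");
        System.out.println(fallback);
    }
}
```

The trade-off under discussion is visible here: the try/catch keeps the worker's device when it can, at the cost of swallowing an exception, while always calling the default factory is simpler but ignores the requested device.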

The UnsatisfiedLinkError you hit is a caching issue (it only affects development time) and is not related to this change. You just need to clear your Rust JNI cache:

rm -rf ~/.djl.ai/tokenizer
gradle :extensions:tokenizers:clean

david-sitsky · 2024-04-30T01:52:02Z

@frankfliu - in my tests with my application (which does a lot of non-GPU work too), my original change ran in 6m 8s, but with the change to not keep the device, this went to 6m 25s. So it is a significant performance hit on multi-GPU systems, and for other workloads the discrepancy will be much higher.

I also saw GPU 0 having way higher (too high) utilisation compared to the other GPUs since this is the default device set for the alternative ND manager, for all my worker processes (16 on my test).
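The imbalance described above can be sketched with a small counting example. The 16 workers and GPU 0 as the default device come from this comment; the GPU count of 4 and the round-robin assignment of workers to GPUs are assumptions made purely for illustration, and `AlternativeDeviceLoadSketch` is a hypothetical name.

```java
import java.util.Arrays;

// Hypothetical illustration; GPU count and round-robin placement are assumptions.
final class AlternativeDeviceLoadSketch {

    // Counts how many workers place their alternative manager on each GPU.
    // keepWorkerDevice = true  -> the alternative manager uses the worker's own GPU.
    // keepWorkerDevice = false -> the alternative manager always uses GPU 0 (the default).
    static int[] alternativeManagersPerGpu(int workers, int gpus, boolean keepWorkerDevice) {
        int[] load = new int[gpus];
        for (int w = 0; w < workers; w++) {
            int workerGpu = w % gpus; // assumed round-robin worker placement
            int alternativeGpu = keepWorkerDevice ? workerGpu : 0;
            load[alternativeGpu]++;
        }
        return load;
    }

    public static void main(String[] args) {
        // 16 workers as in the test above; 4 GPUs is an assumed figure.
        System.out.println(Arrays.toString(alternativeManagersPerGpu(16, 4, true)));
        // prints [4, 4, 4, 4]
        System.out.println(Arrays.toString(alternativeManagersPerGpu(16, 4, false)));
        // prints [16, 0, 0, 0]
    }
}
```

Under these assumptions, using the default device sends every worker's delegated NDArray operations to GPU 0, which matches the uneven utilisation reported here.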

Can't we keep my original change without the catch for UnsatisfiedLinkError? Otherwise this puts OnnxRuntime further away from PyTorch again, which won't have this issue.

frankfliu · 2024-04-30T01:55:13Z

@david-sitsky

You are right, in the multi-GPU case it does make sense to keep the original device. I just don't like the try/catch. Let me think a bit, there should be a simple way.

david-sitsky · 2024-05-01T00:00:54Z

@frankfliu - any ideas with this? I am pretty keen to get that multi-GPU performance back. 😉

frankfliu · 2024-05-01T16:34:41Z

@david-sitsky

I created a PR: #3146

david-sitsky · 2024-05-02T00:16:11Z

Awesome - thank you. That looks like a better way of achieving it too.

david-sitsky requested review from zachgk, frankfliu and a team as code owners April 29, 2024 04:20

david-sitsky mentioned this pull request Apr 29, 2024

Regression: EngineException: default_program(22): error: extra text after expected end of number with DJL 0.28.0 + intfloat/multilingual-e5-small on machine with GPU #3089

Open

frankfliu reviewed Apr 29, 2024

David Sitsky and others added 2 commits April 29, 2024 16:10

Ensure the alternative ND manager can use GPUs

8f900f0

This makes a massive performance difference when using OnnxRuntime, which delegates to PyTorch for some NDArray operations.

use default device for alternative manager

3cf7bf3

frankfliu force-pushed the alternative-engine-use-gpu branch from 012dd8d to 3cf7bf3 Compare April 29, 2024 23:11

zachgk approved these changes Apr 29, 2024

frankfliu merged commit 399b310 into deepjavalibrary:master Apr 29, 2024
5 checks passed
