7900 XTX Refuses to Run tensorflow-rocm Toy Example #1880
Comments
It's probably a packaging issue for Arch, try with … P.s. damn, that GPU must be a beast 💯
Unfortunately that doesn't seem to work. First it tries to remove the conflicting packages:
However, answering Y to both questions still results in a failure to install:
Are you sure these packages are even required, though? From what I understand, tensorflow-rocm does NOT use OpenCL at all. As a matter of fact, I upgraded from a 6900 XT, which was able to run tensorflow-rocm just fine with the exact same packages I currently have installed.
The package name is just that for historical reasons; it has nothing to do with OpenCL. The reason you get these conflict errors is that the packaging isn't handling conflicts properly. It's something I will try to fix soon, but it's not there yet, so you have to manually remove any rocm-arch package yourself if you want to try. P.s. I don't want to spam the ROCm issue tracker with Arch packaging comments, so if you are still interested in trying it, feel free to comment on the AUR page and we can continue the discussion there.
I just uninstalled all previous
I guess it's because your GPU is not yet supported in ROCm. I ran your example with my 5700 XT and it's working fine (although it didn't complete within 10 minutes and I had to cancel it). Maybe you can try to
That just makes it crash with an out-of-memory error, which is bogus for such a small example on a card with 24 GB of memory:
A 7900 XTX with ROCm would be awesome!!!! @Mushoz did you get it working now? I have the same use case.
The problem also occurs with the 7900 XT, and also with the Arch Linux ROCm packages from the AUR. Is there anything that can be done to make it run? Edit: I reproduced the same output with samples/0_Intro/bit_extract from https://github.com/ROCm-Developer-Tools/HIP.git as an easier minimal example.
So this means this problem only exists on Arch Linux? And not on Ubuntu or Debian?
Installing opencl-amd and opencl-amd-dev seems to work for me. @Mushoz did you install LLVM with version >= 15? (Arch still has 14.) You can also have a look at: … There it states what is needed:
@jannesklee I am running llvm-minimal-git. Everything is working as it should game-wise; it's just that ROCm is broken. Are you able to run the example in my first post just fine? And are you certain it's running on the GPU and not the CPU? Could you run the following Python script and show the output?
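The script itself wasn't preserved in this thread. A minimal device-visibility check along these lines would show whether TensorFlow sees the GPU (a sketch only: the `list_gpus` helper name is mine, and tensorflow-rocm is assumed to be installed):

```python
# Minimal GPU-visibility check for TensorFlow / tensorflow-rocm.
# The list_gpus() helper is illustrative, not quoted from the original thread.

def list_gpus():
    """Return TensorFlow's view of available GPUs, or [] if TF is absent."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    return tf.config.list_physical_devices("GPU")

if __name__ == "__main__":
    gpus = list_gpus()
    print(f"Num GPUs available: {len(gpus)}")
    for gpu in gpus:
        print(gpu)
```

On a working ROCm setup this should list at least one `PhysicalDevice` with `device_type='GPU'`; an empty list means TensorFlow fell back to the CPU.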
I got the same error when testing the minimal example shown above, as well as other samples, and it vanished when I used the other packages. When I check the usage with nvtop, it shows me that the dedicated graphics card is in use. Maybe the llvm-minimal-git version is not enough: at https://aur.archlinux.org/pkgbase/llvm-git, Lone_Wolf states that llvm-minimal-git focuses on providing what is needed for AUR mesa-git, and doesn't support cross-compiling or any bindings for external stuff like ocaml & python. Unfortunately I am currently unable to install tensorflow because I get compilation errors, but this is something else, I guess. I have tried to make it run, but without success.
@jannesklee no need to compile tensorflow. You can install |
This is my output. I do not completely understand it, to be honest:
Support for this GPU is not enabled on ROCm 5.4.1. Please await the 5.5.0 release announcement to check for support.
When can we expect a release of 5.5.0? Is there any date scheduled?
@jannesklee I have the same output. Unfortunately it specifically states that it is ignoring the GPU because it is unsupported. @saadrahim when can we expect 5.5.0 to release? CUDA is so much easier in this regard; it just works. For ROCm to be able to compete with CUDA, it really has to step up in terms of communication, so that users can rely on ROCm as they can on CUDA.
I'm a bit surprised that you're having trouble with ROCm 5.4.1 on the 7900 XTX, as that architecture is gfx1100 and most of the AMD-provided binaries for ROCm 5.4.1 contain gfx1100 code objects. It's not listed as officially supported in the GPU support table for ROCm 5.4, but I would have expected it would mostly work anyway. Is this problem specific to Tensorflow? e.g., do other libraries packaged by Arch work? A quick check might be to build and run Arch's
When you set … The RDNA 1 instruction sets are similar enough to the RDNA 2 instruction set that you can sometimes successfully run code that was compiled for RDNA 2 on an RDNA 1 GPU (as you are doing with your 5700 XT). However, this is not guaranteed to work: the instruction sets are not identical, and if the code you're running happens to use an RDNA 2 instruction that worked differently in RDNA 1 (or doesn't exist at all in RDNA 1), then your program may not function correctly. Similarly, the RDNA 3 instruction sets are different from the RDNA 2 instruction set. If you try to run code compiled for RDNA 2 on an RDNA 3 GPU using
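For context, the override being discussed is applied per process via the HSA_OVERRIDE_GFX_VERSION environment variable, which is a standard ROCm runtime knob. A minimal sketch (everything besides the variable name is illustrative):

```shell
# Spoof the RDNA 2 ISA (gfx1030) for a single process. Typical usage would be:
#   HSA_OVERRIDE_GFX_VERSION=10.3.0 python train.py   # train.py is a placeholder
# This is a compatibility hack, not officially supported, and may crash or
# miscompute if the real ISA differs from the one being spoofed.
# Here we just demonstrate that the override is scoped to one invocation:
HSA_OVERRIDE_GFX_VERSION=10.3.0 python3 -c 'import os; print(os.environ.get("HSA_OVERRIDE_GFX_VERSION"))'
```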
My assumption is also that it is a problem on the TensorFlow side. I tested the samples from https://github.com/ROCm-Developer-Tools/HIP mentioned above. Example bit_extract:
gives me
I can also see some activity with nvtop, but unfortunately I do not know exactly how to give more details here. Regarding your example, I unfortunately get a core dump when running ./test.sh:
@jannesklee I am not so sure. @saadrahim specifically stated that ROCm 5.5.0 is required for these cards to run TensorFlow. I am also not surprised you are able to run that HIP example; there is some preliminary support for the 7900 series, given that Blender can also use the HIP backend just fine: https://www.phoronix.com/review/rx7900-blender-opencl That has me thinking, though. It would be interesting to see if pytorch-rocm is able to run. I can see that there are Docker images available, and some tags are using ROCm 5.4.1. That would take packaging issues AND TensorFlow out of the equation, and would allow us to see if these cards are able to do any machine learning with the current ROCm stack. I might try this out tonight. Docker images in case you want to give it a shot: https://hub.docker.com/r/rocm/pytorch/tags
@jannesklee did it work?
@Mushoz pytorch-rocm doesn't appear to work, either. It can't find the GPU at all by default, and it segfaults with HSA_OVERRIDE_GFX_VERSION set.
@wsippel Ah, I just replied to you on the AUR, but only now realized you are active here as well. A week ago, changes for RDNA3 were merged for MIOpen: https://github.com/ROCmSoftwarePlatform/MIOpen/commits/develop See the 11th of January. Do you reckon we could get it to work by compiling MIOpen from source?
@wsippel @Mushoz I can confirm that, with some effort, a build of pytorch 1.13.1 against an AMD RX 7900 XTX with ROCm 5.4.2 works and is functional for my use case of running models. Rough outline for the build: use an Ubuntu (20.04/22.04) Docker image, as AMD provides ROCm repos for it, and install all required deps without the kernel module. See https://github.com/ROCmSoftwarePlatform/MIOpen/blob/develop/Dockerfile#L67; basically edit 5.3 to 5.4.2 and run all commands up to line 67. I also adapted the amdgpu install command to … Maybe you can build tensorflow via the instructions from https://www.tensorflow.org/install/source by adapting the build command to (in venv):
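The "edit 5.3 to 5.4.2" step described above can be sketched as a one-line sed over the Dockerfile. The stand-in file content below is illustrative only; the real Dockerfile lives in the MIOpen repository:

```shell
# Create a minimal stand-in for the MIOpen Dockerfile line being edited
# (illustrative only; in practice you would edit the repo's own Dockerfile).
printf 'ARG ROCM_VERSION=5.3\n' > Dockerfile
# Bump every 5.3 reference to 5.4.2, as described in the comment above.
sed -i 's/5\.3/5.4.2/g' Dockerfile
cat Dockerfile   # prints: ARG ROCM_VERSION=5.4.2
```

Note that `sed -i` with no backup suffix is GNU sed syntax, which matches the Ubuntu build environment described above.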
@Kardi5 Would you mind sharing the final Dockerfile that you used? I would love to try to replicate it for TensorFlow. Please leave in all the pytorch-specific things as well; I will try to do something similar for TensorFlow.
@Mushoz Sure, but I don't have a complete one myself right now. It was more interactive trial and error until all builds worked out. I hope to create a complete Dockerfile tonight/tomorrow based on the notes I took.
This issue also affects Gentoo when installing ROCm via Portage. Installing clinfo:
/var/tmp/portage/dev-libs/rocr-runtime-5.3.3/work/ROCR-Runtime-rocm-5.3.3/src/core/runtime/amd_gpu_agent.cpp:339: void rocr::AMD::GpuAgent::AssembleShader(const char *, rocr::AMD::GpuAgent::AssembleTarget, void *&, size_t &) const: Assertion `code_buf != NULL && "Code buffer allocation failed"' failed.
Aborted (core dumped)
I'm rather certain that this particular error is not related to TensorFlow or MIOpen, as I was able to repro the error above with only a basic installation of the ROCm OpenCL runtime and friends. The changes from ROCR 5.4.1 to 5.4.2 have not been downstreamed to GitHub yet, making it tricky to reproduce the workaround @Kardi5 proposed on other distros. I guess I'll try with 5.4.1 for now.
@Mushoz So far I could only create a rough draft of a complete Dockerfile; maybe you will find it useful nonetheless. Over at https://github.com/pytorch/pytorch/blob/master/.circleci/docker/ubuntu-rocm/Dockerfile there is a more complete example, even though it is much more complex. Their Magma build script (https://github.com/pytorch/pytorch/blob/master/.circleci/docker/common/install_rocm_magma.sh) might be the solution to my troubles, but I did not have time to look through it in more detail. There might still be errors besides Magma building after line …
Draft Torch + Torchvision Dockerfile
Build with … Run interactively with:
Can confirm that with
rocr-runtime-5.4.1 # 5.4.2 not yet available
roct-thunk-interface-5.4.2
rocm-opencl-runtime-5.4.2
rocm-comgr-5.4.2
rocm-device-libs-5.4.2
So this issue should originate from one of these libraries. The downside is that the Gentoo Clang 16 toolchain is not able to build Mesa due to an RTTI flag mismatch, so current usability may be limited. That's either a Gentoo or Mesa bug, though.
I was experimenting with various things recently, and it seems like Navi 3x performance still has a lot of room for improvement. You might see some improvements on Navi 3x, but most of them are for MI GPUs.
I'm looking forward to this: ROCm/flash-attention#1
Yeah. It seems to work on MI GPUs and the numbers look promising. I merged two branches in Composable Kernel yesterday for it to support Navi 31, but haven't got it to work so far. If you are interested and want to mess with it, are-we-gfx1100-yet/composable_kernel might be a
For anyone interested, I am posting a slightly updated version of this: EDIT: Oops! Wrong window!!! But I am leaving this here in case anyone wants it.
Are there still people who are waiting for 7900 XTX support? Though the performance is still a bit poor, tensorflow-upstream now runs when built on the latest ROCm release. I was looking into the status of ROCm support for the 7900 XTX, found a few issues opened by different people, and wanted to link them all to the issue I opened in the MIOpen repo. Though there has not been any confirmation from the developers, I think the performance issues are due to insufficient optimization of MIOpen.
Use Ubuntu 22.04 and ROCm 5.7.1.
@johnnynunez
It's running on a 7900 XT, I've checked it.
Did you compile tensorflow-upstream master or r2.14-enhanced-rocm?
I think, at the time I ran the benchmark, master was at 2.14. Now when I want to run the benchmark, I build r2.14, as I noticed some incompatibilities when running it from master. I haven't worked with my 7900 XTX for a while since I bought an MI100, so I may not remember the version number correctly. But the gist is that the master branch used to work, no longer does, and I had to pick a version.
I've updated the scripts to build with the latest master commit and ROCm 5.7.1, if you want. Secondly, modify this line; in my case, 32 GB, 16 cores, and 32 threads.
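The exact line isn't quoted above. As a hedged sketch, resource limits of that kind (32 GB of RAM, 16 parallel jobs) are typically passed to a TensorFlow Bazel build with flags like the following; the flag names are standard Bazel options, but the values and target here are assumptions, not quoted from the scripts:

```shell
# Hypothetical sketch: cap Bazel's resource usage for a TensorFlow build.
# --local_ram_resources takes MB; --jobs caps parallel build actions.
bazel build --jobs=16 --local_ram_resources=32768 \
  //tensorflow/tools/pip_package:build_pip_package
```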
BTW, r2.14-enhanced-rocm has a typo that prevents it from detecting the 7900 XTX properly. You need to fix tensorflow/compiler/xla/stream_executor/device_description.h line 184; it's missing a comma. I'm not sure what is going on, since it was fixed multiple times in the past, but it keeps coming back... I think the master branch is OK.
Yes, I knew it and fixed it.
This is not fixed in the recent 2.14 Dockerfile push. How can I manually compile this one file and correct it?
The code on the main development branch looks correct, and you can give the CI link in this comment a try, which contains the nightly
OK, that worked, but this is a long way from usable for newbies such as myself. I will keep redoing the steps I found in the ROCm docs instead of the AMD driver website, which is giving me DKMS errors.
Yes, unfortunately, we still do not have
In my case, I still get freezes with memory transfers, etc.
So, apart from the .whl nightly build recommendation, what can I, as an owner of a 6700 XT, do to get the faulty rocm/tensorflow:latest Docker image running? Is there a possibility to recompile TensorFlow within the Docker image after fixing the comma in the .h file?
I am working on a tutorial for my 7900 XTX and 6600 XT: https://github.com/vampireLibrarianMonk/amd-gpu-hello I do not yet cover the download and manual compilation/installation of tensorflow-upstream 2.15 and above, but it will borrow a lot from this post.
These two comments should help:
These steps might work (I don't have access to a machine for testing at the moment):
You may need to be
Unfortunately, the build fails on FileNotFoundError: [Errno 2] No such file or directory: '/usr/lib/llvm-17/bin/clang'
Review the ROCm enhanced branches. The latest usually isn't the best place to start.
I updated your repository, and I can compile pytorch and tensorflow with the latest versions.
Is anyone going to update these docs? https://github.com/ROCm/tensorflow-upstream/tree/develop-upstream/rocm_docs They seem pretty dated, and if not for the last comment I would be lost.
@Mushoz Has your issue been resolved? If so, please close the ticket. Thanks!
Hi @Mushoz, with ROCm 6.2.0 and TensorFlow 2.16.1, I was able to run the example on a 7900XTX without encountering any issues. Successful runs were done after installing TensorFlow using the prebuilt Docker image
and using the ROCm 6.2.0 + TF 2.16.1 wheels package
There have been many updates/fixes since the issues in this thread were posted. If anyone encounters further issues using TensorFlow with ROCm on the 7900 XTX, please open a new issue so we can investigate further. Thanks!
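The exact image tag isn't preserved above. For reference, a typical way to run AMD's prebuilt images looks like this; the rocm/tensorflow:latest tag is an assumption, while the device and group flags are the standard ROCm container requirements:

```shell
# Expose the ROCm kernel interfaces to the container and join the video group
# so the GPU is visible inside it; then ask TensorFlow what it can see.
docker run -it --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --group-add video \
  rocm/tensorflow:latest \
  python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```

A non-empty device list from that final command confirms the container can reach the GPU.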
Issue Type
Bug
Tensorflow Version
Tensorflow-rocm v2.11.0-3797-gfe65ef3bbcf 2.11.0
rocm Version
5.4.1
Custom Code
Yes
OS Platform and Distribution
Archlinux: Kernel 6.1.1
Python version
3.10
GPU model and memory
7900 XTX 24GB
Current Behaviour?
I am not entirely sure whether this is an upstream (ROCm) issue or one with tensorflow-rocm specifically, so I am reporting it to both repos. A toy example refuses to run and dumps core. I would have expected it to train successfully.
Standalone code to reproduce the issue
Relevant log output