Add ROCm support #756

Closed

@arlo-phoenix arlo-phoenix commented Sep 8, 2023

Edit: See #756 (comment) for current status of ROCm support

Inspired by the llama.cpp ROCm port, I decided to try a similar approach for bitsandbytes: I worked through the hipified CUDA functions/classes and simply redefined them with their HIP equivalents. This only happens if BNB_USE_HIP is set, so merging this shouldn't affect the CUDA code at all. It's also easier to maintain than keeping a parallel HIP code base alive.

This PR adds a hip target to the Makefile and works with the most recent version (0.42.0) on ROCm 5.6+ (6.0 included). To install, run:

# Your ROCM_TARGET can be found with rocminfo | grep gfx
ROCM_TARGET=gfx1030 make hip
pip install .

It won't pass all tests, as some are igemm- or CUDA-specific, but all optimizers work in both 8-bit and 32-bit. I have also used this a lot for 4-bit Llama inference, which works as well. Besides those, the failing tests are test_autograd.py and anything with double_quant in its name; I assume these also involve matrix multiplication and are expected to fail.

Besides that, igemm / matrix core support for the more recent AMD GPUs is still impossible because of missing instructions in hipBLASLt. There is also an official fork that tries to enable it, but it doesn't seem finished yet. If you want to use the official fork without hipBLASLt, @IMbackK provided a patch for it which should work on all ROCm-supported GPUs.

I'm making this a draft for now, as it is still not well tested and I haven't really updated the documentation yet. From an actual code standpoint not much will change on my side as I only own a gfx1030 GPU and thus can't test igemm support.

Closes #47, closes #107, closes #681

@AloneLiberty

https://github.com/TimDettmers/bitsandbytes/blob/a13af4542f5a163d8e9eacb0b8b2a5b26d9e2b15/Makefile#L19-L23

You have a logic bug there: if ROCM_HOME is empty, it skips the ROCM_TARGET check and tries to compile anyway (and fails later due to missing libraries/headers).

@amd-zoybai

amd-zoybai commented Sep 26, 2023

Tested on Ubuntu 22.04.3 with RX 7900 XTX and passed with the following command:
ROCM_HOME=/opt/rocm ROCM_TARGET=gfx1100 make hip

Without ROCM_HOME make cannot find proper headers to build.

@CorvetteCole

CorvetteCole commented Oct 5, 2023

I get this error: fatal error: 'hipblaslt/hipblaslt.h' file not found which I can't figure out

@CorvetteCole

I can compile using the official ROCm PyTorch Docker image and then pip install the result in my original desktop environment, and it works just fine, though.

@arlo-phoenix
Author

> https://github.com/TimDettmers/bitsandbytes/blob/a13af4542f5a163d8e9eacb0b8b2a5b26d9e2b15/Makefile#L19-L23
>
> You have a logic bug there: if ROCM_HOME is empty, it skips the ROCM_TARGET check and tries to compile anyway (and fails later due to missing libraries/headers).

I'm not really used to working with Makefiles directly. If it's empty, it skips that check, since it doesn't matter which target you set: it won't be able to find libraries/binaries with ROCM_HOME empty anyway. I just tried to roughly copy the CUDA checks, since those also take only one parameter. I agree it should throw an error; I'll see what I can do, thanks.

> Tested on Ubuntu 22.04.3 with RX 7900 XTX and passed with the following command: ROCM_HOME=/opt/rocm ROCM_TARGET=gfx1100 make hip
>
> Without ROCM_HOME make cannot find proper headers to build.

If ROCM_HOME isn't already set, the Makefile tries to find it automatically with the command which hipcc | rev | cut -d'/' -f4- | rev. Could you check which hipcc? For me (in the official Docker image) it's at /opt/rocm/hip/bin/hipcc. I assume it probably finds the other hipcc link at /opt/rocm/bin/hipcc, which would require -f3- in that cut. An alternative to using that command could be to just default to /opt/rocm, since that seems to be the usual spot.
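The lookup order discussed here (explicit ROCM_HOME, then the hipcc location, then /opt/rocm, otherwise an error) can be sketched in Python. This is only an illustration under stated assumptions, not what the Makefile does: the function names are hypothetical, and only the two hipcc layouts mentioned above are recognized.

```python
import os
import shutil
from pathlib import Path
from typing import Callable, Mapping, Optional

DEFAULT_ROCM_HOME = Path("/opt/rocm")  # the "usual spot" mentioned above


def rocm_home_from_hipcc(hipcc_path: str) -> Optional[Path]:
    """Return the ROCm root for a hipcc path, or None if the layout is unknown."""
    parts = Path(hipcc_path).parts
    # <root>/hip/bin/hipcc: strip three components (what `cut -f4-` does)
    if len(parts) > 3 and parts[-3:-1] == ("hip", "bin"):
        return Path(*parts[:-3])
    # <root>/bin/hipcc: strip two components (the `-f3-` case)
    if len(parts) > 2 and parts[-2] == "bin":
        return Path(*parts[:-2])
    return None


def resolve_rocm_home(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Path:
    """Pick ROCM_HOME from the environment, hipcc, or the default; fail early otherwise."""
    env = os.environ if env is None else env
    if env.get("ROCM_HOME"):
        return Path(env["ROCM_HOME"])
    hipcc = which("hipcc")
    if hipcc:
        home = rocm_home_from_hipcc(hipcc)
        if home is not None:
            return home
    if DEFAULT_ROCM_HOME.is_dir():
        return DEFAULT_ROCM_HOME
    # Raising here avoids the "compiles anyway and fails later" behaviour.
    raise RuntimeError("ROCM_HOME is not set and no ROCm installation was found")
```

Note that the sketch does not resolve symlinks, so a hipcc link in an unrelated directory such as /usr/bin would give the wrong root.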

> I get this error: fatal error: 'hipblaslt/hipblaslt.h' file not found which I can't figure out

hipBLASLt has been available since ROCm 5.6 (which is why I named my fork that). If you are below that, this fork won't work and I'd recommend one of the other forks that can be found in the linked issues (no 4-bit support though). I don't know if ROCm 5.7 changed anything. From what I read, 6.0 is going to be the release where things aren't backwards compatible anymore, so I don't think that should already be the case.

@person4268

FWIW, Arch Linux's ROCM doesn't seem to be distributing hipblaslt (yet), so I am also getting missing hipblaslt.h errors on 5.6 there. It doesn't seem to be a straightforward (for me) thing to build, either.

@arlo-phoenix
Author

> FWIW, Arch Linux's ROCM doesn't seem to be distributing hipblaslt (yet), so I am also getting missing hipblaslt.h errors on 5.6 there. It doesn't seem to be a straightforward (for me) thing to build, either.

You don't need the actual library, just the headers, as igemm is still disabled for this anyway (the headers were needed for this to compile; I could probably have just made some placeholder defines so it compiles, but since I had the header I decided not to). The headers can be found under https://github.com/ROCmSoftwarePlatform/hipBLASLt/tree/develop/library/include and you could just put those in /opt/rocm/include. I don't know if they have any other dependencies, but other than that it should work.

@person4268

person4268 commented Oct 9, 2023

That worked (at least as far as being able to execute python -m bitsandbytes; we'll see if it works once the model I'm trying to use finishes downloading), thanks. Do note, though, that I had to create hipblaslt-export.h and hipblaslt-version.h by hand, as those are generated at build time. (I just copied the ones from hipblas and pretty much did s/HIPBLAS/HIPBLASLT/g in both upper and lower case.)
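The substitution described here could be scripted roughly as follows. This is a minimal sketch, assuming the hipblas export/version headers are available as input files; the helper names are hypothetical, and the result is only a placeholder that lets the build find the headers, not the headers a real hipBLASLt build would generate.

```python
from pathlib import Path


def hipblas_to_hipblaslt(text: str) -> str:
    """Apply s/HIPBLAS/HIPBLASLT/g and s/hipblas/hipblaslt/g to header text."""
    return text.replace("HIPBLAS", "HIPBLASLT").replace("hipblas", "hipblaslt")


def write_placeholder_header(src: Path, dst: Path) -> None:
    """Copy a hipblas header to dst with the names rewritten for hipblaslt."""
    dst.write_text(hipblas_to_hipblaslt(src.read_text()))
```

The replacement is not idempotent: running it on text that already contains hipblaslt would produce hipblasltlt, so it should only be applied to the original hipblas headers.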

@ccbadd

ccbadd commented Oct 31, 2023

How do you deal with multiple GPUs with different targets? I have both MI100s (gfx908) and W6800s (gfx1030) in my machine. Can I use ROCM_TARGET=gfx908;gfx1030?

@bog-dan-ro

Any joy with this patch?

@fakerybakery

Hi, is there any update on this? Would love to have this merged!

@Wintoplay

@arlo-phoenix pls add support for ROCM 5.7

@arlo-phoenix
Author

arlo-phoenix commented Dec 16, 2023

> Hi, is there any update on this? Would love to have this merged!

Since there now seems to be an official plan to extend support to multiple platforms / hardware targets, this will probably have to be adjusted if it's going to be merged. I personally want to wait for ROCm 6.0 since that might break stuff. And even if this gets merged, it would just be basic support without matrix cores, as I don't have a more recent GPU. And even if I did, the hipBLASLt project (at least according to their docs) only officially supports gfx90a and gfx94x (I assume it's going to expand to MI300 as well, but I haven't seen anything about RDNA 3). The previously missing instructions from hipBLASLt are now there (even if they don't respond to issues .-., had to grep), so it's not impossible to add, but as I said I can't test it, so it'd be nice if someone else did that.

> @arlo-phoenix pls add support for ROCM 5.7

While ROCm is annoying with having to compile for each arch, nothing should've broken between ROCm 5.6 and 5.7 (will likely happen with 6.0).

Since many people still have problems building this:
I added an /opt/rocm fallback on the rocm branch for this PR. I also updated the main branch of my fork, which now includes a hipblaslt-compat header, so you don't actually need hipBLASLt, as a lot of distros don't distribute it yet (I just tried to build this on an Arch system). The only reason I named my fork so specifically was that header, so these build instructions should work even before ROCm 5.6 and on a lot more systems:

git clone https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6
cd bitsandbytes-rocm-5.6

ROCM_TARGET=gfx1030 make hip
pip install .

At least on my system you don't even need to set ROCM_TARGET, as it will then just build for all targets. I still recommend setting it for a faster build. To find it, use

/opt/rocm/bin/rocminfo | grep gfx

or just look for your GPU under https://www.llvm.org/docs/AMDGPUUsage.html#processors
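When the target needs to be picked up by a script rather than by eye, the grep can be mirrored in Python. A minimal sketch: `gfx_targets` is a hypothetical helper, and the pattern assumes target names of the form gfx followed by hexadecimal digits, as in the LLVM processor table linked above.

```python
import re
from typing import List


def gfx_targets(rocminfo_output: str) -> List[str]:
    """Return the unique gfx target names in rocminfo output, in order of appearance."""
    seen: List[str] = []
    for name in re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_output):
        if name not in seen:
            seen.append(name)
    return seen
```

The text to parse could come from `subprocess.run(["/opt/rocm/bin/rocminfo"], capture_output=True, text=True, check=True).stdout`; on a machine with several different GPUs the function returns one entry per distinct target.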

Edit: Just noticed 6.0 was already out .-., I'll update this once the official docker images are updated as well

@Wintoplay

@arlo-phoenix rocm 6.0 docker has arrived https://hub.docker.com/r/rocm/pytorch/tags

@arlo-phoenix
Author

> @arlo-phoenix rocm 6.0 docker has arrived https://hub.docker.com/r/rocm/pytorch/tags

Thanks for the info! I've only tested the basic stuff and will probably only test further after the holidays, but it still compiles, 4-bit works, and all optimizers work as well (at least according to pytest). So, summed up, ROCm 6.0 breaks nothing in this after all.

@Titus-von-Koeller, I only skimmed through #898, but from what I see the idea is to add the ability to have different backends, with one of them being the current implementation, now under a CudaBackend. From my perspective this won't change this PR much (I'd only have to move some checks), since there isn't really a need for a separate backend for HIP; AMD GPUs should just use the CudaBackend as well.

  • One improvement could be moving the defines to a separate header hip-compat.h so they're better separated.
  • The Makefile definitely still needs work; as already said, I've never worked with Makefiles directly.
  • If there is a move towards CMake for Windows support (I think there are several PRs), I could try to make this work with CMake. I'm more experienced with that, so it should be easier to add good integration that doesn't interfere with CUDA compilation.

@IMbackK

IMbackK commented Dec 22, 2023

One thing I should note about this PR is that since it does not support wave64, it should really refuse to compile for those targets, or assert at run time; right now it produces incorrect results.

All AMD AI/compute-focused GPUs are wave64-only (i.e. MI25, MI50, MI100, MI210, all the way to the latest MI300). It's only consumer GPUs newer than the Radeon VII that can do both wave64 and wave32, so this PR excludes the very GPUs that are best suited to ML.
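A build script or import-time check could refuse wave64 targets along these lines. This is a minimal sketch with hypothetical helper names. The rule it encodes (gfx10 and later default to wave32, gfx9 and earlier are wave64-only) is my reading of the statement above combined with the LLVM gfx naming scheme, so treat it as an assumption rather than an authoritative table.

```python
def gfx_major(target: str) -> int:
    """Major version of a target name such as 'gfx1030' (10) or 'gfx90a' (9)."""
    arch = target.split(":", 1)[0]  # drop feature suffixes such as ':xnack-'
    if not arch.startswith("gfx") or len(arch) < 6:
        raise ValueError(f"not a gfx target: {target!r}")
    # The last two characters are the minor version and stepping (e.g. '0a').
    return int(arch[3:-2])


def default_wavefront_size(target: str) -> int:
    """Assumed default wavefront size: 32 for gfx10 and later, 64 before that."""
    return 32 if gfx_major(target) >= 10 else 64


def require_wave32(target: str) -> None:
    """Fail loudly instead of silently producing incorrect results on wave64."""
    if default_wavefront_size(target) != 32:
        raise RuntimeError(f"{target} is wave64-only, which this port does not support")
```

Failing at build or import time is the conservative choice here: an error message is easier to act on than wrong quantization results.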

@purefire

> Edit: Just noticed 6.0 was already out .-., I'll update this once the official docker images are updated as well

Hello, any news about 6.0 update?

@Iron-Bound

Seconding this ^
Mi300 and H100 are both battling at the moment, so would like to use my 7900xtx!

@arlo-phoenix
Author

@purefire

> Hello, any news about 6.0 update?

Status quo is still

> Thanks for the info! Only tested the basic stuff and will probably only further test after the holidays, but it still compiles, 4 bit works and all optimizers also all work (at least according to pytest). So summed up ROCm 6.0 breaks nothing in this after all.

I'm not finetuning anything at the moment, but since it still compiles and the tests succeed, it should still work as expected. The only things I expected to break were the Makefile, or some includes or defines becoming deprecated, but I didn't see anything.


@Iron-Bound

> Seconding this ^ Mi300 and H100 are both battling at the moment, so would like to use my 7900xtx!

The 7900 XTX should work; it's wavefront size 64 that doesn't work, and the 7900 XTX has the normal wavefront size of 32. It would not become a battle here, though, as this doesn't support hipBLASLt yet, meaning no matrix cores are used, so the 7900 XTX / MI300 wouldn't perform well at all. This isn't something I can implement/test myself, so someone else will need to do that. The changes shouldn't be too large: just a small Python check for whether the HIP device supports hipBLASLt where gemm support is checked, and adjusting the Makefile to actually use the library.
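The "small Python check" could look roughly like this. It is a sketch, not bitsandbytes code: `supports_hipblaslt` is a hypothetical helper, and the target list is only what this thread reports from the hipBLASLt docs at the time (gfx90a and gfx94x), so it would need to be kept in sync with the library.

```python
# Targets reported in this thread as officially supported by hipBLASLt.
HIPBLASLT_TARGET_PREFIXES = ("gfx90a", "gfx94")


def supports_hipblaslt(target: str) -> bool:
    """Return True if the gfx target is on the assumed hipBLASLt support list."""
    arch = target.split(":", 1)[0]  # drop feature suffixes such as ':xnack-'
    return arch.startswith(HIPBLASLT_TARGET_PREFIXES)
```

A caller would use the result to decide between the igemm path and the existing path without igemm.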


@IMbackK

> One thing I should note about this PR is that since it does not support wave64, it should really refuse to compile for those targets, or assert at run time; right now it produces incorrect results.
>
> All AMD AI/compute-focused GPUs are wave64-only (i.e. MI25, MI50, MI100, MI210, all the way to the latest MI300). It's only consumer GPUs newer than the Radeon VII that can do both wave64 and wave32, so this PR excludes the very GPUs that are best suited to ML.

That's interesting. I didn't find anything last time because I didn't bother looking into large architecture description PDFs just to find a wavefront size, but you are right. Then it's a bit more important; I assumed only CDNA1 was limited to wavefront size 64.

I'll try to think of a good way to include them anyway. The wavefront size override won't actually affect how everything is executed; it's just that some compile-time asserts are not triggered anymore (from what I remember). The define override should still be removed, or only be applied if something like FORCE_WAVEFRONT32 is set. I'll try to see if I can just trigger a trap in device code for the unsupported functions so it compiles. If that works, it would be enough to add a one-time warning with a fallback in the affected block size functions to use the next larger block size, or throw an exception if that doesn't work. The problem with the second solution is that most projects use the smallest BLOCK_SIZE for 4-bit stuff, which means that e.g. https://github.com/TimDettmers/bitsandbytes/blob/f63abb5a0d0bc971d28972ba890a9e59596caac4/csrc/kernels.cu#L3976 for FP4 is called, which doesn't work with the larger wavefront size of 64. So that fallback (or an exception, if it's impossible with the tensor size) would need to go here: https://github.com/TimDettmers/bitsandbytes/blob/f63abb5a0d0bc971d28972ba890a9e59596caac4/bitsandbytes/functional.py#L690

I'm not experienced with that at all, so I have no idea if that even works, but if just using the next larger BLOCK_SIZE doesn't work, we can always throw an exception so that nothing that doesn't work gets called. The same would need to be done for dequantize.
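The fallback idea can be sketched in Python as follows. Everything here is hypothetical: `effective_blocksize` is not a bitsandbytes function, the list of block sizes is an assumption about what the quantize functions accept, and `fallback_ok` stands in for whatever check decides whether the larger block size is usable for a given tensor.

```python
import warnings

# Assumed block sizes for blockwise quantization; check functional.py before relying on this.
SUPPORTED_BLOCKSIZES = (64, 128, 256, 512, 1024, 2048, 4096)


def effective_blocksize(requested: int, wavefront_size: int, fallback_ok: bool = True) -> int:
    """Return a block size usable on the device, bumping the smallest one on wave64."""
    if requested not in SUPPORTED_BLOCKSIZES:
        raise ValueError(f"unsupported blocksize {requested}")
    # Only the smallest block size is described above as broken on wave64.
    if wavefront_size < 64 or requested > SUPPORTED_BLOCKSIZES[0]:
        return requested
    if not fallback_ok:
        raise RuntimeError(f"blocksize {requested} is not supported with wavefront size 64")
    larger = SUPPORTED_BLOCKSIZES[SUPPORTED_BLOCKSIZES.index(requested) + 1]
    # With Python's default filters this warning is shown once per call site.
    warnings.warn(f"blocksize {requested} unsupported on wave64, using {larger} instead")
    return larger
```

Silently changing the block size alters the quantization layout, so quantize and dequantize would both have to go through the same function for the results to stay consistent.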


@arlo-phoenix
Author

arlo-phoenix commented Jan 12, 2024

@IMbackK my latest commit tries to adjust the kQuantizeBlockwise function to work with wavefront size 64 (it just uses only 1 load, etc. per thread instead of 2). I didn't really test much, but tests succeeded for FP16 (it might cause issues with nf4, fp4). I believe test_generation.py uses the 64 BLOCK_SIZE variant as well, which would trigger the adjusted code, and those tests succeeded. Even if my code doesn't work, I believe it shouldn't be too hard to edit that function to work under wavefront/warp size 64 anyway. This makes integration a lot easier and wouldn't require a bunch of ifdefs or guards in Python code.

Sadly this is very hard to test, as without igemm support most tests don't just fail but crash. I can run them one by one, but this is a bit annoying. I'll see if there's a pytest setting to just continue or if this is easily adjustable.
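One way to keep a crashing test from taking down the whole run is to start a separate pytest process per test file. A minimal sketch with hypothetical helper names; it only relies on `python -m pytest` and the standard library:

```python
import subprocess
import sys
from typing import Callable, Dict, Iterable, List


def pytest_command(test_id: str) -> List[str]:
    """Command line that runs a single test file or test id in its own interpreter."""
    return [sys.executable, "-m", "pytest", "-q", test_id]


def run_isolated(
    test_ids: Iterable[str],
    run: Callable[[List[str]], int] = subprocess.call,
) -> Dict[str, int]:
    """Run each test id in a fresh process and collect the exit codes."""
    return {test_id: run(pytest_command(test_id)) for test_id in test_ids}
```

For example, `run_isolated(["tests/test_autograd.py", "tests/test_generation.py"])` returns one exit code per file. With `subprocess.call`, a negative code means the process was killed by a signal, which distinguishes a crash from an ordinary test failure.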

@Iron-Bound I found https://github.com/ROCmSoftwarePlatform/bitsandbytes/tree/rocm_enabled linked in some Hugging Face docs. That fork is almost as old as mine; I wonder why they never advertised it. They link hipBLASLt and, from what I saw in the Makefile, don't disable it: the output library is called libbitsandbytes_hip_nohipblaslt, but I didn't actually see it disabled, so it might be worth giving it a try. I also don't think tinygrad is really going to help. I have no clue how hipBLASLt works internally and don't plan on researching it to replicate functions, so the only thing I can do is use the functions they provide to replace the cublasLt functions.

@IMbackK

IMbackK commented Jan 12, 2024

@arlo-phoenix I can confirm that kQuantizeBlockwise works on gfx90* now and that this solution is sane. However, overall I am leaning towards AMD's solution of hipifying once and keeping the HIP code separate, as this would allow further optimization for GCN/CDNA without a mess of ifdefs.

The AMD repo is also interesting. The reason they haven't published it widely is presumably that it is still quite unfinished, as evidenced by the myriad of disabled tests and the Makefile jank. That said, after patching the Makefile as below to disable hipBLASLt usage, AMD's version also works (tested on gfx1030, gfx900, gfx906 and gfx908) for the non-AMD-disabled tests I tried, the transformers bnb integration, and 8-bit AdamW.

diff --git a/Makefile b/Makefile
index 1e12c8e..4d473b8 100644
--- a/Makefile
+++ b/Makefile
@@ -31,7 +31,7 @@ INCLUDE :=  -I $(CUDA_HOME)/include -I $(ROOT_DIR)/csrc -I $(CONDA_PREFIX)/inclu
 LIB := -L $(CUDA_HOME)/lib64 -lcudart -lcublas -lcublasLt -lcusparse -L $(CONDA_PREFIX)/lib
 
 INCLUDE_ROCM := -I $(ROCM_HOME)/include -I $(ROOT_DIR)/csrc -I $(CONDA_PREFIX)/include -I $(ROOT_DIR)/include
-LIB_ROCM := -L $(ROCM_HOME)/lib -lhipblas -lhipblaslt -lhiprand -lhipsparse -L $(CONDA_PREFIX)/lib
+LIB_ROCM := -L $(ROCM_HOME)/lib -lhipblas -lhiprand -lhipsparse -L $(CONDA_PREFIX)/lib
 
 # NVIDIA NVCC compilation flags
 COMPUTE_CAPABILITY += -gencode arch=compute_50,code=sm_50 # Maxwell
@@ -107,9 +107,9 @@ cuda12x: $(BUILD_DIR) env
        $(GPP) -std=c++14 -DBUILD_CUDA -shared -fPIC $(INCLUDE) $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o $(BUILD_DIR)/link.o $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_cuda$(CUDA_VERSION).so $(LIB)
 
 hip: $(BUILD_DIR) env
-       $(HIPCC) -std=c++14 -fPIC -c $(INCLUDE_ROCM) $(LIB_ROCM) $(CSRC)/ops.hip -o $(BUILD_DIR)/ops.o
-       $(HIPCC) -std=c++14 -fPIC -c $(INCLUDE_ROCM) $(LIB_ROCM) $(CSRC)/kernels.hip -o $(BUILD_DIR)/kernels.o
-       $(GPP) -std=c++14 -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_HCC__ -DBUILD_HIP -shared -fPIC $(INCLUDE_ROCM) $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_hip_nohipblaslt.so $(LIB_ROCM)
+       $(HIPCC) -std=c++14 -fPIC -c $(INCLUDE_ROCM) $(LIB_ROCM) -DNO_HIPBLASLT $(CSRC)/ops.hip -o $(BUILD_DIR)/ops.o
+       $(HIPCC) -std=c++14 -fPIC -c $(INCLUDE_ROCM) $(LIB_ROCM) -DNO_HIPBLASLT $(CSRC)/kernels.hip -o $(BUILD_DIR)/kernels.o
+       $(GPP) -std=c++14 -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_HCC__ -DNO_HIPBLASLT -DBUILD_HIP -shared -fPIC $(INCLUDE_ROCM) $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_hip_nohipblaslt.so $(LIB_ROCM)
 
 cpuonly: $(BUILD_DIR) env
        $(GPP) -std=c++14 -shared -fPIC -I $(ROOT_DIR)/csrc -I $(ROOT_DIR)/include $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_cpu.so

arlo-phoenix and others added 7 commits January 22, 2024 16:57
ROCM_TARGET=gfx1030 make hip
Uses define BITS_AND_BYTES_USE_ROCM to redefine cuda functions
to ROCm equivalent

credit to previous ports:

Co-authored-by: broncotc <i@broncotc.net>
Co-authored-by: agrocylo <130291676+agrocylo@users.noreply.github.com>
disables igemm for now and adds path to compiled library
libbitsandbytes_hip_nohipblaslt
the unrolls already somehow worked correctly before, but they shouldn't have.

@ehartford

I wanted to try this out, so I installed it (and also https://github.com/ROCmSoftwarePlatform/triton, which it needs) and I got:

cannot import name 'cdiv' from 'triton'

If you happen to recognize this error, could you please set me straight?

@arlo-phoenix
Author

arlo-phoenix commented Jan 28, 2024

> I wanted to try this out, so I installed it (and also https://github.com/ROCmSoftwarePlatform/triton, which it needs) and I got:
>
> cannot import name 'cdiv' from 'triton'
>
> If you happen to recognize this error, could you please set me straight?

Sorry, I never needed to use triton for what I've done so far. I assume you are on a newer GPU with matrix core support, then. If so, I recommend giving the official fork a try. There's been a lot of progress in the last few weeks from what I saw, and it should give a more stable/faster experience.

Edit: The Docker image had pytorch-triton-rocm installed, which also provides triton on its own. The tests fail, though (they seem to fail on the official fork as well, so if you actively need it, I think it's for something like row-wise, it might not work).

@arlo-phoenix
Author

arlo-phoenix commented Jan 28, 2024

> @arlo-phoenix I can confirm that kQuantizeBlockwise works on gfx90* now and that this solution is sane. However, overall I am leaning towards AMD's solution of hipifying once and keeping the HIP code separate, as this would allow further optimization for GCN/CDNA without a mess of ifdefs.

I agree. I'm keeping this PR open for now, since the official fork still only works with hipBLASLt without your patch and this way a working fork is easier to find, but it's definitely not something that should be merged anymore. The reason I opted for defines is that I really didn't like filling the pythoninterface.c with all the ifdefs, and I was/am too inexperienced to come up with a solution for multiple backends. But now that there is an effort to create a proper backend system, this won't really matter anymore. I skimmed through the source code, and an ifdef solution wouldn't even have been possible, since the APIs differ a lot more after all.

If anyone from AMD is reading this (issues aren't enabled in the fork): are there any plans for integrating ROCm upstream in bitsandbytes? There is currently an ongoing discussion on how backends should be integrated, and I think it makes more sense for someone currently working on this to chime in there (#898).
From what I saw, the plan to do it at a slightly higher level, with a thin wrapper around torch tensors, would help with naming differences like col32 and col. I can also help upstream this if it isn't a priority for you.

@Iron-Bound

Iron-Bound commented Jan 29, 2024

Hey wonderful AMD people, we are trying to enable ROCm support. Could one of you assist or put us in contact with someone who can?

Given you currently have a fork, it would be beneficial to have a long-term upstreamed solution and save support effort overall.

> If anyone from AMD is reading this (issues aren't enabled in the fork): are there any plans for integrating ROCm upstream in bitsandbytes? There is currently an ongoing discussion on how backends should be integrated, and I think it makes more sense for someone currently working on this to chime in there (#898). From what I saw, the plan to do it at a slightly higher level, with a thin wrapper around torch tensors, would help with naming differences like col32 and col. I can also help upstream this if it isn't a priority for you.

@Lzy17 @howiejayz @CRobeck @kuhar @jerryyin @keryell @jeffdaily @keryell @dllehr-amd

@CRobeck

CRobeck commented Jan 29, 2024

@Iron-Bound @amathews-amd is grabbing the right folks on our side.

@ehartford

Thank you! This is huge!

@Titus-von-Koeller
Collaborator

Thanks again everyone for supporting us on upstreaming AMD support.

Just wanted to write again to renew our support in making that work out. Currently, there's a lot of work going into deciding how best to handle the backend abstraction in #1077 and #898. Once we're through with that, which should be relatively soon, the path is free for follow-up PRs enabling the individual functionalities in the AMD backend.

Feel free to also chip in on the backend abstraction discussions. This decision will likely set some stuff "in stone", so it is important to get it right.

@arlo-phoenix
Author

arlo-phoenix commented Apr 20, 2024

The ROCm fork: https://github.com/ROCm/bitsandbytes/tree/rocm_enabled
has merged CMake support.

That fork should work:

  • under the currently latest commit c037a30
  • With ROCm 6.0 (6.1 would enable hipblasLt which might break things for non CDNA2+)

for all ROCm-supported GPUs. They also marked the still-failing tests to be skipped, so no tests under tests/ should fail (at least none did for me on gfx1030).

For installing from an empty venv:

  1. Install the required ROCm build of torch
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.0
  2. Install bitsandbytes

You can find the value for AMDGPU_TARGETS with rocminfo | grep gfx

git clone https://github.com/ROCm/bitsandbytes.git
cd bitsandbytes
git checkout c037a306e97ced3c452570132f66aac4e2964056
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -DCOMPUTE_BACKEND=hip -DAMDGPU_TARGETS=gfx1030 -S .
cmake --build . --config Release
pip install .

I think I've kept this open long enough, so I'm closing it. The official fork now has better support for all devices and is easier to find than when I initially discovered it.
I hope my fork and the little guide at the end were helpful. I don't plan to actively contribute anything regarding bitsandbytes ROCm myself anymore. I'll probably show up again when ROCm gets upstreamed (if RDNA doesn't work), but otherwise: goodbye and happy finetuning.

@Iron-Bound

Thanks @arlo-phoenix and everyone involved in getting this done ❤️

@ehartford

what about 6.1?

@IMbackK

IMbackK commented Apr 29, 2024

what about it?

@Titus-von-Koeller
Collaborator

For those willing to alpha test, the ROCm backend is already available for that when compiling from source from the multi-backend-refactor branch.

See today's change to the Readme for more details.

@ehartford

Very interested

Labels
AMD integration high priority (first issues that will be worked on) High Risk Risk of bugs in transformers and other libraries