Add ROCm support #756

arlo-phoenix · 2023-09-08T13:22:00Z

Edit: See #756 (comment) for current status of ROCm support

Inspired by the llama.cpp ROCm port, I decided to try a similar approach for bitsandbytes: I worked through the hipified CUDA functions and classes and simply redefined them as their HIP equivalents. This only happens if BNB_USE_HIP is set, so merging this shouldn't affect the CUDA code at all. It's also easier to maintain than keeping a parallel HIP code base alive.

This PR adds a hip target to the Makefile and works with the most recent version (0.42.0) on ROCm 5.6+ (6.0 included). To install, run:

# Your ROCM_TARGET can be found with rocminfo | grep gfx
ROCM_TARGET=gfx1030 make hip
pip install .

It won't pass all tests, as some are igemm- or CUDA-specific, but all optimizers work in both 8-bit and 32-bit. I also used this a lot for llama 4-bit inference, which works as well. Besides those, the failing tests are test_autograd.py and anything with double_quant in its name; I assume that also has to do with matrix multiplication and is expected to fail.

Beyond that, igemm / matrix core support for the more recent AMD GPUs is still impossible because of missing instructions in hipBLASLt. There is also an official fork which tries to enable it, but it doesn't seem finished yet. If you want to use the official fork without hipBLASLt, @IMbackK provided a patch for it which should work on all ROCm-supported GPUs.

I'm making this a draft for now, as it is still not well tested and I haven't really updated the documentation yet. From an actual code standpoint not much will change on my side as I only own a gfx1030 GPU and thus can't test igemm support.

Closes #47, closes #107, closes #681

AloneLiberty · 2023-09-23T00:37:14Z

https://github.com/TimDettmers/bitsandbytes/blob/a13af4542f5a163d8e9eacb0b8b2a5b26d9e2b15/Makefile#L19-L23

You have a logic bug there: if ROCM_HOME is empty, it skips the ROCM_TARGET check and tries to compile anyway (and fails later due to missing libraries/headers).

amd-zoybai · 2023-09-26T17:39:52Z

Tested on Ubuntu 22.04.3 with RX 7900 XTX and passed with the following command:
ROCM_HOME=/opt/rocm ROCM_TARGET=gfx1100 make hip

Without ROCM_HOME make cannot find proper headers to build.

CorvetteCole · 2023-10-05T06:21:37Z

I get this error: fatal error: 'hipblaslt/hipblaslt.h' file not found which I can't figure out

CorvetteCole · 2023-10-05T13:43:05Z

Can compile using the official rocm pytorch docker image and then pip install on the original desktop environment and it works just fine though

arlo-phoenix · 2023-10-07T14:28:02Z

https://github.com/TimDettmers/bitsandbytes/blob/a13af4542f5a163d8e9eacb0b8b2a5b26d9e2b15/Makefile#L19-L23

You have a logic bug there: if ROCM_HOME is empty, it skips the ROCM_TARGET check and tries to compile anyway (and fails later due to missing libraries/headers).

Not really that used to working with Makefiles directly. If it's empty it should skip that check since it doesn't matter what target you set as it won't be able to find libraries / binaries with ROCM_HOME being empty. I just tried to roughly copy the cuda checks since that also only takes one parameter. I agree it should throw an error, I'll see what I can do, thanks.

Tested on Ubuntu 22.04.3 with RX 7900 XTX and passed with the following command: ROCM_HOME=/opt/rocm ROCM_TARGET=gfx1100 make hip

Without ROCM_HOME make cannot find proper headers to build.

If ROCM_HOME isn't already set, it will try to find it automatically with the command which hipcc | rev | cut -d'/' -f4- | rev. Could you check which hipcc? For me (in the official docker image) it's at /opt/rocm/hip/bin/hipcc. I assume it probably finds the other hipcc link at /opt/rocm/bin/hipcc, which would require using -f3- in that cut. An alternative to using that command could be to just default to /opt/rocm, since that seems to be the usual spot.
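To make the path arithmetic above concrete, here is a minimal Python sketch of what the rev | cut | rev pipeline computes. The helper name rocm_home_from_hipcc is made up for illustration and is not part of the Makefile or of bitsandbytes; -f4- corresponds to dropping three trailing path components and -f3- to dropping two.

```python
from pathlib import PurePosixPath


def rocm_home_from_hipcc(hipcc_path: str, levels_up: int) -> str:
    """Drop `levels_up` trailing path components from the hipcc location.

    Hypothetical helper: `cut -d'/' -f4-` on the reversed path equals
    levels_up=3, and `-f3-` equals levels_up=2.
    """
    path = PurePosixPath(hipcc_path)
    for _ in range(levels_up):
        path = path.parent
    return str(path)


# hipcc inside the hip/ subdirectory: dropping three components works.
nested = rocm_home_from_hipcc("/opt/rocm/hip/bin/hipcc", 3)
# hipcc directly under bin/: dropping three components overshoots ...
overshoot = rocm_home_from_hipcc("/opt/rocm/bin/hipcc", 3)
# ... while dropping two gives the intended ROCm root.
direct = rocm_home_from_hipcc("/opt/rocm/bin/hipcc", 2)
print(nested, overshoot, direct)
```

Because the right depth depends on which hipcc link is found first, a fixed /opt/rocm default is the simpler fallback.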

I get this error: fatal error: 'hipblaslt/hipblaslt.h' file not found which I can't figure out

hipblaslt has been available since ROCm 5.6 (which is why I named my fork after it). If you are below that, this fork won't work and I'd recommend one of the other forks that can be found in the linked issues (no 4-bit support though). I don't know if ROCm 5.7 changed anything. From what I read, 6.0 is going to be the release where things stop being backwards compatible, so I don't think that should already be the case.

person4268 · 2023-10-08T08:34:42Z

FWIW, Arch Linux's ROCM doesn't seem to be distributing hipblaslt (yet), so I am also getting missing hipblaslt.h errors on 5.6 there. It doesn't seem to be a straightforward (for me) thing to build, either.

arlo-phoenix · 2023-10-08T11:03:35Z

FWIW, Arch Linux's ROCM doesn't seem to be distributing hipblaslt (yet), so I am also getting missing hipblaslt.h errors on 5.6 there. It doesn't seem to be a straightforward (for me) thing to build, either.

You don't need the actual library, just the headers, as igemm is still disabled for this anyway (they were needed for this to compile; I could probably have made some placeholder defines instead, but since I had the header I decided not to). The headers can be found under https://github.com/ROCmSoftwarePlatform/hipBLASLt/tree/develop/library/include and you could just put those in /opt/rocm/include. I don't know if they have any other dependencies, but other than that it should work.

person4268 · 2023-10-09T20:16:15Z

That worked (at least as far as being able to execute python -m bitsandbytes; we'll see if it works once the model I'm trying to use finishes downloading), thanks. Do note that I had to create hipblaslt-export.h and hipblaslt-version.h by hand, as those are generated at build time. (I just copied the ones from hipblas and pretty much did s/HIPBLAS/HIPBLASLT/g in both upper and lower case.)
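The substitution described above can be sketched in Python as follows. The function name and the sample header lines are illustrative assumptions, not the real contents of the hipBLAS headers, and the result is only a stand-in for the files that the hipBLASLt build would normally generate.

```python
def hipblas_to_hipblaslt(header_text: str) -> str:
    """Apply s/HIPBLAS/HIPBLASLT/g in upper and lower case.

    Hypothetical helper. It is not idempotent: running it on text that
    already says HIPBLASLT would produce HIPBLASLTLT.
    """
    text = header_text.replace("HIPBLAS", "HIPBLASLT")
    return text.replace("hipblas", "hipblaslt")


# Illustrative input resembling an export header, not the actual file.
sample = "#ifndef HIPBLAS_EXPORT_H\n#ifdef hipblas_EXPORTS\n"
converted = hipblas_to_hipblaslt(sample)
print(converted)
```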

ccbadd · 2023-10-31T15:07:41Z

How do you deal with multiple GPUs of different targets? I have both MI100s (gfx908) and W6800s (gfx1030) in my machine. Can I use ROCM_TARGET=gfx908;gfx1030?

bog-dan-ro · 2023-12-11T13:01:30Z

Any joy with this patch?

fakerybakery · 2023-12-16T00:42:11Z

Hi, is there any update on this? Would love to have this merged!

Wintoplay · 2023-12-16T01:42:09Z

@arlo-phoenix pls add support for ROCM 5.7

arlo-phoenix · 2023-12-16T11:00:19Z

Hi, is there any update on this? Would love to have this merged!

Since there now seems to be an official plan to extend support to multiple platforms / hardware targets, this will probably have to be adjusted if it's going to be merged. I personally want to wait for ROCm 6.0 since that might break stuff. And even if this gets merged, it would just be basic support without matrix cores, as I don't have a more recent GPU. And even if I did, the hipblasLt project (at least according to their docs) only officially supports gfx90a and gfx94x (I assume it's going to expand to MI300 as well, but I haven't seen anything about RDNA 3). The previously missing instructions from hipblasLt are now there (even if they don't respond to issues .-., had to grep), so it's not impossible to add, but as I said I can't test it, so it'd be nice if someone else did that.

@arlo-phoenix pls add support for ROCM 5.7

While ROCm is annoying with having to compile for each arch, nothing should've broken between ROCm 5.6 and 5.7 (will likely happen with 6.0).

Since many people still have problems with building this:
I added an /opt/rocm fallback on the rocm branch for this PR. I also updated the main branch of my fork, which now includes a hipblaslt-compat header so you don't actually need hipblaslt, as a lot of distros don't distribute it yet (I just tried to build this on an Arch system). That header was the only reason I gave my fork such a specific name, so these build instructions should work even before ROCm 5.6 and on a lot more systems:

git clone https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6
cd bitsandbytes-rocm-5.6

ROCM_TARGET=gfx1030 make hip
pip install .

At least on my system you don't even need to set the ROCM_TARGET as it will just build it for all targets. I still recommend it for a faster build process. For finding it use

/opt/rocm/bin/rocminfo | grep gfx

or just look for your GPU under https://www.llvm.org/docs/AMDGPUUsage.html#processors
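The rocminfo | grep gfx step can also be scripted. The following is a small sketch that pulls the distinct gfx names out of rocminfo-style text; the function name gfx_targets and the sample text are assumptions made for illustration, and the sample only imitates the general shape of rocminfo output.

```python
import re


def gfx_targets(rocminfo_output: str) -> list[str]:
    """Return the distinct gfx architecture names in first-seen order."""
    seen: list[str] = []
    for name in re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_output):
        if name not in seen:
            seen.append(name)
    return seen


# Illustrative text only; real rocminfo output is much longer.
sample = """
  Name:                    gfx1030
      Name:                    amdgcn-amd-amdhsa--gfx1030
  Name:                    gfx908
"""
print(gfx_targets(sample))
```

On a real system the input could come from subprocess.run(["/opt/rocm/bin/rocminfo"], capture_output=True, text=True).stdout, assuming rocminfo is installed at that path.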

Edit: Just noticed 6.0 was already out .-., I'll update this once the official docker images are updated as well

Wintoplay · 2023-12-21T03:37:28Z

@arlo-phoenix rocm 6.0 docker has arrived https://hub.docker.com/r/rocm/pytorch/tags

arlo-phoenix · 2023-12-21T18:34:14Z

@arlo-phoenix rocm 6.0 docker has arrived https://hub.docker.com/r/rocm/pytorch/tags

Thanks for the info! Only tested the basic stuff and will probably only further test after the holidays, but it still compiles, 4 bit works and all optimizers also all work (at least according to pytest). So summed up ROCm 6.0 breaks nothing in this after all.

@Titus-von-Koeller, I only skimmed through #898, but from what I see the idea is to add the ability to have different backends with one of them being the current implementation now under a CudaBackend. From my perspective this won't really change this PR that much then (only gotta move some checks) since there isn't really a need for a separate backend for HIP and AMD GPU's should just use the CudaBackend as well.

One improvement could be moving the defines to a separate header hip-compat.h so it's better separated.
The Makefile definitely still needs work, as already said never worked with them directly
If there is a move towards a CMakeFile for Windows Support (I think there are several PR's) I could try to make this work with CMake. Should be easier to add good integration that doesn't bother Cuda compilation as I'm more experienced with that

IMbackK · 2023-12-22T14:23:15Z

One thing I should note about this PR is that since it does not support wave64, it should really refuse to compile for those targets or assert at run time; right now it produces incorrect results.

All AMD AI/compute-focused GPUs are wave64-only (i.e. MI25, MI50, MI100, MI210, all the way to the latest MI300). It's only consumer GPUs newer than the Radeon VII that can do both wave64 and wave32, so this PR excludes the very GPUs that are best suited to ML.

purefire · 2023-12-28T07:52:40Z

Hi, is there any update on this? Would love to have this merged!

Since there seems to now be an official plan to extend support to multiple platforms / hardware targets this will probably have to adjust if it's gonna be merged. I personally want to wait for ROCm 6.0 since that might break stuff. And even if this gets merged it would just be basic support without matrix cores as I don't have a more recent one. And even if I did, the hipblasLt project (at least according to their docs) only officially supports gfx90a, gfx94x (I assume it's gonna expand to MI300 as well, but I haven't seen anything about RDNA 3). The previously missing instructions from hipblasLt are now there (even if they don't respond to issues .-., had to grep) so it's not impossible to add, but as I said I can't test so it'd be nice if someone else did that.

@arlo-phoenix pls add support for ROCM 5.7

While ROCm is annoying with having to compile for each arch, nothing should've broken between ROCm 5.6 and 5.7 (will likely happen with 6.0).

Since many people still have problems with building this: I added an /opt/rocm fallback on the rocm branch for this PR. I also updated the main branch of my fork, which now includes a hipblaslt-compat header so you don't actually need hipblaslt, as a lot of distros don't distribute it yet (I just tried to build this on an Arch system). That header was the only reason I gave my fork such a specific name, so these build instructions should work even before ROCm 5.6 and on a lot more systems:
git clone https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6
cd bitsandbytes-rocm-5.6

ROCM_TARGET=gfx1030 make hip
pip install .
At least on my system you don't even need to set the ROCM_TARGET as it will just build it for all targets. I still recommend it for a faster build process. For finding it use
/opt/rocm/bin/rocminfo | grep gfx
or just look for your GPU under https://www.llvm.org/docs/AMDGPUUsage.html#processors

Edit: Just noticed 6.0 was already out .-., I'll update this once the official docker images are updated as well

Hello, any news about 6.0 update?

Iron-Bound · 2023-12-29T01:42:59Z

Seconding this ^
Mi300 and H100 are both battling at the moment, so would like to use my 7900xtx!

arlo-phoenix · 2023-12-30T12:21:45Z

@purefire

Hello, any news about 6.0 update?

Status quo is still

Thanks for the info! Only tested the basic stuff and will probably only further test after the holidays, but it still compiles, 4 bit works and all optimizers also all work (at least according to pytest). So summed up ROCm 6.0 breaks nothing in this after all.

not finetuning anything atm, but since it still compiles and tests succeed it should still work as expected. The only thing I expected to break was the makefile or some includes or defines becoming deprecated, but didn't see anything.

@Iron-Bound

Seconding this ^ Mi300 and H100 are both battling at the moment, so would like to use my 7900xtx!

7900XTX should work, it's wavefront 64 that doesn't work and 7900XTX has the normal wavefront size 32. It would not become a battle here though as this doesn't support hipblaslt yet meaning no matrix cores are used and so the 7900XTX /MI300 wouldn't perform well at all. This isn't something I can implement/test myself so someone else will need to do that. The changes shouldn't be too large, just a small python check if the hip device supports hipblaslt where gemm support is checked and adjusting the Makefile to actually use the library.
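As a rough idea of what such a Python-side check could look like, here is a sketch. The function name supports_hipblaslt is hypothetical, and the architecture list is only the one mentioned earlier in this thread (gfx90a and gfx94x according to the hipblasLt docs at the time), so it would need to be verified against current documentation before use.

```python
# Assumption: architectures taken from this thread, not a verified list.
HIPBLASLT_ARCH_PREFIXES = ("gfx90a", "gfx94")


def supports_hipblaslt(gcn_arch_name: str) -> bool:
    """Guess hipblasLt support from an arch string such as
    'gfx90a:sramecc+:xnack-' (feature flags after ':' are ignored)."""
    arch = gcn_arch_name.split(":")[0]
    return arch.startswith(HIPBLASLT_ARCH_PREFIXES)


for name in ("gfx90a:sramecc+:xnack-", "gfx942", "gfx1030", "gfx1100"):
    print(name, supports_hipblaslt(name))
```

The arch string itself would have to come from the runtime; on ROCm builds of PyTorch the device properties are assumed to expose it, but that is not checked here.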

@IMbackK

One thing I should note about this PR is that since it does not support wave64, it should really refuse to compile for those targets or assert at run time; right now it produces incorrect results.

All AMD AI/compute-focused GPUs are wave64-only (i.e. MI25, MI50, MI100, MI210, all the way to the latest MI300). It's only consumer GPUs newer than the Radeon VII that can do both wave64 and wave32, so this PR excludes the very GPUs that are best suited to ML.

That's interesting. I didn't find anything last time because I didn't bother looking into large architecture description PDFs just to find a wavefront size, but you are right. Then it's a bit more important; I assumed it was only CDNA1 that supported only wavefront size 64.

I'll try to think of a good way to include them anyways. The wavefront size override won't actually affect how everything is executed, it's just that some compile time asserts are not triggered anymore (from what I remember). The define override should still be removed / only be called if something like FORCE_WAVEFRONT32 is set. I'll try to see if I can just trigger a trap in device code for the unsupported functions so it compiles. If that's actually the case it would be enough to add a one time warning with a fallback in the affected block size functions to actually use the next larger block size or throw an exception if that doesn't work. The problem with the second solution will be that most projects use the smallest BLOCK_SIZE for 4bit stuff which means e.g. https://github.com/TimDettmers/bitsandbytes/blob/f63abb5a0d0bc971d28972ba890a9e59596caac4/csrc/kernels.cu#L3976 for FP4 is called which doesn't work with the larger wavefront size of 64. So that fallback / exception if impossible with tensor size would need to go here https://github.com/TimDettmers/bitsandbytes/blob/f63abb5a0d0bc971d28972ba890a9e59596caac4/bitsandbytes/functional.py#L690

Not experienced at all with that so no idea if that even works, but if it doesn't work to just use the next larger BLOCK_SIZE we can always just throw an exception and nothing that doesn't work would be called. Same would need to be done for dequantize.
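The fallback idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the function name, the list of block sizes, and the premise that 128 is the smallest block size usable with a wavefront size of 64. It is not the code in functional.py.

```python
import warnings

# Assumption: the block sizes the blockwise quantization kernels accept.
SUPPORTED_BLOCKSIZES = (64, 128, 256, 512, 1024, 2048, 4096)


def effective_blocksize(requested: int, warp_size: int) -> int:
    """Return a block size usable on the device, stepping up if needed."""
    if requested not in SUPPORTED_BLOCKSIZES:
        raise ValueError(f"unsupported blocksize {requested}")
    # Assumption: with 64-wide wavefronts the 64-element kernel is unusable.
    minimum = 128 if warp_size == 64 else 64
    if requested >= minimum:
        return requested
    warnings.warn(
        f"blocksize {requested} is not supported with warp size "
        f"{warp_size}; using {minimum} instead"
    )
    return minimum


print(effective_blocksize(64, 32), effective_blocksize(64, 64))
```

Since the block size determines how many absmax values are stored, the same adjusted value would have to be used on the dequantize side as well.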

arlo-phoenix · 2024-01-12T17:08:18Z

@IMbackK my latest commit tries to adjust the kQuantizeBlockwise function to work with wavefront 64 (just only use 1 load, etc. per thread instead of 2). Didn't really test much, but tests succeeded for FP16 (might cause issues with nf4,fp4). I believe test_generation.py uses the 64 BLOCK_SIZE variant as well which would trigger the adjusted code and those succeeded. Even if my code doesn't work, I believe it shouldn't be too hard to edit that function to work anyways under WAVEFRONT/warp size 64. This makes integration a lot easier and wouldn't require a bunch of ifdefs or guards in python code.

Sadly this is very hard to test, as most tests don't just fail but crash without igemm support. I can run them one by one, but this is a bit annoying. I'll see if there's a pytest setting to just continue, or if this is easily adjustable.

@Iron-Bound I found this https://github.com/ROCmSoftwarePlatform/bitsandbytes/tree/rocm_enabled linked on some huggingface docs. That fork is almost as old as mine, I wonder why they never advertised it, but they link hipblaslt and don't disable it from what I saw in the makefile. Like the output library is called libbitsandbytes_hip_nohipblaslt, but didn't actually see it disabled so it might be worth to give this a try. I also don't think tinygrad is really gonna help, I have no clue how hipblaslt works internally and don't plan on researching to replicate functions, so the only thing I can do is use functions they provide to replace the cublaslt functions.

IMbackK · 2024-01-12T22:37:13Z

@arlo-phoenix I can confirm that kQuantizeBlockwise works on gfx90* now and this solution is sane. However, overall I am leaning towards AMD's solution of hipifying once and keeping the HIP code separate, as this would allow further optimization for GCN/CDNA without a mess of ifdefs.

The amd repo is also interesting, the reason they haven't published it widely is presumably because it is still quite unfinished as evidenced by the myriad of disabled tests and the makefile jank. That said after patching the makefile as below to disable hipblaslt usage amds version also works (tested on gfx1030, gfx900, gfx906 and gfx908) for the purposes of the non-amd-disabled tests i tried, transformers bnb integration and the 8bit adamw.

diff --git a/Makefile b/Makefile
index 1e12c8e..4d473b8 100644
--- a/Makefile
+++ b/Makefile
@@ -31,7 +31,7 @@ INCLUDE :=  -I $(CUDA_HOME)/include -I $(ROOT_DIR)/csrc -I $(CONDA_PREFIX)/inclu
 LIB := -L $(CUDA_HOME)/lib64 -lcudart -lcublas -lcublasLt -lcusparse -L $(CONDA_PREFIX)/lib
 
 INCLUDE_ROCM := -I $(ROCM_HOME)/include -I $(ROOT_DIR)/csrc -I $(CONDA_PREFIX)/include -I $(ROOT_DIR)/include
-LIB_ROCM := -L $(ROCM_HOME)/lib -lhipblas -lhipblaslt -lhiprand -lhipsparse -L $(CONDA_PREFIX)/lib
+LIB_ROCM := -L $(ROCM_HOME)/lib -lhipblas -lhiprand -lhipsparse -L $(CONDA_PREFIX)/lib
 
 # NVIDIA NVCC compilation flags
 COMPUTE_CAPABILITY += -gencode arch=compute_50,code=sm_50 # Maxwell
@@ -107,9 +107,9 @@ cuda12x: $(BUILD_DIR) env
        $(GPP) -std=c++14 -DBUILD_CUDA -shared -fPIC $(INCLUDE) $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o $(BUILD_DIR)/link.o $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_cuda$(CUDA_VERSION).so $(LIB)
 
 hip: $(BUILD_DIR) env
-       $(HIPCC) -std=c++14 -fPIC -c $(INCLUDE_ROCM) $(LIB_ROCM) $(CSRC)/ops.hip -o $(BUILD_DIR)/ops.o
-       $(HIPCC) -std=c++14 -fPIC -c $(INCLUDE_ROCM) $(LIB_ROCM) $(CSRC)/kernels.hip -o $(BUILD_DIR)/kernels.o
-       $(GPP) -std=c++14 -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_HCC__ -DBUILD_HIP -shared -fPIC $(INCLUDE_ROCM) $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_hip_nohipblaslt.so $(LIB_ROCM)
+       $(HIPCC) -std=c++14 -fPIC -c $(INCLUDE_ROCM) $(LIB_ROCM) -DNO_HIPBLASLT $(CSRC)/ops.hip -o $(BUILD_DIR)/ops.o
+       $(HIPCC) -std=c++14 -fPIC -c $(INCLUDE_ROCM) $(LIB_ROCM) -DNO_HIPBLASLT $(CSRC)/kernels.hip -o $(BUILD_DIR)/kernels.o
+       $(GPP) -std=c++14 -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_HCC__ -DNO_HIPBLASLT -DBUILD_HIP -shared -fPIC $(INCLUDE_ROCM) $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_hip_nohipblaslt.so $(LIB_ROCM)
 
 cpuonly: $(BUILD_DIR) env
        $(GPP) -std=c++14 -shared -fPIC -I $(ROOT_DIR)/csrc -I $(ROOT_DIR)/include $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_cpu.so


ehartford · 2024-01-28T07:15:18Z

I wanted to try this out, so I installed it, (and also https://github.com/ROCmSoftwarePlatform/triton which it needs) and I got:

cannot import name 'cdiv' from 'triton'

If you happen to recognize this error, could you please set me straight?

arlo-phoenix · 2024-01-28T19:25:13Z

I wanted to try this out, so I installed it, (and also https://github.com/ROCmSoftwarePlatform/triton which it needs) and I got:

cannot import name 'cdiv' from 'triton'

If you happen to recognize this error, could you please set me straight?

sry never needed to use triton for what I did so far. I assume you are on a newer GPU then with matrix core support. If so I recommend giving the official fork a try. There's been a lot of progress in the last weeks from what I saw and should give a more stable/faster experience.

Edit: The docker image had pytorch-triton-rocm installed, which also provides triton on its own. Tests fail though (they seem to fail on the official fork as well, so if you actively need it, I think it's for something like row-wise, it might not work).

arlo-phoenix · 2024-01-28T19:44:56Z

@arlo-phoenix I can confirm that kQuantizeBlockwise works on gfx90* now and this solution is sane. However, overall I am leaning towards AMD's solution of hipifying once and keeping the HIP code separate, as this would allow further optimization for GCN/CDNA without a mess of ifdefs.

I agree. I'm keeping this PR open for now since the official fork still only works with hipblaslt without your patch so a working fork is easier to find, but definitely not something that should be merged anymore. The reason I opted for defines is that I really didn't like filling the pythoninterface.c with all the ifdefs and I was/am too inexperienced to come up with a solution for multiple backends. But now with there being an effort to create a proper backend system this won't really matter anymore. I skimmed through the source code and an ifdef solution wouldn't even have been possible since the API's differ a lot more after all.

If anyone from AMD is reading this (issues aren't enabled in the fork), are there any plans for integrating ROCm upstream in bitsandbytes? There is currently an ongoing discussion on how backends should be integrated and I think it makes more sense for someone currently working on this to chime in there #898.
From what I saw, the plan is for it to sit at a slightly higher level, with a thin wrapper around torch tensors, which would help with naming differences like col32 and col. I can also help upstream this if it isn't a priority for you.

Iron-Bound · 2024-01-29T09:14:48Z

Hey wonderful AMD people, we are trying to enable ROCm support. Could one of you assist or put us in contact with the right person?

Given you currently have a fork it would be beneficial to have a long term up-streamed solution and save support effort overall.

If anyone from AMD is reading this (issues aren't enabled in the fork), are there any plans for integrating ROCm upstream in bitsandbytes? There is currently an ongoing discussion on how backends should be integrated and I think it makes more sense for someone currently working on this to chime in there #898. From what I saw, the plan is for it to sit at a slightly higher level, with a thin wrapper around torch tensors, which would help with naming differences like col32 and col. I can also help upstream this if it isn't a priority for you.

@Lzy17 @howiejayz @CRobeck @kuhar @jerryyin @keryell @jeffdaily @keryell @dllehr-amd

CRobeck · 2024-01-29T15:30:58Z

@Iron-Bound @amathews-amd is grabbing the right folks on our side.

ehartford · 2024-01-30T00:07:06Z

Thank you! This is huge!

Titus-von-Koeller · 2024-02-27T19:34:45Z

Thanks again everyone for supporting us on upstreaming AMD support.

Just wanted to write again to renew our support in making that work out. Currently, there's a lot of work on deciding on how to best handle the backend abstraction in #1077 and #898. Once we're through with that relatively soon the path is free for follow-up PRs with enabling the individual functionalities in the AMD backend.

Feel free to also chip in on the backend abstraction discussions. This decision will likely set some stuff "in stone", so it is important to get it right.

arlo-phoenix · 2024-04-20T19:21:50Z

The ROCm fork: https://github.com/ROCm/bitsandbytes/tree/rocm_enabled
has merged CMake support.

That fork should work:

under the currently latest commit c037a30
With ROCm 6.0 (6.1 would enable hipblasLt which might break things for non CDNA2+)

for all ROCm-supported GPUs. They also marked the still-failing tests to be skipped, so no tests under tests/ should fail (none did for me on gfx1030).

For installing from an empty venv:

Install requirement ROCm torch

pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.0

Install bitsandbytes

you can find your AMDGPU_TARGET with rocminfo | grep gfx

git clone https://github.com/ROCm/bitsandbytes.git
cd bitsandbytes
git checkout c037a306e97ced3c452570132f66aac4e2964056
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -DCOMPUTE_BACKEND=hip -DAMDGPU_TARGETS=gfx1030 -S .
cmake --build . --config Release
pip install .

I think I've kept this open long enough so I'm closing this. The official fork now has better support for all devices and is easier to find than when I initially discovered it.
I hope my fork and little guide at the end was helpful. I don't plan to actively contribute anything regarding bitsandbytes rocm myself anymore. I'll probably show up again when ROCm gets upstreamed (if RDNA doesn't work), but otherwise: Goodbye and happy finetuning.

Iron-Bound · 2024-04-21T16:04:18Z

Thanks @arlo-phoenix and everyone involved to get this done ❤️

ehartford · 2024-04-28T23:59:26Z

what about 6.1?

IMbackK · 2024-04-29T19:17:46Z

what about it?

Titus-von-Koeller · 2024-05-24T20:52:42Z

For those willing to alpha test, the ROCm backend is already available for that when compiling from source from the multi-backend-refactor branch.

See today's change to the Readme for more details.

ehartford · 2024-05-24T20:59:52Z

Very interested

Airradda mentioned this pull request Oct 23, 2023

Bitsandbytes dos not support ROCm nixified-ai/flake#56

Open

Titus-von-Koeller mentioned this pull request Dec 15, 2023

[RFC] cross-platform: Refactoring bitsandbytes/cuda_setup #918

Closed

arlo-phoenix force-pushed the rocm branch 6 times, most recently from 7c77be4 to 3c9c262 Compare December 16, 2023 10:17

gururise mentioned this pull request Dec 20, 2023

Feature Request: ROCm support (AMD GPU) #107

Open

danielzgtg mentioned this pull request Dec 22, 2023

System shared libraries should be preferred when newer than the bundled ones in the wheel ROCm/pytorch#1340

Open

Titus-von-Koeller added the AMD integration label Dec 27, 2023

arlo-phoenix and others added 7 commits January 22, 2024 16:57

Add ROCm make target

5ee2ec1

ROCM_TARGET=gfx1030 make hip Uses define BITS_AND_BYTES_USE_ROCM to redefine cuda functions to ROCm equivalent credit to previous ports: Co-authored-by: broncotc <i@broncotc.net> Co-authored-by: agrocylo <130291676+agrocylo@users.noreply.github.com>

Add ROCm support to python library

f9e2a84

disables igemm for now and adds path to compiled library libbitsandbytes_hip_nohipblaslt

Add ROCm information to README

78eecb3

Makefile: Add fallback to /opt/rocm home

940c52e

make cut one less

a485a02

Rename BITS_AND_BYTES_USE_ROCM to BNB_USE_HIP

32cd5e0

Adjust kQuantizeBlockwise to work with WARP size 64

e03a8bd

arlo-phoenix force-pushed the rocm branch from 62fa8a8 to e03a8bd Compare January 22, 2024 15:58

Make sure DATA_NUM_PER_TH <= CUB_NUM_PER_TH

ac20c05

the unrolls already somehow worked correctly before, but they shouldn't have.

Titus-von-Koeller mentioned this pull request Feb 4, 2024

[RFC] Cross-Platform Refactor: Overview + Link Hub #997

Closed

matthewdouglas mentioned this pull request Feb 6, 2024

Distribute pip wheels for the architecture they are built for #1043

Closed

cheesesashimi mentioned this pull request Mar 9, 2024

Enable AMD GPU acceleration for local training instructlab/instructlab#459

Closed

JohnDoe02 mentioned this pull request Mar 17, 2024

It's working. Training LORA of the latest version of kohya_ss on AMD GPU,Ubuntu 22.04.2 LTS ,test on RX6800 ,sd1.5&sdxl bmaltais/kohya_ss#1484

Closed

arlo-phoenix closed this Apr 20, 2024

xzuyn mentioned this pull request Apr 20, 2024

Add experimental install guide for ROCm axolotl-ai-cloud/axolotl#1550

Open
