Add CPU-loaded multi-GPU quantization #289
Merged
GPU memory is very valuable. Users often don't have much of it, so they turn to techniques like quantization to run the models they want within their limited GPU memory. However, quantizing those models uses a lot of GPU memory as well.
Aside from full CPU-based quantization, the best technique I've found for quantizing models while limiting GPU memory usage is to load the model into CPU memory and offload its layers into GPU memory one at a time. AutoAWQ allows for this, but only supports using one GPU for the entire quantization process, which limits the usable GPU memory to a single GPU's worth.
Here, I've added a few lines of code that allow models loaded in CPU memory, which would otherwise be offloaded to a single GPU for quantization, to be offloaded to all GPUs instead. Layers are automatically distributed across all GPUs on the system via a simple algorithm, and every GPU is utilized in the quantization process.
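The change itself is only a few lines; as an illustration of the idea (not the exact code in this diff), a round-robin assignment of layers to devices might look like the sketch below, where `assign_layer_device` is a hypothetical helper:

```python
import torch

def assign_layer_device(layer_idx: int) -> torch.device:
    """Hypothetical round-robin mapping from a layer index to a GPU.

    Instead of moving every CPU-loaded layer to one fixed GPU for
    quantization, spread the layers across all visible GPUs.
    """
    n_gpus = torch.cuda.device_count()
    return torch.device(f"cuda:{layer_idx % n_gpus}")

# During quantization, each CPU-resident layer would be moved to its
# assigned GPU and quantized there, e.g.:
#
#   for i, layer in enumerate(layers):
#       layer.to(assign_layer_device(i))
#       ...quantize the layer in place...
```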
Thanks to this change, I was able to quantize Mixtral 8x7B on my own system of 4x24GB GPUs. Loading the model with `device_map` spread across all GPUs worked, but quantizing it on top of that resulted in a CUDA OOM error. Loading the model with `device_map` on the CPU also worked, but quantization would offload it to a single 24GB GPU, which only made it through half of the quantization process before CUDA OOM'ing. Loading the model with `device_map` on the CPU and quantizing with these changes offloads the model to all four of my 24GB GPUs and succeeds in quantizing the model.
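For reference, the workflow looks roughly like this (a minimal sketch using AutoAWQ's public API; the quantization config values are illustrative, and I'm assuming `device_map` is forwarded to `transformers` when loading):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mixtral-8x7B-v0.1"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the full model into CPU memory so no single GPU has to hold it.
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="cpu")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# With this change, quantization offloads layers across all visible GPUs
# one at a time instead of pinning everything to a single device.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("mixtral-8x7b-awq")
```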
Model layers are of different sizes, so this simple algorithm is a naive one that can certainly be optimized further to make more efficient use of the user's GPU memory; perhaps this can be done in the future. Additionally, I think one could parallelize the quantization of CPU-loaded models with this technique and see great performance improvements.