NVIDIA GPU Support #11
Conversation
@mayankchhabra I'm leaving this in draft mode since merging it would break model downloads. But this is working for me, and I wanted to put it out there in case it's helpful in getting GPU support going for anyone else.
This will likely need to be rebased once work to implement the changes outlined in #8 (comment) is complete.
Absolutely. I can rework this depending on what happens there around model downloading, the storage path, etc. I think it'd make a lot of sense to pass in the model to run and the number of GPU layers to offload as command-line params/env vars, but I'll hold off until #8 is resolved.
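As a rough sketch of what that could look like in `docker-compose.yml` (the variable names `MODEL` and `N_GPU_LAYERS`, and the paths, are illustrative assumptions, not anything this PR has settled on):

```yaml
# Hypothetical sketch: MODEL and N_GPU_LAYERS are illustrative names only.
services:
  llama:
    build: .
    environment:
      - MODEL=/models/model.bin   # path to a pre-downloaded model
      - N_GPU_LAYERS=40           # how many layers to offload to the GPU
    volumes:
      - ./models:/models          # mount the host's model directory
```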
Reopening this. I'll continue to push changes/rebase once some of the more ergonomic PRs are merged (e.g. model downloads).
Thanks for taking this on @edgar971! Now that #19 has been merged, would you like to work on this? Here's another helpful comment from a user running it with CUDA support: #6 (comment) I think the easiest way to get going with CUDA support could be to create separate
We can probably use the same
Can I use N_GPU_LAYERS for the Kubernetes API deployment?
I think a decision to be made here is whether you want to base the whole project on the nvidia/cuda Docker images, or keep separate Dockerfiles for CUDA support. Last I checked, the llama-cpp-python project supports GPU offload, but their GHCR Docker image does not. So you'd either be changing the base image for the whole project, or implementing some kind of conditional to select the correct Dockerfile based on some input/env var.
For the sake of simplicity, I think using separate Dockerfile and docker-compose files for CUDA would be great. We can then add relevant instructions to the readme. A comprehensive refactor down the line can combine everything into a simple run.sh script that uses template docker-compose files based on system config. |
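To make the separate-Dockerfile idea concrete, a `Dockerfile.cuda` could look roughly like this. The base image tag and build flag are assumptions (the cuBLAS flag in particular has changed across llama-cpp-python versions), so treat this as a sketch rather than the final implementation:

```dockerfile
# Sketch of a separate Dockerfile.cuda; image tag and flags are assumptions.
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip

# Build llama-cpp-python with cuBLAS so layers can be offloaded to the GPU.
# (Older releases use -DLLAMA_CUBLAS=on; newer ones use -DGGML_CUDA=on.)
RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip3 install llama-cpp-python

COPY . /app
WORKDIR /app
```

The matching `docker-compose.cuda.yml` would then point `build` at this file, leaving the default CPU images untouched.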
Thanks for helping kickstart the effort on this @edicristofaro! We were able to add CUDA support with #72. Closing this PR now. Cheers! |
I don't see how this will work for Kubernetes deployments... |
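For Kubernetes, GPU access is requested through resource limits rather than Compose's device reservations. A hypothetical Deployment fragment (names and image are illustrative, and it assumes the NVIDIA device plugin is installed on the cluster) might look like:

```yaml
# Hypothetical sketch: requires the NVIDIA device plugin on the cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-gpu
  template:
    metadata:
      labels:
        app: llama-gpu
    spec:
      containers:
        - name: llama
          image: llama-gpu:latest    # assumed CUDA-enabled image
          env:
            - name: N_GPU_LAYERS     # assumed env var, as discussed above
              value: "40"
          resources:
            limits:
              nvidia.com/gpu: 1      # schedules the pod onto a GPU node
```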
This is a quick PR to show how NVIDIA GPU support would work. You may not want to merge this since it also removes the model download steps and presumes you already have the models, but it might serve as a good baseline. It also presumes that you've configured Docker to work with GPUs (see here: https://www.docker.com/blog/wsl-2-gpu-support-for-docker-desktop-on-nvidia-gpus/).
My setup is:
I'm able to run the 13B and 70B models, albeit slowly for the latter, with some offload to the GPU.
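For reference, exposing the GPU to the container in Compose generally comes down to a device reservation like the following (a sketch assuming the NVIDIA Container Toolkit is set up on the host):

```yaml
# Sketch: requires the NVIDIA Container Toolkit on the host.
services:
  llama:
    build: .
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all             # or a specific number of GPUs
              capabilities: [gpu]
```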