container-images: add a kompute variant #235

Open
slp wants to merge 4 commits into main

Conversation

@slp (Collaborator) commented Oct 4, 2024

Add a container variant that builds llama.cpp with the kompute backend, enabling GPU acceleration through Vulkan.

Tested with krunkit on Apple Silicon with mistral-7b-instruct-v0.2.Q4_0.gguf and Wizard-Vicuna-13B-Uncensored.Q4_0.gguf (models of >=13B parameters benefit the most from being offloaded to the GPU versus running on the CPU).

TODO:

  • Teach ramalama to choose the best container image for the context (a rough sketch of what this could look like follows after this list).
  • Ensure every operation works transparently when operating on a container.
  • Add some Q4_0 models to shortnames.conf
  • Expose shortnames.conf into the container.
  • Expose llama.cpp's server port when running in a container.
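
Purely to illustrate the first TODO item above (and not part of this PR), here is a rough sketch of how image selection might look. The helper name, image tags, and the /dev/dri heuristic are assumptions, not the project's actual API:

import os
import sys

def select_container_image(gpu_requested: bool) -> str:
    """Pick a container image for the current host (illustrative only)."""
    if gpu_requested:
        if sys.platform == "darwin":
            # krunkit VMs expose a Vulkan-capable GPU device to the guest.
            return "quay.io/ramalama/ramalama-kompute:latest"
        if sys.platform == "linux" and os.path.exists("/dev/dri"):
            # A render node suggests a usable Vulkan driver in the container.
            return "quay.io/ramalama/ramalama-kompute:latest"
    return "quay.io/ramalama/ramalama:latest"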

@rhatdan (Member) commented Oct 4, 2024

Would it make more sense to layer this on the original image?

from quay.io/ramalama/ramalama:latest
...

@ericcurtin (Collaborator) commented Oct 4, 2024

That's not a bad idea, @rhatdan. We can just replace the CPU-only binaries with these kompute ones; less duplication.

We do have an issue, though: we've been trying to make these images ubi9-based, as Scott McCarthy requested, and I think that makes sense, but UBI images are missing access to a small set of required packages. We can make it work by pulling the CentOS Stream ones via something like:

FROM quay.io/ramalama/ramalama:latest

ARG LLAMA_CPP_SHA=2a24c8caa6d10a7263ca317fa7cb64f0edc72aae
# renovate: datasource=git-refs depName=ggerganov/whisper.cpp packageName=https://github.com/ggerganov/whisper.cpp gitRef=master versioning=loose type=digest
ARG WHISPER_CPP_SHA=5caa19240d55bfd6ee316d50fbad32c6e9c39528

# UBI lacks some of the required Vulkan packages, so pull them from the
# CentOS Stream 9 repositories.
RUN dnf config-manager --add-repo https://mirror.stream.centos.org/9-stream/AppStream/$(uname -m)/os/
RUN dnf config-manager --add-repo https://mirror.stream.centos.org/9-stream/BaseOS/$(uname -m)/os/
RUN curl -o /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-Official http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-Official
RUN rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-Official

RUN dnf install -y vulkan-headers vulkan-loader-devel vulkan-tools glslc \
      glslang && \
    dnf copr enable -y slp/mesa-krunkit epel-9-$(uname -m) && \
    dnf install -y mesa-vulkan-drivers-24.1.5-101 && \
    dnf clean all && \
    rm -rf /var/cache/*dnf*

ENV GGML_CCACHE=0

# Build llama.cpp with the Kompute (Vulkan) backend enabled.
RUN git clone --recurse-submodules https://github.com/ggerganov/llama.cpp && \
    cd llama.cpp && \
    git reset --hard ${LLAMA_CPP_SHA} && \
    cmake -B build -DCMAKE_INSTALL_PREFIX:PATH=/usr -DGGML_KOMPUTE=1 -DGGML_CCACHE=0 && \
    cmake --build build --config Release -j $(nproc) && \
    cmake --install build && \
    cd / && \
    rm -rf llama.cpp

# Build whisper.cpp and install its binaries under distinct names.
RUN git clone https://github.com/ggerganov/whisper.cpp.git && \
    cd whisper.cpp && \
    git reset --hard ${WHISPER_CPP_SHA} && \
    make -j $(nproc) && \
    mv main /usr/bin/whisper-main && \
    mv server /usr/bin/whisper-server && \
    cd / && \
    rm -rf whisper.cpp

But then it becomes kind of a hybrid UBI9/Stream image.

@ericcurtin (Collaborator) commented Oct 4, 2024

Could you also turn on x86_64 EPEL 9 builds for this, @slp?

https://copr.fedorainfracloud.org/coprs/slp/mesa-krunkit/

I wanna try that out with an x86_64 GPU 😄

@rhatdan (Member) commented Oct 4, 2024

I think EPEL would be a better solution than CentOS Stream, if the packages are all available there.

@ericcurtin (Collaborator) commented Oct 4, 2024

They aren't in EPEL, @rhatdan :'( They are in AppStream/BaseOS but not in UBI.

@ericcurtin (Collaborator) commented Oct 4, 2024

The UBI repos seem to be a subset of the RHEL versions of the repos:

https://cdn-ubi.redhat.com/content/public/ubi/dist/ubi9/9/x86_64/appstream/os/Packages/
https://cdn-ubi.redhat.com/content/public/ubi/dist/ubi9/9/x86_64/baseos/os/Packages/

Not sure exactly what determines whether a package is RHEL-only or accessible in both RHEL and UBI.

@slp (Collaborator, Author) commented Oct 4, 2024

I'm afraid the situation with the Vulkan-related packages in CentOS Stream 9 is kind of broken (e.g. spirv-headers-devel is nowhere to be found, even though some packages that depend on it are present in the repos). I can't even rebuild the same Mesa spec in my COPR.

I think F40 is the only option for now. We can easily switch to Stream 9 once the Vulkan situation is fixed, and eventually to ubi9.

@ericcurtin (Collaborator):

@slp that package in particular looks like it's in EPEL9

@ericcurtin (Collaborator):

This seemed to work reasonably ok:

FROM quay.io/ramalama/ramalama:latest

ARG LLAMA_CPP_SHA=2a24c8caa6d10a7263ca317fa7cb64f0edc72aae
# renovate: datasource=git-refs depName=ggerganov/whisper.cpp packageName=https://github.com/ggerganov/whisper.cpp gitRef=master versioning=loose type=digest
ARG WHISPER_CPP_SHA=5caa19240d55bfd6ee316d50fbad32c6e9c39528

RUN dnf config-manager --add-repo https://mirror.stream.centos.org/9-stream/AppStream/$(uname -m)/os/
RUN dnf config-manager --add-repo https://mirror.stream.centos.org/9-stream/BaseOS/$(uname -m)/os/
RUN curl -o /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-Official http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-Official
RUN rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-Official

RUN dnf install -y vulkan-headers vulkan-loader-devel vulkan-tools glslc \
      glslang epel-release && \
    dnf install -y spirv-headers-devel && \
    dnf copr enable -y slp/mesa-krunkit epel-9-$(uname -m) && \
    dnf install -y mesa-vulkan-drivers-24.1.5-101 && \
    dnf clean all && \
    rm -rf /var/cache/*dnf*

ENV GGML_CCACHE=0

RUN git clone --recurse-submodules https://github.com/ggerganov/llama.cpp && \
    cd llama.cpp && \
    git reset --hard ${LLAMA_CPP_SHA} && \
    cmake -B build -DCMAKE_INSTALL_PREFIX:PATH=/usr -DGGML_KOMPUTE=1 -DGGML_CCACHE=0 && \
    cmake --build build --config Release -j $(nproc) && \
    cmake --install build && \
    cd / && \
    rm -rf llama.cpp

RUN git clone https://github.com/ggerganov/whisper.cpp.git && \
    cd whisper.cpp && \
    git reset --hard ${WHISPER_CPP_SHA} && \
    make -j $(nproc) && \
    mv main /usr/bin/whisper-main && \
    mv server /usr/bin/whisper-server && \
    cd / && \
    rm -rf whisper.cpp

@rhatdan (Member) commented Oct 5, 2024

Why are you rebuilding the whisper-server? Isn't the one in the parent layer the same?

Or do you have to compile it differently for kompute?

@rhatdan (Member) commented Oct 5, 2024

What is using the spirv-headers-devel package installed here?

Should it be used only during the build and removed before the image is finished?

@ericcurtin (Collaborator) commented Oct 6, 2024

So it would be more like this:

FROM quay.io/ramalama/ramalama:latest

RUN dnf config-manager --add-repo https://mirror.stream.centos.org/9-stream/AppStream/$(uname -m)/os/
RUN dnf config-manager --add-repo https://mirror.stream.centos.org/9-stream/BaseOS/$(uname -m)/os/
RUN curl -o /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-Official http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-Official
RUN rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-Official

RUN dnf install -y vulkan-headers vulkan-loader-devel vulkan-tools glslc \
      glslang epel-release && \
    dnf install -y spirv-headers-devel && \
    dnf copr enable -y slp/mesa-krunkit epel-9-$(uname -m) && \
    dnf install -y mesa-vulkan-drivers-24.1.5-101 && \
    dnf clean all && \
    rm -rf /var/cache/*dnf*

RUN git clone --recurse-submodules https://github.com/ggerganov/llama.cpp && \
    cd llama.cpp && \
    git reset --hard ${LLAMA_CPP_SHA} && \
    cmake -B build -DCMAKE_INSTALL_PREFIX:PATH=/usr -DGGML_KOMPUTE=1 -DGGML_CCACHE=0 && \
    cmake --build build --config Release -j $(nproc) && \
    cmake --install build && \
    cd / && \
    rm -rf llama.cpp

But I guess there may be other reasons why @slp cannot build the latest variant for EPEL 9... I notice only an older version is built for EPEL 9 aarch64...

If RHEL 9/RHEL 10 won't be suitable for this kind of thing, it would be nice to document somewhere which packages are missing in both RHEL 9 and RHEL 10, and whether they are in EPEL, etc.

@ericcurtin (Collaborator):

Do RHEL/RHEL AI customers have access to full-RHEL containers, not just the subset of packages in UBI? This is something else I'm curious about...

@rhatdan (Member) commented Oct 7, 2024

Yes they have full access to RHEL content.

Install Vulkan dependencies in the container and build llama.cpp
with GGML_KOMPUTE=1 to enable the kompute backend. This enables
users with a Vulkan-capable GPU to optionally offload part of the
workload to it.

Signed-off-by: Sergio Lopez <slp@redhat.com>
@slp (Collaborator, Author) commented Oct 9, 2024

v2:

  • Extend the existing Containerfile instead of building a new one.
  • Use ubi9 instead of Fedora 40.
  • Add a CLI option to request GPU offloading (a minimal sketch of such a flag follows below).
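
As an aside for readers skimming the thread, here is a minimal sketch of how such a flag could be wired up with argparse. This is illustrative only; the real option is defined in ramalama's CLI code, which is not shown in this thread:

import argparse

parser = argparse.ArgumentParser(prog="ramalama")
# Hypothetical definition; the real flag lives in ramalama's CLI module.
parser.add_argument("--gpu", action="store_true",
                    help="offload the workload to the GPU when possible")
args = parser.parse_args(["--gpu"])
print(args.gpu)  # True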

@slp marked this pull request as ready for review on October 9, 2024 at 14:12.
@@ -6,17 +6,15 @@ ARG HUGGINGFACE_HUB_VERSION=0.25.2
ARG OMLMD_VERSION=0.1.5
# renovate: datasource=github-releases depName=tqdm/tqdm extractVersion=^v(?<version>.*)
ARG TQDM_VERSION=4.66.5
-ARG LLAMA_CPP_SHA=70392f1f81470607ba3afef04aa56c9f65587664
+ARG LLAMA_CPP_SHA=2a24c8caa6d10a7263ca317fa7cb64f0edc72aae

Review comment (Collaborator):
This is a downgrade. Do we need to downgrade?

@@ -100,6 +99,18 @@ def run(self, args):
        if not args.ARGS:
            exec_args.append("-cnv")

        if args.gpu:
            if sys.platform == "darwin":

@ericcurtin (Collaborator) commented Oct 9, 2024:

We introduced -ngl 99 on macOS because Tim deBoer only saw full utilization of his GPU with -ngl 99.

But... I never saw this on my M3 Pro; it seemed the same regardless, so I suspect your change here is right.

ramalama/model.py: outdated review thread (resolved).

@ericcurtin (Collaborator) left a review comment:

I'm curious whether we need the llama.cpp downgrade, and it would be nice to see the total size difference between the cpuonly and a cpuonly+kompute container image.

@slp (Collaborator, Author) commented Oct 9, 2024

> I'm curious whether we need the llama.cpp downgrade, and it would be nice to see the total size difference between the cpuonly and a cpuonly+kompute container image.

Sadly, we do need the downgrade because the kompute backend is broken in master. This happens frequently with pretty much every backend except cpu. I want to take a look and see if I can fix it myself in master, but I can't make any promises.

As for the size, this is what I'm getting here:

localhost/ramalama-cpuonly           latest               56a6236affe0  About a minute ago  660 MB
localhost/ramalama-kompute           latest               c312022f66a5  About an hour ago   862 MB

@slp (Collaborator, Author) commented Oct 9, 2024

Should I add a commit renaming the directory? It feels weird keeping the name cpuonly.

@rhatdan (Member) commented Oct 9, 2024

Yes, rename it. I would prefer to keep one image for CPU and Kompute, given the limited size difference.

Add a "--gpu" that allows users to request the workload to be
offloaded to the GPU. This works natively on macOS using Metal and
in containers using Vulkan with llama.cpp's Kompute backend.

Signed-off-by: Sergio Lopez <slp@redhat.com>
To be able to properly expose the port outside the container, we
need to pass "--host 0.0.0.0" to llama.cpp.

Signed-off-by: Sergio Lopez <slp@redhat.com>
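
A minimal sketch of what that could look like when the server arguments are assembled; the exec_args name mirrors the diff fragments quoted in this thread, and the server binary name is an assumption, so treat this as illustrative rather than the exact patch:

# Hypothetical argument assembly for llama.cpp's HTTP server.
exec_args = ["llama-server", "--port", "8080", "-m", "/path/to/model.gguf"]
# llama.cpp's server binds to 127.0.0.1 by default, which is unreachable
# from outside the container; bind to all interfaces instead.
exec_args += ["--host", "0.0.0.0"]
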
@ericcurtin (Collaborator):

cpuonly + kompute (merged image) = generic

Just a suggestion...

Now that it also provides Vulkan support, the "cpuonly" name no
longer describes it properly. Let's rename it to "generic".

Signed-off-by: Sergio Lopez <slp@redhat.com>
@rhatdan (Member) commented Oct 9, 2024

I would rather name it ramalama or vulkan.
If ramalama, then others can build their vendor-specific images from it.

                # any additional arguments.
                pass
            elif sys.platform == "linux" and Path("/dev/dri").exists():
                if "q4_0.gguf" not in model_path.lower():

Review comment (Collaborator):

This is interesting... Should this condition apply to all GPU frameworks (CUDA, ROCm, and Vulkan/Kompute), or just Kompute?

Reply (Collaborator, Author):

Only for kompute, so we need to make model.py aware of the variant it's running.

Reply (Collaborator):

I'd rather we just removed the q4_0.gguf check... If the model fails, it fails...

Many models are not named like this; none of the Ollama ones are, for example...

Reply (Member):

I agree, let's remove the check.
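
To make the agreed direction concrete, here is a rough sketch of the GPU branch with the q4_0.gguf check dropped. The helper name is hypothetical, the structure follows the diff fragments quoted above, and the -ngl 99 value comes from the earlier discussion rather than the final patch:

import sys
from pathlib import Path

def gpu_args(gpu_requested: bool, exec_args: list) -> list:
    """Append GPU-offload flags to a llama.cpp argument list (illustrative)."""
    if gpu_requested:
        if sys.platform == "darwin":
            # Metal is used natively on macOS; no extra flags are needed here.
            pass
        elif sys.platform == "linux" and Path("/dev/dri").exists():
            # A render node is visible, so ask llama.cpp to offload layers
            # to the GPU via the Kompute/Vulkan backend.
            exec_args.extend(["-ngl", "99"])
    return exec_args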

@ericcurtin mentioned this pull request Oct 20, 2024.