Add support for Marlin-quantized models #2014
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This change adds support for Marlin-quantized models. Marlin is an FP16xINT4 matmul kernel, which provides good speedups when decoding batches of 16-32 tokens. It supports quantized models with symmetric quantization, groupsize -1 or 128, and 4-bit weights. Tested with:
- Llama 2
- Llama 3
- Phi 3
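To make those constraints concrete, here is a minimal numpy sketch of symmetric 4-bit group quantization (groupsize 128, or -1 for a single group spanning the input dimension), the weight format Marlin consumes. It only illustrates the scheme; it is not the Marlin kernel or the checkpoint packing used by this PR, and all names are illustrative.

# Illustrative sketch of symmetric 4-bit group quantization, the format
# Marlin-compatible checkpoints use: one scale per (group, column),
# zero-point fixed at 0, weights stored as signed 4-bit integers.
import numpy as np

def quantize_symmetric_int4(weight, groupsize=128):
    rows, cols = weight.shape
    if groupsize == -1:  # one group spanning the whole input dimension
        groupsize = rows
    w = weight.reshape(rows // groupsize, groupsize, cols)
    # Symmetric quantization: scale only, no zero-point; int4 range is [-8, 7].
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q.reshape(rows, cols), scales.squeeze(1)

def dequantize_symmetric_int4(q, scales, groupsize=128):
    rows, cols = q.shape
    if groupsize == -1:
        groupsize = rows
    qg = q.reshape(rows // groupsize, groupsize, cols).astype(np.float32)
    return (qg * scales[:, None, :]).reshape(rows, cols)

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_symmetric_int4(w)
w_hat = dequantize_symmetric_int4(q, s)
print("mean abs error:", np.abs(w - w_hat).mean())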
I know it's not ready, but it looks very solid.
Un-drafted. I was checking if the build worked, but it seems that there is currently only an unrelated error in the Intel build.
LGTM
@pytest.fixture(scope="module")
def flash_llama_marlin_handle(launcher):
    with launcher(
        "neuralmagic/llama-2-7b-chat-marlin", num_shard=2, quantize="marlin"
    ) as handle:
        yield handle
Why use llama2 instead of llama3 (or other newer models)?
Mostly because it was the only neuralmagic Llama model; I'll find something else for the next PR.
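For context on how this fixture would be exercised, here is a minimal sketch of a matching integration test. The health()/client/generate() API and the response_snapshot fixture are assumptions based on the existing TGI integration-test suite and are not part of this diff.

# Sketch only: assumes the handle exposes health() and a generate client,
# as in the surrounding TGI integration tests.
import pytest

@pytest.fixture(scope="module")
async def flash_llama_marlin(flash_llama_marlin_handle):
    # Wait for the launched server to become healthy, then expose its client.
    await flash_llama_marlin_handle.health(300)
    return flash_llama_marlin_handle.client

@pytest.mark.asyncio
async def test_flash_llama_marlin(flash_llama_marlin, response_snapshot):
    response = await flash_llama_marlin.generate(
        "Test request", max_new_tokens=10, decoder_input_details=True
    )
    assert response.details.generated_tokens == 10
    assert response == response_snapshot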
marlin_commit := 2f6d7c10e124b3c5fa29ff8d77d568bd7af3274c

marlin:
	# Clone marlin
	pip install packaging
	git clone https://github.com/IST-DASLab/marlin.git marlin

build-marlin: marlin
	cd marlin && git fetch && git checkout $(marlin_commit)
	cd marlin && python setup.py build

install-marlin: build-marlin
	cd marlin && python setup.py install
There's a new format in flash-attention and vllm which tries to be slightly smarter (it skips rebuilding when already built and doesn't fail when the directory already exists).
Ok, I'll follow the same format.
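For reference, a sketch of the kind of guarded target being described (clone only when the checkout is missing, so re-running the target neither fails nor re-clones); the recipe below is illustrative, not the final Makefile.

# Illustrative only: skip the clone when the directory already exists, so the
# target can be re-run without failing or rebuilding from scratch.
build-marlin:
	if [ ! -d 'marlin' ]; then \
		pip install packaging && \
		git clone https://github.com/IST-DASLab/marlin.git marlin; \
	fi
	cd marlin && git fetch && git checkout $(marlin_commit) && python setup.py build

install-marlin: build-marlin
	cd marlin && python setup.py install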
What does this PR do?
This PR adds support for Marlin-quantized models, as described in the summary above. Opened as a draft to check whether all tests still pass.
Fixes # (issue)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.