Improve the handling of quantized weights #2250

danieldk · 2024-07-18T14:43:08Z

What does this PR do?

Handling of quantized weights was split between two mechanisms:

- For quantized checkpoints, we used the new weight loader infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes), we instead relied on conditionals in `get_linear`.

Weight loaders support context managers that selectively load particular layers with a different weight loader. This is useful for models like Idefics2 AWQ, which uses a quantized text model but unquantized vision and connector models. However, the context manager would be overridden by `get_linear`, which string-checks `quantizer`. The context manager also did not work with EETQ, FP8, and bitsandbytes.
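The loader-swapping idea described above can be sketched as follows. This is a minimal, self-contained illustration and not the code from this PR: the `Weights`, `AWQLoader`, and `DefaultLoader` classes and the `use_loader` method are hypothetical stand-ins, and loaders return strings instead of tensors to keep the example runnable without any dependencies.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class DefaultLoader:
    """Hypothetical loader standing in for unquantized weight loading."""

    name: str = "unquantized"

    def load(self, prefix: str) -> str:
        # A real loader would return a typed weight; a string keeps this runnable.
        return f"{prefix}:{self.name}"


@dataclass
class AWQLoader:
    """Hypothetical loader standing in for a quantized checkpoint loader."""

    name: str = "awq"

    def load(self, prefix: str) -> str:
        return f"{prefix}:{self.name}"


class Weights:
    """Hypothetical weights container that delegates to its current loader."""

    def __init__(self, loader):
        self.loader = loader

    @contextmanager
    def use_loader(self, loader):
        # Temporarily swap the loader and restore the previous one on exit,
        # even if loading raises.
        previous = self.loader
        self.loader = loader
        try:
            yield
        finally:
            self.loader = previous

    def get(self, prefix: str) -> str:
        return self.loader.load(prefix)


# The text model uses the quantized loader; the vision model is loaded
# unquantized inside the context manager.
weights = Weights(AWQLoader())
text = weights.get("text_model.layers.0")
with weights.use_loader(DefaultLoader()):
    vision = weights.get("vision_model.layers.0")
after = weights.get("text_model.layers.1")
```

Because the loader alone decides what kind of weight comes back, nothing downstream has to re-check a quantizer string, so the override inside the `with` block cannot be silently undone.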

This change migrates all quantizers to the weight loader infrastructure. This has several benefits:

- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers; `get_linear` does not need to know how to handle quantizer linear layers.
- All quantizer weights are strongly typed; we don't pass around raw tensors.
- We don't have to pass around the `quantizer` string everywhere.
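The benefits listed above can be illustrated with a small sketch in which each weight type knows how to build its own linear layer. This is an assumption-laden illustration rather than the PR's implementation: `Weight`, `UnquantizedWeight`, `Int8Weight`, and `Linear` are hypothetical names, and plain Python lists stand in for tensors.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Linear:
    """Toy linear layer computing a dot product plus bias (single output)."""

    values: List[float]
    bias: float

    def __call__(self, x: List[float]) -> float:
        return sum(w * xi for w, xi in zip(self.values, x)) + self.bias


class Weight(ABC):
    """Hypothetical base class: every weight type builds its own layer."""

    @abstractmethod
    def get_linear(self, bias: float) -> Linear: ...


@dataclass
class UnquantizedWeight(Weight):
    values: List[float]

    def get_linear(self, bias: float) -> Linear:
        return Linear(self.values, bias)


@dataclass
class Int8Weight(Weight):
    """Hypothetical quantized weight: integer values plus one scale."""

    qvalues: List[int]
    scale: float

    def get_linear(self, bias: float) -> Linear:
        # Dequantize eagerly for simplicity; a real layer would keep the
        # quantized representation and use a dedicated kernel.
        return Linear([q * self.scale for q in self.qvalues], bias)


def get_linear(weight: Weight, bias: float = 0.0) -> Linear:
    # No quantizer string check: dispatch happens through the weight's type.
    return weight.get_linear(bias)


dense = get_linear(UnquantizedWeight([1.0, 2.0]))
quant = get_linear(Int8Weight([2, 4], scale=0.5))
dense_out = dense([3.0, 4.0])
quant_out = quant([3.0, 4.0])
```

With this shape, adding a new quantizer means adding a new weight type, while `get_linear` itself stays unchanged.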

OlivierDehaene · 2024-07-18T14:48:46Z

integration-tests/models/__snapshots__/test_chat_llama/test_flash_llama_simple.json

@@ -5,19 +5,19 @@
      "index": 0,
      "logprobs": null,
      "message": {
-        "content": "As of your last question, the weather in Brooklyn, New York, is typically hot and humid throughout the year. The suburbs around New York City are jealously sheltered, and at least in the Lower Bronx, there are very few outdoor environments to explore in the middle of urban confines. In fact, typical times for humidity levels in Brooklyn include:\n\n- Early morning: 80-85% humidity, with occas",
+        "content": "As of your last question, the weather in Brooklyn, New York, is typically moderate to warm year-round. The suburban areas around the borough are jealously sheltered from the Northeastern United States' harsh wind and rain systems. In fact, Brooklyn vs the urban confines of Manhattan or Staten Island is energized by an idyllic summer that often sees crisp East Coast air embraced by the mild atmosphere, year after",


Do you know why this changed?

OlivierDehaene

Nice!

* Improve the handling of quantized weights
* Exclude non-MLP layers when using FP8 quantization with Llama

danieldk force-pushed the refactor/quantization-handling branch from 5bbbce9 to e22f411 Compare July 18, 2024 14:50

OlivierDehaene previously approved these changes Jul 18, 2024

OlivierDehaene self-requested a review July 18, 2024 14:57

danieldk dismissed OlivierDehaene’s stale review via 8ebec90 July 18, 2024 15:17

danieldk force-pushed the refactor/quantization-handling branch from e22f411 to 8ebec90 Compare July 18, 2024 15:17

OlivierDehaene reviewed Jul 18, 2024

server/text_generation_server/models/custom_modeling/flash_llama_modeling.py (resolved)

OlivierDehaene reviewed Jul 18, 2024

server/text_generation_server/layers/fp8.py (outdated, resolved)

danieldk force-pushed the refactor/quantization-handling branch 2 times, most recently from 59fc128 to d819a3c Compare July 18, 2024 15:25

Exclude non-MLP layers when using FP8 quantization with Llama

cf16172

danieldk force-pushed the refactor/quantization-handling branch from d819a3c to cf16172 Compare July 18, 2024 16:04

OlivierDehaene approved these changes Jul 18, 2024

danieldk merged commit ba291da into main Jul 19, 2024
9 checks passed

danieldk deleted the refactor/quantization-handling branch July 19, 2024 07:37

danieldk mentioned this pull request Jul 19, 2024

feat(fp8): use fbgemm kernels and load fp8 weights directly #2248

Merged
