
Improvements in the quantizer and dequantization kernel #1061

Merged 2 commits into main on May 2, 2024

Conversation

angeloskath (Member) commented:

This PR has two contributions that work together for what is hopefully better quantization performance across the board.

  1. We change the way we compute the scale and bias for each block as follows (a minimal sketch of the scheme follows this list).
    a. We set the bias to the min or max value, depending on which has the higher absolute value.
    b. We set the scale to go from min to max or from max to min, respectively.
    c. We adjust the scale to make sure that 0 is quantized exactly to 0.
  2. For the dequantization, since the scale is usually a float16, dividing it by 4096 destroys significant information and causes quantization errors. We change it so that it is divided by 16 instead. The same issue does not happen in qmv, where everything is float32.
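
To make point (1) concrete, here is a minimal NumPy sketch of the per-block scale/bias computation, assuming the usual affine convention where dequantization is `q * scale + bias`. This is an illustration only, not the actual MLX kernel code, and the helper names are hypothetical:

```python
import numpy as np

def quantize_block(w, bits=4):
    """Illustrative NumPy version of the scale/bias scheme in (1).

    Not the actual MLX kernel code; dequantization is assumed to be
    q * scale + bias.
    """
    n_levels = 2**bits - 1                       # e.g. 15 for 4-bit
    w_min, w_max = float(w.min()), float(w.max())

    # (a) Anchor the bias at whichever extreme has the larger magnitude;
    # (b) the scale then spans from that extreme to the other one.
    if abs(w_min) > abs(w_max):
        bias, scale = w_min, (w_max - w_min) / n_levels
    else:
        bias, scale = w_max, (w_min - w_max) / n_levels

    if scale == 0.0:
        scale = 1.0                              # constant block; any non-zero scale works

    # (c) Nudge the scale so that the real value 0 lands exactly on an
    # integer level, i.e. 0 quantizes (and dequantizes) to exactly 0.
    q_zero = round(-bias / scale)
    if q_zero != 0:
        scale = -bias / q_zero

    q = np.clip(np.round((w - bias) / scale), 0, n_levels).astype(np.uint8)
    return q, scale, bias

def dequantize_block(q, scale, bias):
    return q.astype(np.float32) * scale + bias
```

The key point is step (c): after anchoring the bias at the extreme with the larger magnitude, the scale is nudged so that 0 round-trips through quantization to exactly 0.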

Quantization performance

This is the quantization performance on the Wikitext-2 test set. The Q4_0 performance is computed by quantizing and dequantizing the weights in place with absmax quantization and a block size of 32.
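
For reference, a hedged sketch of how such a Q4_0-style baseline number can be obtained, i.e. an in-place absmax quantize/dequantize round-trip with block size 32. The exact rounding conventions of Q4_0 in llama.cpp may differ, and `absmax_roundtrip` is a hypothetical helper name:

```python
import numpy as np

def absmax_roundtrip(w, bits=4, block_size=32):
    """Quantize and dequantize `w` in place with plain absmax scaling.

    Sketch of how a Q4_0-style reference number can be produced; assumes
    w.size is divisible by block_size.
    """
    orig_shape = w.shape
    blocks = w.reshape(-1, block_size).astype(np.float32)

    # One scale per block, derived from the largest absolute value.
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / (2 ** (bits - 1) - 1), 1.0)

    q = np.clip(np.round(blocks / scale),
                -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (q * scale).reshape(orig_shape)
```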

[Figure: quant — quantization performance on the Wikitext-2 test set]

Regarding the block size discussion (which I cannot find now, @ivanfioravanti), I think 64 is a good compromise for a default, and 32 should be evaluated and used if the 64 performance is not adequate. Wdyt @awni and @jagrit06?

[Figure: quant-blocks — quantization performance for different block sizes]

Throughput

The kernel change causes no performance degradation whatsoever.

Before

$ python benchmarks/python/comparative/bench_mlx.py quant_matmul_t_64_4 --size 4096x4096 --size 4096x512 --size 4096x64 --size 4096x64 --dtype float16 --dtype uint32 --dtype float16 --dtype float16
6.557293891906738
$ python -m mlx_lm.lora --model mlx-community/NeuralBeagle14-7B-4bit-mlx --train --data ../../lora/data/
...
...
Iter 1: Val loss 2.866, Val took 8.981s
Iter 10: Train loss 2.323, Learning Rate 1.000e-05, It/sec 1.882, Tokens/sec 752.791, Trained Tokens 3999, Peak mem 6.265 GB
Iter 20: Train loss 1.691, Learning Rate 1.000e-05, It/sec 1.732, Tokens/sec 698.554, Trained Tokens 8032, Peak mem 6.265 GB

After

$ python benchmarks/python/comparative/bench_mlx.py quant_matmul_t_64_4 --size 4096x4096 --size 4096x512 --size 4096x64 --size 4096x64 --dtype float16 --dtype uint32 --dtype float16 --dtype float16
6.5276172161102295
$ python -m mlx_lm.lora --model mlx-community/NeuralBeagle14-7B-4bit-mlx --train --data ../../lora/data/
...
...
Iter 1: Val loss 2.834, Val took 8.946s
Iter 10: Train loss 2.334, Learning Rate 1.000e-05, It/sec 1.880, Tokens/sec 751.839, Trained Tokens 3999, Peak mem 6.265 GB
Iter 20: Train loss 1.699, Learning Rate 1.000e-05, It/sec 1.741, Tokens/sec 702.182, Trained Tokens 8032, Peak mem 6.265 GB


@jagrit06 (Member) left a comment:


💯


@awni (Member) left a comment:


This is awesome.

@angeloskath merged commit 17f57df into main on May 2, 2024
3 checks passed
@angeloskath deleted the quantize branch on May 2, 2024 at 01:19