
[Fix] Fix the KV Cache insertion logic for quantized OPT #1648

Merged: 13 commits merged into main from fix/damian/quantized_opt_cache on Jul 19, 2023

Conversation

@dbogunowicz (Contributor) commented Jul 3, 2023

Feature Description

Fixes the KV Cache injection logic for quantized text generation models.

In a nutshell, the QuantizeLinear nodes created during quantization were breaking our pattern-matching rules for finding the Key MatMul and Value MatMul. Additionally, QuantizeLinear nodes are by default inserted into the graph in a way that breaks the topology required for KV cache support.

This fix:

  • updates the pattern matching for finding the Key MatMul and Value MatMul so that it is robust to the presence of QuantizeLinear nodes
  • moves the QuantizeLinear nodes to their appropriate place in the ONNX graph (a sketch of the first idea follows this list)
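For intuition only, here is a minimal sketch of the pattern-matching idea, assuming a generic ONNX graph walk; `TRANSPARENT_OPS` and `walk_past_quant` are hypothetical names, not the actual sparseml helpers:

```python
# Hypothetical illustration, not the sparseml implementation.
import onnx

# Quantization ops to treat as pass-through while pattern matching (assumption).
TRANSPARENT_OPS = {"QuantizeLinear", "DequantizeLinear"}

def walk_past_quant(graph: onnx.GraphProto, node: onnx.NodeProto) -> onnx.NodeProto:
    """Follow a node's first input upward, skipping quantization nodes, so a
    Key/Value MatMul pattern matches whether or not the graph is quantized."""
    while node.op_type in TRANSPARENT_OPS:
        parents = [n for n in graph.node if node.input and node.input[0] in n.output]
        if not parents:
            break
        node = parents[0]
    return node
```

With a helper like this, the same matching rule that finds the Key and Value MatMuls in the float graph also finds them in the quantized graph, since the interleaved QuantizeLinear/DequantizeLinear nodes are stepped over rather than treated as a pattern mismatch.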

Manual Testing

Note: Manual testing requires the branch from neuralmagic/deepsparse#1123, which provides the necessary support for the quantized KV cache.

OPT

Tested with the model provided by @natuan: nlg-text_generation/c4-opt_1.3b-pruned50_quant/sparsegpt@opt-1.3b@c4@opt1.3b.W8A8linear.A8A8O16matmul@SP0.5@SQ1@PTQ1@ID15513

  1. Injecting the KV cache:
python kv_cache_injector.py --input-file /home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/sparsegpt@opt-1.3b@c4@opt1.3b.W8A8linear.A8A8O16matmul@SP0.5@SQ1@PTQ1@ID15513/deployment/model.onnx --output-file /home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/sparsegpt@opt-1.3b@c4@opt1.3b.W8A8linear.A8A8O16matmul@SP0.5@SQ1@PTQ1@ID15513/deployment/model_kvcache.onnx
2023-07-18 09:58:01 sparseml.exporters.transforms.kv_cache.configs INFO     Loaded config file /home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/sparsegpt@opt-1.3b@c4@opt1.3b.W8A8linear.A8A8O16matmul@SP0.5@SQ1@PTQ1@ID15513/deployment/config.json for model: opt
2023-07-18 09:58:01 sparseml.exporters.transforms.kv_cache.configs INFO     Properly configured arguments for KV Cache Transformation
Attempting to validate an in-memory ONNX model that has been loaded without external data. This is currently not supported by the ONNX checker. The validation will be skipped.
2023-07-18 09:58:03 sparseml.exporters.transforms.onnx_transform INFO     [CacheKeysAndValues] Transformed 48 matches
Attempting to validate an in-memory ONNX model that has been loaded without external data. This is currently not supported by the ONNX checker. The validation will be skipped.
Attempting to validate an in-memory ONNX model that has been loaded without external data. This is currently not supported by the ONNX checker. The validation will be skipped.
2023-07-18 09:58:04 sparseml.exporters.transforms.onnx_transform INFO     [PositionsAdjustmentOPT] Transformed 3 matches
Attempting to validate an in-memory ONNX model that has been loaded without external data. This is currently not supported by the ONNX checker. The validation will be skipped.
Modified model saved to: /home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/sparsegpt@opt-1.3b@c4@opt1.3b.W8A8linear.A8A8O16matmul@SP0.5@SQ1@PTQ1@ID15513/deployment/model_kvcache.onnx
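As a quick sanity check (not part of the PR; plain onnx API, with the long model path abbreviated), one can list the transformed model's inputs and outputs to confirm the cache entries were added:

```python
import onnx

# Point this at the deployment/model_kvcache.onnx written by the injector above.
model = onnx.load("deployment/model_kvcache.onnx", load_external_data=False)
print([i.name for i in model.graph.input])   # expect past-cache inputs alongside the usual ones
print([o.name for o in model.graph.output])  # expect the matching present-cache outputs
```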
  2. Renaming deployment/model_kvcache.onnx to deployment/model.onnx and running it in the pipeline:
opt = Pipeline.create(task="opt", model_path="/home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/sparsegpt@opt-1.3b@c4@opt1.3b.W8A8linear.A 8A8O16matmul@SP0.5@SQ1@PTQ1@ID15513/deployment", engine_type=engine_type, use_deepsparse_cache = False, max_generated_tokens=32)

out = opt(sequences="Who is the president of the United States?")
print(out.sequences[0])
Who is the president of the United States?

Who is the president of the United States?

Who is the president of the United States

Note: This PR does not alter the behavior of KV cache injection for the non-quantized OPT model.

CodeGen

Tested with the model provided by @shubhra: codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/

  1. Injecting the KV cache:
python kv_cache_injector.py --input-file /home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/model.onnx --output-file /home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/model_kvcache.onnx
2023-07-18 11:22:34 sparseml.exporters.transforms.kv_cache.configs INFO     Loaded config file /home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/config.json for model: codegen
2023-07-18 11:22:34 sparseml.exporters.transforms.kv_cache.configs INFO     Properly configured arguments for KV Cache Transformation
2023-07-18 11:22:35 sparseml.exporters.transforms.onnx_transform INFO     [CacheKeysAndValues] Transformed 40 matches
2023-07-18 11:22:38 sparseml.exporters.transforms.onnx_transform INFO     [PositionsAdjustmentCodeGen] Transformed 3 matches
Modified model saved to: /home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/model_kvcache.onnx
  2. Renaming training/model_kvcache.onnx to training/model.onnx and running it in the pipeline:
opt = Pipeline.create(task="codegen", model_path="/home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training", engine_type=engine_type, use_deepsparse_cache = False, max_generated_tokens=32)

out = opt(sequences="def hello_world():")
print(out.sequences[0])
print("Hello World")

hello_world()

# This is a comment

# This is a comment

# This is

Note: This PR does not alter the behavior of KV cache injection for the non-quantized CodeGen model.

bfineran previously approved these changes Jul 3, 2023
natuan previously approved these changes Jul 5, 2023
@dbogunowicz dismissed stale reviews from natuan and bfineran via 87d03f9 on July 10, 2023
@dbogunowicz force-pushed the fix/damian/quantized_opt_cache branch from e9fd0ff to 6fcd3f2 on July 14, 2023
@dbogunowicz force-pushed the fix/damian/quantized_opt_cache branch from 3370f21 to de8ebf7 on July 14, 2023
@natuan merged commit 6782b03 into main on Jul 19, 2023 (10 checks passed)
@natuan deleted the fix/damian/quantized_opt_cache branch on July 19, 2023