
[Fix] Fix the KV Cache insertion logic for quantized OPT #1648

Merged: 13 commits merged into main from fix/damian/quantized_opt_cache on Jul 19, 2023

Conversation

@dbogunowicz (Contributor) commented Jul 3, 2023

Feature Description

Fixes the KV Cache injection logic for quantized text generation models.

In a nutshell, the QuantizeLinear nodes created during quantization were breaking our pattern-matching rules for finding the Key MatMul and Value MatMul. Additionally, QuantizeLinear nodes are by default inserted into the graph in a way that breaks the topology required for KV cache support.

This fix:

  • updates the pattern matching for finding the Key MatMul and Value MatMul so that it is robust to the presence of QuantizeLinear nodes
  • moves the QuantizeLinear nodes to their appropriate place in the ONNX graph (a sketch of the first idea follows this list)
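For intuition only, here is a minimal sketch of the pattern-matching idea, assuming a generic ONNX graph walk; `TRANSPARENT_OPS` and `walk_past_quant` are hypothetical names, not the actual sparseml helpers:

```python
# Hypothetical illustration, not the sparseml implementation.
import onnx

# Quantization ops to treat as pass-through while pattern matching (assumption).
TRANSPARENT_OPS = {"QuantizeLinear", "DequantizeLinear"}

def walk_past_quant(graph: onnx.GraphProto, node: onnx.NodeProto) -> onnx.NodeProto:
    """Follow a node's first input upward, skipping quantization nodes, so a
    Key/Value MatMul pattern matches whether or not the graph is quantized."""
    while node.op_type in TRANSPARENT_OPS:
        parents = [n for n in graph.node if node.input and node.input[0] in n.output]
        if not parents:
            break
        node = parents[0]
    return node
```

With a helper like this, the same matching rule that finds the Key and Value MatMuls in the float graph also finds them in the quantized graph, since the interleaved QuantizeLinear/DequantizeLinear nodes are stepped over rather than treated as a pattern mismatch.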

Manual Testing

Note: Manual testing requires the branch from neuralmagic/deepsparse#1123, which provides the necessary support for the quantized KV cache.

OPT

Tested with the model provided by @natuan: nlg-text_generation/c4-opt_1.3b-pruned50_quant/sparsegpt@opt-1.3b@c4@opt1.3b.W8A8linear.A8A8O16matmul@SP0.5@SQ1@PTQ1@ID15513

  1. Injecting the KV cache:
python kv_cache_injector.py --input-file /home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/sparsegpt@opt-1.3b@c4@opt1.3b.W8A8linear.A8A8O16matmul@SP0.5@SQ1@PTQ1@ID15513/deployment/model.onnx --output-file /home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/sparsegpt@opt-1.3b@c4@opt1.3b.W8A8linear.A8A8O16matmul@SP0.5@SQ1@PTQ1@ID15513/deployment/model_kvcache.onnx
2023-07-18 09:58:01 sparseml.exporters.transforms.kv_cache.configs INFO     Loaded config file /home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/sparsegpt@opt-1.3b@c4@opt1.3b.W8A8linear.A8A8O16matmul@SP0.5@SQ1@PTQ1@ID15513/deployment/config.json for model: opt
2023-07-18 09:58:01 sparseml.exporters.transforms.kv_cache.configs INFO     Properly configured arguments for KV Cache Transformation
Attempting to validate an in-memory ONNX model that has been loaded without external data. This is currently not supported by the ONNX checker. The validation will be skipped.
2023-07-18 09:58:03 sparseml.exporters.transforms.onnx_transform INFO     [CacheKeysAndValues] Transformed 48 matches
Attempting to validate an in-memory ONNX model that has been loaded without external data. This is currently not supported by the ONNX checker. The validation will be skipped.
Attempting to validate an in-memory ONNX model that has been loaded without external data. This is currently not supported by the ONNX checker. The validation will be skipped.
2023-07-18 09:58:04 sparseml.exporters.transforms.onnx_transform INFO     [PositionsAdjustmentOPT] Transformed 3 matches
Attempting to validate an in-memory ONNX model that has been loaded without external data. This is currently not supported by the ONNX checker. The validation will be skipped.
Modified model saved to: /home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/sparsegpt@opt-1.3b@c4@opt1.3b.W8A8linear.A8A8O16matmul@SP0.5@SQ1@PTQ1@ID15513/deployment/model_kvcache.onnx
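As a quick sanity check (not part of the PR; plain onnx API, with the long model path abbreviated), one can list the transformed model's inputs and outputs to confirm the cache entries were added:

```python
import onnx

# Point this at the deployment/model_kvcache.onnx written by the injector above.
model = onnx.load("deployment/model_kvcache.onnx", load_external_data=False)
print([i.name for i in model.graph.input])   # expect past-cache inputs alongside the usual ones
print([o.name for o in model.graph.output])  # expect the matching present-cache outputs
```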
  2. Renaming deployment/model_kvcache.onnx to deployment/model.onnx and running it in the pipeline:
opt = Pipeline.create(task="opt", model_path="/home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/sparsegpt@opt-1.3b@c4@opt1.3b.W8A8linear.A 8A8O16matmul@SP0.5@SQ1@PTQ1@ID15513/deployment", engine_type=engine_type, use_deepsparse_cache = False, max_generated_tokens=32)

out = opt(sequences="Who is the president of the United States?")
print(out.sequences[0])
Who is the president of the United States?

Who is the president of the United States?

Who is the president of the United States

Note: This PR does not alter the behavior of KV cache injection for the non-quantized OPT model.

CodeGen

Tested with the model provided by @shubhra: codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/

  1. Injecting the KV cache:
python kv_cache_injector.py --input-file /home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/model.onnx --output-file /home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/model_kvcache.onnx
2023-07-18 11:22:34 sparseml.exporters.transforms.kv_cache.configs INFO     Loaded config file /home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/config.json for model: codegen
2023-07-18 11:22:34 sparseml.exporters.transforms.kv_cache.configs INFO     Properly configured arguments for KV Cache Transformation
2023-07-18 11:22:35 sparseml.exporters.transforms.onnx_transform INFO     [CacheKeysAndValues] Transformed 40 matches
2023-07-18 11:22:38 sparseml.exporters.transforms.onnx_transform INFO     [PositionsAdjustmentCodeGen] Transformed 3 matches
Modified model saved to: /home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/model_kvcache.onnx
  2. Renaming training/model_kvcache.onnx to training/model.onnx and running it in the pipeline:
opt = Pipeline.create(task="codegen", model_path="/home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training", engine_type=engine_type, use_deepsparse_cache = False, max_generated_tokens=32)

out = opt(sequences="def hello_world():")
print(out.sequences[0])
print("Hello World")

hello_world()

# This is a comment

# This is a comment

# This is

Note: This PR does not alter the behavior of KV cache injection for the non-quantized CodeGen model.

bfineran previously approved these changes Jul 3, 2023
natuan previously approved these changes Jul 5, 2023
@dbogunowicz dismissed stale reviews from natuan and bfineran via 87d03f9 on July 10, 2023
@dbogunowicz force-pushed the fix/damian/quantized_opt_cache branch from e9fd0ff to 6fcd3f2 on July 14, 2023
@dbogunowicz force-pushed the fix/damian/quantized_opt_cache branch from 3370f21 to de8ebf7 on July 14, 2023
@natuan merged commit 6782b03 into main on Jul 19, 2023 (10 checks passed)
@natuan deleted the fix/damian/quantized_opt_cache branch on July 19, 2023