[Fix] Fix the KV Cache insertion logic for quantized OPT #1648
Merged
Conversation
bfineran previously approved these changes on Jul 3, 2023
natuan previously approved these changes on Jul 5, 2023
dbogunowicz force-pushed the fix/damian/quantized_opt_cache branch from e9fd0ff to 6fcd3f2 on July 14, 2023 at 16:12
dbogunowicz force-pushed the fix/damian/quantized_opt_cache branch from 3370f21 to de8ebf7 on July 14, 2023 at 16:14
bfineran reviewed on Jul 17, 2023
bfineran approved these changes on Jul 18, 2023
natuan approved these changes on Jul 19, 2023
Feature Description
Fixes the KV Cache logic for quantized text generation models.
In a nutshell, the QuantizeLinear nodes created during quantization were breaking our pattern-matching rules for finding the Key MatMul and the Value MatMul. Additionally, the QuantizeLinear nodes are by default inserted into the graph in a way that breaks the topology required for KV cache support. This fix:
- adapts the pattern matching for the Key MatMul and the Value MatMul to be robust against the presence of QuantizeLinear nodes (see the sketch after this list)
- moves the QuantizeLinear nodes to their appropriate place in the ONNX graph
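For illustration, here is a minimal sketch of how pattern matching can be made tolerant of quantization nodes: when looking up the producer of a MatMul input, skip over any QuantizeLinear/DequantizeLinear nodes instead of failing the match. The function names and the Transpose/Softmax anchors below are hypothetical and only stand in for the actual sparseml matching rules.

```python
# Hypothetical sketch (not the actual sparseml code): make a MatMul matcher
# tolerant of QuantizeLinear/DequantizeLinear nodes inserted by quantization.
import onnx

QDQ_OPS = {"QuantizeLinear", "DequantizeLinear"}


def producer_of(graph: onnx.GraphProto, tensor_name: str):
    """Return the node that produces `tensor_name`, or None for graph inputs/initializers."""
    for node in graph.node:
        if tensor_name in node.output:
            return node
    return None


def real_producer(graph: onnx.GraphProto, tensor_name: str):
    """Walk past any QuantizeLinear/DequantizeLinear nodes to the underlying producer."""
    node = producer_of(graph, tensor_name)
    while node is not None and node.op_type in QDQ_OPS:
        node = producer_of(graph, node.input[0])
    return node


def find_key_value_matmuls(graph: onnx.GraphProto):
    """Find MatMuls whose (de)quantization-stripped producers match an attention-like pattern."""
    matches = []
    for node in graph.node:
        if node.op_type not in {"MatMul", "MatMulInteger"}:
            continue
        lhs = real_producer(graph, node.input[0])
        rhs = real_producer(graph, node.input[1])
        # The real rules are more specific; Transpose/Softmax are only
        # illustrative anchors for the key and value branches.
        if lhs and rhs and ({lhs.op_type, rhs.op_type} & {"Transpose", "Softmax"}):
            matches.append(node.name)
    return matches
```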
Manual Testing
Note: For manual testing you need to use this branch: neuralmagic/deepsparse#1123. It provides the appropriate support for the quantized KV cache.
OPT
Tested with the model provided by @natuan:
nlg-text_generation/c4-opt_1.3b-pruned50_quant/sparsegpt@opt-1.3b@c4@opt1.3b.W8A8linear.A8A8O16matmul@SP0.5@SQ1@PTQ1@ID15513
by renaming deployment/model_kvcache.onnx to deployment/model.onnx and running it in the pipeline.
Note: This PR does not alter the behavior of KV cache injection for the non-quantized OPT model.
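A rough sketch of that manual test flow is shown below. It assumes the deepsparse text generation pipeline from the linked branch; the local path is a placeholder, and the task name and input keyword are assumptions that may differ from the actual pipeline API.

```python
# Rough sketch of the manual test flow; the deployment path is a placeholder
# and the deepsparse task name / input keyword are assumptions.
import shutil

from deepsparse import Pipeline

deployment_dir = "path/to/c4-opt_1.3b-pruned50_quant/deployment"  # hypothetical local path

# Use the KV-cache-injected ONNX as the model the pipeline loads.
shutil.copy(f"{deployment_dir}/model_kvcache.onnx", f"{deployment_dir}/model.onnx")

pipeline = Pipeline.create(task="text_generation", model_path=deployment_dir)
output = pipeline(sequences="Who is the president of the United States?")  # input field name assumed
print(output)
```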
CodeGen
Tested with the model provided by @shubhra:
codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/
by renaming training/model_kvcache.onnx to training/model.onnx and running it in the pipeline.
Note: This PR does not alter the behavior of KV cache injection for the non-quantized CodeGen model.
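For either model, a quick sanity check of the KV-cache-injected graph can complement the pipeline run. The sketch below uses the onnx package; the "past_key_values"/"present" prefixes are assumed naming conventions for the cache inputs and outputs, not names confirmed by this PR.

```python
# Minimal sanity check on the KV-cache-injected graph; the "past_key_values"
# and "present" prefixes are assumed cache tensor naming conventions.
import onnx

model = onnx.load("deployment/model_kvcache.onnx", load_external_data=False)
graph = model.graph

cache_inputs = [i.name for i in graph.input if "past_key_values" in i.name]
cache_outputs = [o.name for o in graph.output if "present" in o.name or "past_key_values" in o.name]
quant_nodes = [n.name for n in graph.node if n.op_type == "QuantizeLinear"]

print(f"cache inputs:          {len(cache_inputs)}")
print(f"cache outputs:         {len(cache_outputs)}")
print(f"QuantizeLinear nodes:  {len(quant_nodes)}")
assert cache_inputs and cache_outputs, "KV cache injection did not expose cache tensors"
```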