[KV cache] Base, generic KV Cache injection support #1559
Conversation
Status so far: confirmed the 'generic' pattern matching works for OPT; next step is adding the injection.
@dbogunowicz initial implementation of KV cache concats + OPT cache length adjustment completed. For some reason the exporter I wrote was cleared in my environment; will get that up at a later date.

```python
import onnx
from sparseml.exporters.transforms.kv_cache import *

model = onnx.load("/home/benjamin/tmp-models/small_decoder_opt.onnx", load_external_data=False)
model = CacheKeysAndValues().transform(model)
model = OPTCacheLengthAdjustment().transform(model)
```
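For intuition, the core effect of the injected cache ops is a Concat of the cached keys/values with the newly computed ones along the sequence axis. A minimal NumPy sketch — the shapes and names here are illustrative assumptions, not the exporter's actual tensors:

```python
import numpy as np

# Hypothetical layout (batch, num_heads, seq_len, head_dim); the real graph's
# layout may differ - this only shows the concat idea behind the injection.
past_key = np.zeros((1, 4, 5, 64), dtype=np.float32)  # cached keys from prior steps
new_key = np.ones((1, 4, 1, 64), dtype=np.float32)    # key computed for the current token

# The injected Concat appends the new key to the cache along the sequence axis
present_key = np.concatenate([past_key, new_key], axis=2)
print(present_key.shape)  # (1, 4, 6, 64)
```

The `present_key` tensor then feeds the attention MatMul and is also exposed as a graph output so it can be passed back in as `past_key` on the next step.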
@dbogunowicz quick comments, pushing up adjustment to reshape
```python
# no great way to generically infer this from the graph since transposes can
# be used to place it on either side of the matmul
# hardcoding for now, will update to have a hardcoded value for each model type
_KEY_NODE_INPUT_IDX = 0
```
why did we take this out?
for now it can be hard-coded as an argument to the function - no changes between CodeGen and OPT
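A minimal sketch of that approach, assuming the index is simply threaded through as a parameter (the `Node` stand-in and helper name are hypothetical, not the actual sparseml API):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-in for an ONNX node; the real transform operates
# on onnx.NodeProto objects inside the sparseml exporter.
@dataclass
class Node:
    op_type: str
    inputs: List[str] = field(default_factory=list)

def key_cache_input(node: Node, cache_input_idx: int = 0) -> str:
    """Return the MatMul input that should receive the concatenated key cache.
    cache_input_idx is hard-coded per call site; 0 covers both CodeGen and OPT."""
    return node.inputs[cache_input_idx]

matmul = Node("MatMul", inputs=["key_transposed", "query"])
print(key_cache_input(matmul))  # key_transposed
```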
This looks great. Building KV cache injection in ONNX feels like the ML equivalent of Chris Sawyer writing RollerCoaster Tycoon in assembly, but this was clean and easy to follow. Left a couple of non-blocking comments.
""" | ||
graph = ONNXGraph(model) | ||
|
||
if node.op_type == "QLinearMatMul" and cache_input_idx == 1: |
Is there a reason to check if `cache_input_idx == 1` here instead of setting it to 3 regardless?
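For context, in the ONNX spec `QLinearMatMul` takes inputs (`a`, `a_scale`, `a_zero_point`, `b`, `b_scale`, `b_zero_point`, `y_scale`, `y_zero_point`), so the second matrix operand sits at index 3 rather than index 1. A sketch of the remapping being discussed — the function name is hypothetical, not the code under review:

```python
def remap_cache_input_idx(op_type: str, cache_input_idx: int) -> int:
    """Map a plain-MatMul input index to its QLinearMatMul position.
    QLinearMatMul inputs are (a, a_scale, a_zero_point, b, b_scale,
    b_zero_point, y_scale, y_zero_point), so operand "b" - index 1 on a
    plain MatMul - lives at index 3."""
    if op_type == "QLinearMatMul" and cache_input_idx == 1:
        return 3
    return cache_input_idx

print(remap_cache_input_idx("QLinearMatMul", 1))  # 3
print(remap_cache_input_idx("MatMul", 1))         # 1
```

Presumably the `cache_input_idx == 1` guard keeps index 0 (operand `a`, which already maps to index 0 on `QLinearMatMul`) untouched, which would explain why it is not set to 3 unconditionally.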
* Update __init__.py
* [KV cache] Base, generic KV Cache injection support (#1559)
  * [WIP] Inject core cache ops - pattern matching + base export
  * complete initial implementation of CacheKeysAndValues
  * documentation + suggestions from Damian
  * Cache length adjustment - ABC + OPT impl
  * typo
  * little cleanup, more importantly, started testing
  * stuck on testing
  * move KV cache concat to before transpose where applicable
  * working for dynamic seq len
  * add support for slicing cache by actual length
  * add position_embeddings_adjustment
  * quantized model support
  * Support Q/DQ folding of Parameterized matmuls w/o bias add
  * delete Exporter for KV cache for now - since checker doesn't pass, will do transforms ad-hoc
  * quality
  * refactor
  * fix docstrings
  * hardening the validation
  * validator not needed

  Co-authored-by: Damian <damian@neuralmagic.com>
  Co-authored-by: dbogunowicz <97082108+dbogunowicz@users.noreply.github.com>
* [KV cache] Properly set the static dimensions of the kv cache inputs/outputs (#1573)
  * [WIP] Inject core cache ops - pattern matching + base export
  * complete initial implementation of CacheKeysAndValues
  * documentation + suggestions from Damian
  * Cache length adjustment - ABC + OPT impl
  * typo
  * little cleanup, more importantly, started testing
  * stuck on testing
  * move KV cache concat to before transpose where applicable
  * working for dynamic seq len
  * add support for slicing cache by actual length
  * add position_embeddings_adjustment
  * quantized model support
  * Support Q/DQ folding of Parameterized matmuls w/o bias add
  * delete Exporter for KV cache for now - since checker doesn't pass, will do transforms ad-hoc
  * quality
  * refactor
  * initial commit
  * fix docstrings
  * hardening the validation
  * validator not needed
  * adressing PR comments

  Co-authored-by: Benjamin <ben@neuralmagic.com>
  Co-authored-by: bogunowicz@arrival.com <bogunowicz@arrival.com>
* [KV cache] Input/output KV cache to include `batch` dimension. (#1589)
  * [WIP] Inject core cache ops - pattern matching + base export
  * complete initial implementation of CacheKeysAndValues
  * documentation + suggestions from Damian
  * Cache length adjustment - ABC + OPT impl
  * typo
  * little cleanup, more importantly, started testing
  * stuck on testing
  * move KV cache concat to before transpose where applicable
  * working for dynamic seq len
  * add support for slicing cache by actual length
  * add position_embeddings_adjustment
  * quantized model support
  * Support Q/DQ folding of Parameterized matmuls w/o bias add
  * delete Exporter for KV cache for now - since checker doesn't pass, will do transforms ad-hoc
  * quality
  * refactor
  * initial commit
  * fix docstrings
  * initial commit
  * hardening the validation
  * validator not needed
  * tested with deepsparse
  * ready for reviews
  * adressing PR comments
  * addressing PR comments
  * ready to land

  Co-authored-by: Benjamin <ben@neuralmagic.com>
  Co-authored-by: bogunowicz@arrival.com <bogunowicz@arrival.com>
* remove changes

Co-authored-by: Konstantin Gulin <66528950+KSGulin@users.noreply.github.com>
Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
Co-authored-by: Benjamin <ben@neuralmagic.com>
Co-authored-by: bogunowicz@arrival.com <bogunowicz@arrival.com>
Feature Preview:
The implementation above guarantees full "onnx checker safety". However, to allow the user to run the transforms faster / in isolation, the following path is also enabled:
Note: this will raise multiple warnings, making the user conscious of the fact that models with `load_external_data=True` cannot be properly validated.

Additional changes:
Made `MatMulAddToMatMulIntegerAddCastMul` a bit more generic by making the Add portion optional - should look into renaming this transform...

Testing
This functionality has been tested multiple times for OPT models (dense / sparse / quantized).
With quantized models, we had observed that the KV cache export leads to the presence of "dead MatMuls". However, the current head of the branch does not produce "dead MatMuls" anymore - does that mean the problem has been solved along the way? Also, the quantized models run fine in the pipeline (tested with ORT).