
[KV cache] Base, generic KV Cache injection support #1559

Merged
merged 26 commits into feature/damian/fb_kv_cache from kv-cache-injection on Jun 12, 2023

Conversation

bfineran
Member

@bfineran bfineran commented May 11, 2023

Feature Preview:

import onnx
from sparseml.exporters.kv_cache_injector import KeyValueCacheInjector

model = onnx.load("deployment/model.onnx")
model = KeyValueCacheInjector(model_type="opt").apply(model)
onnx.save(model, "deployment/model_kvcache.onnx")

The implementation above guarantees full "onnx checker safety".
However, to let the user run the transform faster / in isolation, the following
path is also enabled:

import onnx
from sparseml.exporters.kv_cache_injector import KeyValueCacheInjector
model = onnx.load("deployment/model.onnx", load_external_data=False) # we operate only on the model graph
model = KeyValueCacheInjector(model_type="opt").apply(model)
onnx.save(model, "deployment/model_kvcache.onnx")

Note: this will raise multiple warnings, making the user aware that models loaded with load_external_data=False cannot be properly validated.

Additional changes:

  • makes MatMulAddToMatMulIntegerAddCastMul a bit more generic by making the Add portion optional; we should look into renaming this transform...
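The "optional Add" change can be sketched with a generic matcher (the helper and node representation below are hypothetical illustrations, not sparseml's actual pattern-matching API):

```python
# Hypothetical sketch: match a MatMul followed by an *optional* Add, so the
# same transform covers parameterized matmuls both with and without a bias.
def match_matmul_optional_add(nodes, idx):
    """Return (matmul, add_or_None) if nodes[idx] starts the pattern, else None."""
    matmul = nodes[idx]
    if matmul["op"] != "MatMul":
        return None
    nxt = nodes[idx + 1] if idx + 1 < len(nodes) else None
    # The Add is optional: absence of a trailing Add still yields a match.
    add = nxt if nxt is not None and nxt["op"] == "Add" else None
    return matmul, add

graph = [{"op": "MatMul"}, {"op": "Add"}, {"op": "MatMul"}, {"op": "Relu"}]
print(match_matmul_optional_add(graph, 0))  # MatMul with a bias Add
print(match_matmul_optional_add(graph, 2))  # MatMul without a bias Add
```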

Testing

This functionality has been tested multiple times on dense/sparse/quantized OPT models:

model = onnx.load("/network/damian/sparsegpt_webinar_fp32/sparsegpt_1.3b/model.onnx")
model = KeyValueCacheInjector(model_type="opt").apply(model)
onnx.save(model, "temp.onnx")
onnx.checker.check_model("temp.onnx")
2023-05-29 11:26:27 sparseml.exporters.transforms.onnx_transform INFO     [CacheKeysAndValues] Transformed 48 matches
2023-05-29 11:26:29 sparseml.exporters.transforms.onnx_transform INFO     [PositionEmbeddingsAdjustment] Transformed 5 matches

With quantized models, we have observed that the KV cache export leads to "dead MatMuls" in the exported graph:

(screenshot: dead MatMuls)

However, the current head of the branch no longer produces "dead MatMuls". Does that mean the problem has been solved along the way?

(screenshots: exported graph without dead MatMuls)

Also, the quantized models run fine in the pipeline (tested with ORT).

@bfineran bfineran requested a review from dbogunowicz May 11, 2023 22:57
@bfineran bfineran self-assigned this May 11, 2023
@bfineran
Member Author

Status so far: confirmed the 'generic' pattern matching works for OPT; next step is adding the injection.

@bfineran
Member Author

@dbogunowicz initial implementation of KV cache concats + OPT Cache length adjustment completed

for some reason the exporter I wrote was cleared in my environment, will get that up at a later date
Sample code to try:

import onnx
from sparseml.exporters.transforms.kv_cache import *
model = onnx.load("/home/benjamin/tmp-models/small_decoder_opt.onnx", load_external_data=False)
model = CacheKeysAndValues().transform(model)
model = OPTCacheLengthAdjustment().transform(model)
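Conceptually, the Concat nodes that CacheKeysAndValues injects append the current step's keys/values to the cached ones along the sequence axis. A plain-numpy illustration (shapes here are hypothetical, not taken from the actual graph):

```python
import numpy as np

batch, heads, past_len, head_dim = 1, 2, 7, 4

# Cache carried between decoding steps (a graph input after injection).
past_key = np.zeros((batch, heads, past_len, head_dim), dtype=np.float32)
# Key produced for the single new token at this step.
new_key = np.ones((batch, heads, 1, head_dim), dtype=np.float32)

# What the injected Concat does: append along the sequence-length axis.
present_key = np.concatenate([past_key, new_key], axis=2)
print(present_key.shape)  # (1, 2, 8, 4)
```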

Example Cached MatMul: (screenshot)

Example Cache Length adjustment: (screenshot)

Member Author

@bfineran bfineran left a comment


@dbogunowicz quick comments, pushing up adjustment to reshape

# no great way to generically infer this from the graph since transposes can
# be used to place it on either side of the matmul
# hardcoding for now, will update to have a hardcoded value for each model type
_KEY_NODE_INPUT_IDX = 0
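The ambiguity the comment describes comes from the transpose identity (Q·Kᵀ)ᵀ = K·Qᵀ: with Transpose nodes around it, the key tensor can legally sit on either side of the MatMul, so the input index cannot be inferred purely from graph structure. A numpy sketch of the equivalence (hypothetical shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal((3, 8))  # queries: (num_queries, head_dim)
k = rng.standard_normal((5, 8))  # keys:    (cache_len,  head_dim)

# Keys as the *second* MatMul input...
scores_a = q @ k.T
# ...or as the *first* input, with the transposes rearranged around it.
scores_b = (k @ q.T).T

print(np.allclose(scores_a, scores_b))  # True
```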
Member Author


why did we take this out?

Contributor


for now it can be hard-coded as an argument to the function - no changes between CodeGen and OPT

@dbogunowicz dbogunowicz changed the title [WIP] 'Generic' KV cache injection support 'Generic' KV cache injection support May 22, 2023
@dbogunowicz dbogunowicz marked this pull request as ready for review May 24, 2023 12:56
KSGulin
KSGulin previously approved these changes May 24, 2023
Contributor

@KSGulin KSGulin left a comment


This looks great. Building KV cache injection in ONNX feels like the ML equivalent of Chris Sawyer creating RollerCoaster Tycoon in assembly, but this was clean and easy to follow. Left a couple non-blocking comments

"""
graph = ONNXGraph(model)

if node.op_type == "QLinearMatMul" and cache_input_idx == 1:
Contributor


Is there a reason to check if cache_input_idx == 1 here instead of setting to 3 regardless?
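For context (input layout reproduced from the ONNX operator spec; treat it as an assumption to verify): QLinearMatMul takes eight inputs, so the B operand of a plain MatMul (input index 1) lands at input index 3 once scales and zero points are interleaved:

```python
# QLinearMatMul input layout per the ONNX spec (listed from memory).
QLINEAR_MATMUL_INPUTS = [
    "a", "a_scale", "a_zero_point",
    "b", "b_scale", "b_zero_point",
    "y_scale", "y_zero_point",
]

# A MatMul cache input at index 1 ("b") sits at index 3 in QLinearMatMul,
# which is why cache_input_idx == 1 gets remapped to 3.
print(QLINEAR_MATMUL_INPUTS.index("b"))  # 3
```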

@dbogunowicz dbogunowicz changed the title 'Generic' KV cache injection support [KV cache] Base, generic KV Cache injection support May 29, 2023
@dbogunowicz dbogunowicz changed the base branch from main to feature/damian/fb_kv_cache May 29, 2023 10:05
@dbogunowicz dbogunowicz merged commit d4bd539 into feature/damian/fb_kv_cache Jun 12, 2023
@dbogunowicz dbogunowicz deleted the kv-cache-injection branch June 12, 2023 14:54
dbogunowicz added a commit that referenced this pull request Jun 16, 2023
* Update __init__.py

* [KV cache] Base, generic KV Cache injection support (#1559)

* [WIP] Inject core cache ops - pattern matching + base export

* complete initial implementation of CacheKeysAndValues

* documentation + suggestions from Damian

* Cache length adjustment - ABC + OPT impl

* typo

* little cleanup, more importantly, started testing

* stuck on testing

* move KV cache concat to before transpose where applicable

* working for dynamic seq len

* add support for slicing cache by actual length

* add position_embeddings_adjustment

* quantized model support

* Support Q/DQ folding of Parameterized matmuls w/o bias add

* delete Exporter for KV cache for now - since checker doesn't pass, will do transforms ad-hoc

* quality

* refactor

* fix docstrings

* hardening the validation

* validator not needed

---------

Co-authored-by: Damian <damian@neuralmagic.com>
Co-authored-by: dbogunowicz <97082108+dbogunowicz@users.noreply.github.com>

* [KV cache] Properly set the static dimensions of the kv cache inputs/outputs (#1573)

* [WIP] Inject core cache ops - pattern matching + base export

* complete initial implementation of CacheKeysAndValues

* documentation + suggestions from Damian

* Cache length adjustment - ABC + OPT impl

* typo

* little cleanup, more importantly, started testing

* stuck on testing

* move KV cache concat to before transpose where applicable

* working for dynamic seq len

* add support for slicing cache by actual length

* add position_embeddings_adjustment

* quantized model support

* Support Q/DQ folding of Parameterized matmuls w/o bias add

* delete Exporter for KV cache for now - since checker doesn't pass, will do transforms ad-hoc

* quality

* refactor

* initial commit

* fix docstrings

* hardening the validation

* validator not needed

* adressing PR comments

---------

Co-authored-by: Benjamin <ben@neuralmagic.com>
Co-authored-by: bogunowicz@arrival.com <bogunowicz@arrival.com>

* [KV cache] Input/output KV cache to include `batch` dimension. (#1589)

* [WIP] Inject core cache ops - pattern matching + base export

* complete initial implementation of CacheKeysAndValues

* documentation + suggestions from Damian

* Cache length adjustment - ABC + OPT impl

* typo

* little cleanup, more importantly, started testing

* stuck on testing

* move KV cache concat to before transpose where applicable

* working for dynamic seq len

* add support for slicing cache by actual length

* add position_embeddings_adjustment

* quantized model support

* Support Q/DQ folding of Parameterized matmuls w/o bias add

* delete Exporter for KV cache for now - since checker doesn't pass, will do transforms ad-hoc

* quality

* refactor

* initial commit

* fix docstrings

* initial commit

* hardening the validation

* validator not needed

* tested with deepsparse

* ready for reviews

* adressing PR comments

* addressing PR comments

* ready to land

---------

Co-authored-by: Benjamin <ben@neuralmagic.com>
Co-authored-by: bogunowicz@arrival.com <bogunowicz@arrival.com>

* remove  changes

---------

Co-authored-by: Konstantin Gulin <66528950+KSGulin@users.noreply.github.com>
Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
Co-authored-by: Benjamin <ben@neuralmagic.com>
Co-authored-by: bogunowicz@arrival.com <bogunowicz@arrival.com>