
[KV Cache Injection] Causal Mask for OPT #1688

Merged

Conversation

@dbogunowicz dbogunowicz commented Jul 25, 2023

Add causal mask support for OPT models to enable multitoken prefill in the Deepsparse pipeline.
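To illustrate what "causal mask for multitoken prefill" means here, below is a minimal NumPy sketch (names and shapes are illustrative, not the PR's actual code): when a chunk of new prompt tokens is processed at once, each new position may attend to all cached positions plus the new positions up to and including itself.

```python
import numpy as np

def causal_mask(num_new_tokens: int, num_cached_tokens: int) -> np.ndarray:
    """Boolean causal mask for a prefill chunk of `num_new_tokens` queries
    attending over `num_cached_tokens` cached keys plus the new keys.
    Shape: (num_new_tokens, num_cached_tokens + num_new_tokens)."""
    total_keys = num_cached_tokens + num_new_tokens
    # absolute position of each new query token in the full sequence
    query_pos = np.arange(num_new_tokens)[:, None] + num_cached_tokens
    key_pos = np.arange(total_keys)[None, :]
    # a query may attend to any key at or before its own position
    return key_pos <= query_pos

m = causal_mask(3, 2)
```

With 2 cached tokens and a 3-token chunk, the first new token sees only the cache and itself, while the last sees every position.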

Manual Testing

1. Export the OPT model.

2. Inject the KV cache:

```
python kv_cache_injector.py --input-file deployment/model.onnx --output-file deployment/model_kvcache.onnx
2023-07-25 13:18:44 sparseml.exporters.transforms.kv_cache.configs INFO     Loaded config file deployment/config.json for model: opt
2023-07-25 13:18:44 sparseml.exporters.transforms.kv_cache.configs INFO     Properly configured arguments for KV Cache Transformation
2023-07-25 13:18:46 sparseml.exporters.transforms.onnx_transform INFO     [CacheKeysAndValues] Transformed 48 matches
2023-07-25 13:18:49 sparseml.exporters.transforms.onnx_transform INFO     [AdditionalTransformsOPT] Transformed 2 matches
```

3. Run inference (using this branch: [Text Generation] Causal Mask Feature Branch deepsparse#1126):
```python
from deepsparse import Pipeline

def _test_pipeline(engine_type):
    opt = Pipeline.create(
        task="opt",
        model_path="/home/ubuntu/damian/sparseml/deployment",
        engine_type=engine_type,
        prompt_processing_sequence_length=64,
        use_deepsparse_cache=False,
        max_generated_tokens=32,
    )
    print("----------")
    # the prompt is short; it will not be processed by self.multitoken_engine
    prompt = "Who is the president of the United States?"
    out = opt(sequences=prompt, return_logits=True)
    print(out.sequences[0])
    print("---------")
    # the prompt is long; it will be processed by self.multitoken_engine
    prompt = "Who is the president of the United States?" * 20
    out = opt(sequences=prompt, return_logits=True)
    print(out.sequences[0])

_test_pipeline(engine_type="onnxruntime")
_test_pipeline(engine_type="deepsparse")
```
```
2023-07-25 13:23:35 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-07-25 13:23:41 deepsparse.utils.onnx INFO     Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-07-25 13:23:51 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-07-25 13:23:57 deepsparse.utils.onnx INFO     Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
----------


The president of the United States is the head of the executive branch of government. The president is the head of the executive branch of government, and the
---------
Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of
/home/ubuntu/damian/deepsparse/src/deepsparse/transformers/pipelines/text_generation.py:137: UserWarning: AVX512 support not detected, disabling internal management of KV cache which may affect performance. To enable full performance, deploy on an AVX512-compatible system.
  warnings.warn(
2023-07-25 13:24:16 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20230725 COMMUNITY | (f26e1c2e) (release) (optimized) (system=avx2, binary=avx2)
2023-07-25 13:24:31 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
----------


The president of the United States is the head of the executive branch of government. The president is the head of the executive branch of government, and the
---------
Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of
```
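The test comments note that the short prompt bypasses `self.multitoken_engine` while the long one uses it. A simplified sketch of that dispatch, assuming the behavior implied by `prompt_processing_sequence_length=64` (this is an illustration of the idea, not deepsparse's actual implementation):

```python
def split_prefill(num_prompt_tokens: int, chunk_len: int = 64) -> list:
    """Illustrative dispatch: full chunks of `chunk_len` tokens go to the
    multitoken engine; any remainder is processed one token at a time."""
    chunks = []
    pos = 0
    while num_prompt_tokens - pos >= chunk_len:
        chunks.append(("multitoken", chunk_len))
        pos += chunk_len
    # leftover tokens (or a prompt shorter than chunk_len) go single-token
    chunks.extend(("single", 1) for _ in range(num_prompt_tokens - pos))
    return chunks
```

A ~10-token prompt yields only single-token steps, while a 20x-repeated prompt produces several multitoken chunks, matching the two cases exercised above.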

@dbogunowicz dbogunowicz changed the base branch from main to feature/damian/causal_mask_codegen July 25, 2023 08:42
@dbogunowicz dbogunowicz marked this pull request as ready for review July 25, 2023 13:17
Base automatically changed from feature/damian/causal_mask_codegen to feature/damian/refactor_injection July 25, 2023 14:35
@bfineran bfineran (Member) left a comment

LGTM pending rebase and the comment regarding a strong preference for using cast over where


```
@classmethod
def add_positions_input(cls, model: ModelProto) -> ModelProto:
def add_causal_mask_input(self, model: ModelProto) -> ModelProto:
```
looks like this needs rebase?

```
@@ -12,78 +12,55 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from onnx import ModelProto, NodeProto
```
rebase?

```
| causal_mask
| |
| Where
```
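The diagram above shows the causal mask feeding a `Where` node. In attention, `Where` typically selects the raw score where the mask allows attention and a large negative value elsewhere, so softmax drives the disallowed positions to zero. A NumPy sketch of that pattern (toy values, not the PR's graph):

```python
import numpy as np

# Toy attention scores for 3 query positions over 3 key positions.
scores = np.array([[0.5, 1.0, 2.0],
                   [0.1, 0.2, 0.3],
                   [1.0, 1.0, 1.0]])

# Lower-triangular causal mask: True where attention is allowed.
mask = np.tril(np.ones_like(scores, dtype=bool))

# Where(mask, scores, very_negative): future positions get a value that
# softmax maps to (effectively) zero probability.
masked = np.where(mask, scores, np.finfo(scores.dtype).min)
```

Row `i` of `masked` keeps scores for keys `0..i` and floors everything after, which is exactly the select-by-condition behavior the `Where` node provides in the ONNX graph.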
Why are we using a Where instead of a Cast to bool? Can you check with runtime which would work better, if either is fine? Additionally, a Cast would seem to read better in ONNX than a Where, which involves a condition.
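For the Cast-vs-Where question: when the mask holds only 0s and 1s, the two are numerically interchangeable ways to obtain a boolean mask. A NumPy sketch of the equivalence (ONNX `Cast(to=BOOL)` maps nonzero to True; `Where(cond, X, Y)` selects elementwise):

```python
import numpy as np

# An int64 causal mask with 1 = attend, 0 = masked.
mask_i64 = np.array([[1, 0, 0],
                     [1, 1, 0],
                     [1, 1, 1]], dtype=np.int64)

# Cast route: reinterpret nonzero entries as True.
via_cast = mask_i64.astype(bool)

# Where route: explicit elementwise select reproducing the cast.
via_where = np.where(mask_i64 != 0, True, False)

assert np.array_equal(via_cast, via_where)
```

Since the outputs agree, the choice comes down to runtime support and graph readability, which is what the review comment asks to verify.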

@dbogunowicz dbogunowicz merged commit db62ca0 into feature/damian/refactor_injection Jul 26, 2023
@dbogunowicz dbogunowicz deleted the feature/damian/causal_mask_opt branch July 26, 2023 06:31
bfineran pushed a commit that referenced this pull request Jul 27, 2023
…1677)

* initial commit

* [KV Cache Injection] Causal Mask for CodeGen (#1676)

* initial implementation; testing now

* fix a small blunder

* cleanup

---------

Co-authored-by: bogunowicz@arrival.com <bogunowicz@arrival.com>

* [KV Cache Injection] Causal Mask for OPT (#1688)

* initial implementation; testing now

* fix a small blunder

* cleanup

* initial implementation

* on to testing with deepsparse

---------

Co-authored-by: bogunowicz@arrival.com <bogunowicz@arrival.com>

* replace boolean causal mask for int64 causal mask

* better logging info

* allow transformations to be also a list

---------

Co-authored-by: bogunowicz@arrival.com <bogunowicz@arrival.com>