
[Text Generation] Causal Mask Support #1127

Merged

Conversation


@dbogunowicz dbogunowicz commented Jul 19, 2023

Allows the user to set the prompt_processing_sequence_length argument of the TextGeneration pipeline to a value different from sequence_length. This effectively enables running the multitoken_engine on input_ids of any length while robustly maintaining kv cache support; in other words, the cache can be prefilled using subsequences of different lengths.
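The idea can be pictured with the rough sketch below; split_prompt_for_prefill is a hypothetical helper (it is not part of this PR) and only illustrates how a tokenized prompt could be divided between the multitoken engine and the single-token engine:

def split_prompt_for_prefill(input_ids, prompt_processing_sequence_length):
    """Yield (chunk, use_multitoken_engine) pairs for a flat list of token ids."""
    num_full_chunks = len(input_ids) // prompt_processing_sequence_length
    for i in range(num_full_chunks):
        # full-size chunks are prefilled through the multitoken engine
        start = i * prompt_processing_sequence_length
        yield input_ids[start:start + prompt_processing_sequence_length], True
    # any leftover tokens are fed one at a time to the single-token engine
    for token in input_ids[num_full_chunks * prompt_processing_sequence_length:]:
        yield [token], False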

Manual Testing

Complementary feature (and PR) from Sparseml: neuralmagic/sparseml#1676

from deepsparse import Pipeline

def _test_pipeline(engine_type):
    opt = Pipeline.create(task="opt",
                          model_path="/home/ubuntu/damian/sparseml/deployment",
                          engine_type=engine_type,
                          prompt_processing_sequence_length=64,
                          use_deepsparse_cache=False,
                          max_generated_tokens=32)
    print('----------')
    prompt = "def hello_world():" # the prompt is short, will not be processed by self.multitoken_engine
    out = opt(sequences=prompt, return_logits=True)
    print(out.sequences[0])
    print('---------')
    prompt = "def hello_world():" * 20 # the prompt is long, will be processed by self.multitoken_engine
    out = opt(sequences=prompt, return_logits=True)
    print(out.sequences[0])

_test_pipeline(engine_type ="onnxruntime")
_test_pipeline(engine_type ="deepsparse")
#### INFERENCE WITH ONNXRUNTIME ####
2023-07-20 05:56:04 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-07-20 05:56:10 deepsparse.utils.onnx INFO     Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-07-20 05:56:18 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-07-20 05:56:24 deepsparse.utils.onnx INFO     Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
----------

    return 'Hello World!'

def hello_world_2():
    return 'Hello World!'

def hello_world_3(): # output is identical to the one from the torch model (baseline)
---------
def hello_world():def hello_world():def hello_world():def hello_world():def hello_world():def hello_world():def hello # output is identical to the one from the torch model (baseline)

#### INFERENCE WITH DEEPSPARSE ####
# Note that we no longer see the warning stating that the multitoken engine must fall back to onnxruntime because the deepsparse runtime cannot support it. Both the single-token and multitoken engines are now running with deepsparse!

/home/ubuntu/damian/deepsparse/src/deepsparse/transformers/pipelines/text_generation.py:126: UserWarning: AVX512 support not detected, disabling internal management of KV cache which may affect performance. To enable full performance, deploy on an AVX512-compatible system.
  warnings.warn(
Using pad_token, but it is not set yet.
2023-07-20 05:56:37 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20230711 COMMUNITY | (34a5203e) (release) (optimized) (system=avx2, binary=avx2)
2023-07-20 05:58:58 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
----------

    return 'Hello World!'

def hello_world_2():
    return 'Hello World!'

def hello_world_3():
---------
def hello_world():def hello_world():def hello_world():def hello_world():def hello_world():def hello_world():def hello

(In this scenario we run with the default prompt_processing_sequence_length=64; setting it to 32 naturally gives the same result.)

@dbogunowicz dbogunowicz changed the base branch from main to feature/damian/causal_mask_fb July 19, 2023 11:56
@dbogunowicz dbogunowicz marked this pull request as ready for review July 20, 2023 06:08
@dbogunowicz dbogunowicz merged commit cbab152 into feature/damian/causal_mask_fb Jul 25, 2023
@dbogunowicz dbogunowicz deleted the feature/damian/causal_mask_support branch July 25, 2023 08:22
bfineran pushed a commit that referenced this pull request Jul 27, 2023
* Update helpers.py

* correct implementation of the mapping from inputs to causal mask

* [Text Generation] Causal Mask Support (#1127)

* initial commit

* clean up the PR

* working implementation

* Ben's review comments

* [Text Generation] Multitoken prefill enablement (#1130)

* initial commit

* clean up the PR

* working implementation

* initial implementation, hacky lets clean it up

* ready for review

* few tiny quality improvements

* simplify the logic for computing num of unmasked bits for creating attention_mask for the multitoken prefill

* replace the boolean causal mask with an int64 causal mask (see the sketch after this list)

* fix breaking tests
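
As a rough illustration of the last two commit items, the sketch below builds an int64 causal mask for a prefill chunk that attends to previously cached positions; make_causal_mask and its exact shape conventions are assumptions for illustration, not the helper added in this PR:

import numpy as np

def make_causal_mask(num_input_tokens, num_cached_tokens):
    """Return an int64 mask of shape (num_input_tokens, num_cached_tokens + num_input_tokens)."""
    total = num_cached_tokens + num_input_tokens
    mask = np.zeros((num_input_tokens, total), dtype=np.int64)
    # positions already in the kv cache are fully visible to every new token
    mask[:, :num_cached_tokens] = 1
    # within the new chunk, each token attends only to itself and earlier tokens
    mask[:, num_cached_tokens:] = np.tril(
        np.ones((num_input_tokens, num_input_tokens), dtype=np.int64)
    )
    return mask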