
[Text Generation] Causal Mask Support #1127

Merged

Conversation


@dbogunowicz dbogunowicz commented Jul 19, 2023

Allows the user to set the prompt_processing_sequence_length argument of the TextGeneration pipeline to a value different from sequence_length. This effectively enables running the multitoken_engine on input_ids of any length while robustly maintaining kv cache support; in other words, the cache can be prefilled using subsequences of different lengths.
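The idea can be pictured with the rough sketch below; split_prompt_for_prefill is a hypothetical helper (it is not part of this PR) and only illustrates how a tokenized prompt could be divided between the multitoken engine and the single-token engine:

def split_prompt_for_prefill(input_ids, prompt_processing_sequence_length):
    """Yield (chunk, use_multitoken_engine) pairs for a flat list of token ids."""
    num_full_chunks = len(input_ids) // prompt_processing_sequence_length
    for i in range(num_full_chunks):
        # full-size chunks are prefilled through the multitoken engine
        start = i * prompt_processing_sequence_length
        yield input_ids[start:start + prompt_processing_sequence_length], True
    # any leftover tokens are fed one at a time to the single-token engine
    for token in input_ids[num_full_chunks * prompt_processing_sequence_length:]:
        yield [token], False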

Manual Testing

Complementary feature (and PR) from Sparseml: neuralmagic/sparseml#1676

from deepsparse import Pipeline

def _test_pipeline(engine_type):
    opt = Pipeline.create(task="opt",
                          model_path="/home/ubuntu/damian/sparseml/deployment",
                          engine_type=engine_type,
                          prompt_processing_sequence_length=64,
                          use_deepsparse_cache=False,
                          max_generated_tokens=32)
    print('----------')
    prompt = "def hello_world():" # the prompt is short, will not be processed by self.multitoken_engine
    out = opt(sequences=prompt, return_logits=True)
    print(out.sequences[0])
    print('---------')
    prompt = "def hello_world():" * 20 # the prompt is long, will be processed by self.multitoken_engine
    out = opt(sequences=prompt, return_logits=True)
    print(out.sequences[0])

_test_pipeline(engine_type ="onnxruntime")
_test_pipeline(engine_type ="deepsparse")
#### INFERENCE WITH ONNXRUNTIME ####
2023-07-20 05:56:04 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-07-20 05:56:10 deepsparse.utils.onnx INFO     Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-07-20 05:56:18 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-07-20 05:56:24 deepsparse.utils.onnx INFO     Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
----------

    return 'Hello World!'

def hello_world_2():
    return 'Hello World!'

def hello_world_3(): # output is identical to the one from the torch model (baseline)
---------
def hello_world():def hello_world():def hello_world():def hello_world():def hello_world():def hello_world():def hello # output is identical to the one from the torch model (baseline)

#### INFERENCE WITH DEEPSPARSE ####
# Note that we no longer see the warning stating that the multitoken engine must fall back to onnxruntime because the deepsparse runtime cannot support it. Both the single-token and multitoken engines are now running with deepsparse!

/home/ubuntu/damian/deepsparse/src/deepsparse/transformers/pipelines/text_generation.py:126: UserWarning: AVX512 support not detected, disabling internal management of KV cache which may affect performance. To enable full performance, deploy on an AVX512-compatible system.
  warnings.warn(
Using pad_token, but it is not set yet.
2023-07-20 05:56:37 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20230711 COMMUNITY | (34a5203e) (release) (optimized) (system=avx2, binary=avx2)
2023-07-20 05:58:58 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
----------

    return 'Hello World!'

def hello_world_2():
    return 'Hello World!'

def hello_world_3():
---------
def hello_world():def hello_world():def hello_world():def hello_world():def hello_world():def hello_world():def hello

(In this scenario we run with the default prompt_processing_sequence_length=64; setting it to 32 naturally gives the same result.)

@dbogunowicz dbogunowicz changed the base branch from main to feature/damian/causal_mask_fb July 19, 2023 11:56
@dbogunowicz dbogunowicz marked this pull request as ready for review July 20, 2023 06:08
@dbogunowicz dbogunowicz merged commit cbab152 into feature/damian/causal_mask_fb Jul 25, 2023
@dbogunowicz dbogunowicz deleted the feature/damian/causal_mask_support branch July 25, 2023 08:22
bfineran pushed a commit that referenced this pull request Jul 27, 2023
* Update helpers.py

* correct implementation of the mapping from inputs to causal mask

* [Text Generation] Causal Mask Support (#1127)

* initial commit

* clean up the PR

* working implementation

* Ben's review comments

* [Text Generation] Multitoken prefill enablement (#1130)

* initial commit

* clean up the PR

* working implementation

* initial implementation, hacky lets clean it up

* ready for review

* few tiny quality improvements

* simplify the logic for computing num of unmasked bits for creating attention_mask for the multitoken prefill

* replace the boolean causal mask with an int64 causal mask (see the sketch after this list)

* fix breaking tests
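
As a rough illustration of the last two commit items, the sketch below builds an int64 causal mask for a prefill chunk that attends to previously cached positions; make_causal_mask and its exact shape conventions are assumptions for illustration, not the helper added in this PR:

import numpy as np

def make_causal_mask(num_input_tokens, num_cached_tokens):
    """Return an int64 mask of shape (num_input_tokens, num_cached_tokens + num_input_tokens)."""
    total = num_cached_tokens + num_input_tokens
    mask = np.zeros((num_input_tokens, total), dtype=np.int64)
    # positions already in the kv cache are fully visible to every new token
    mask[:, :num_cached_tokens] = 1
    # within the new chunk, each token attends only to itself and earlier tokens
    mask[:, num_cached_tokens:] = np.tril(
        np.ones((num_input_tokens, num_input_tokens), dtype=np.int64)
    )
    return mask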