[Text Generation] Multitoken prefill enablement #1130
Conversation
…om/neuralmagic/deepsparse into feature/damian/causal_mask_support
Took a deeper look following our offline conversation and I understand why you had to go this route. LGTM, but let's update an existing diagram or add a new one to explain the relationship between the decoder engine, cache, state, state transfer, and capacity.
```python
# self.prompt_processing_sequence_length)
num_non_blank_cache_entries = min(
    num_non_blank_cache_entries,
    self.sequence_length - self.prompt_processing_sequence_length,
)
```
Shouldn't this be the total remaining tokens, i.e. something like `self.sequence_length - idx * self.prompt_processing_sequence_length`? Or am I missing something?
We are essentially talking about the same thing, but my logic was way too overcomplicated. I refactored the function, so now hopefully anyone reading it will grasp what's going on.
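To illustrate the capping logic under discussion, here is a hedged, standalone sketch (the function name and parameters are hypothetical; this is not the actual refactored code from the PR). The cap ensures the reported number of non-blank cache entries never exceeds the cache capacity left over after reserving room for the next chunk of prompt tokens:

```python
def compute_num_non_blank_cache_entries(
    num_processed_tokens: int,
    sequence_length: int,
    prompt_processing_sequence_length: int,
) -> int:
    """Return how many cache entries hold real (non-blank) tokens.

    Capped so the engine always has room for the next chunk of
    `prompt_processing_sequence_length` input tokens within the
    fixed `sequence_length` capacity.
    """
    return min(
        num_processed_tokens,
        sequence_length - prompt_processing_sequence_length,
    )
```

For example, with a sequence length of 128 and a prefill chunk of 16 tokens, the value saturates at 112 no matter how many tokens have been processed.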
…nto feature/damian/multitoken_prefill
…tention_mask for the multitoken prefill
* Update helpers.py
* correct implementation of the mapping from inputs to causal mask
* [Text Generation] Causal Mask Support (#1127)
* initial commit
* clean up the PR
* working implementation
* Ben's review comments
* [Text Generation] Multitoken prefill enablement (#1130)
* initial commit
* clean up the PR
* working implementation
* initial implementation, hacky lets clean it up
* ready for review
* few tiny quality improvements
* simplify the logic for computing num of unmasked bits for creating attention_mask for the multitoken prefill
* replace boolean causal mask for int64 causal mask
* fix breaking tests
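The commits above mention switching to an int64 causal mask. A minimal NumPy sketch of how such a mask can be constructed (illustrative only; the function name and shape conventions are assumptions, not the PR's actual implementation): each new input token may attend to every populated cache entry plus the new tokens up to and including itself.

```python
import numpy as np


def make_causal_mask(num_input_tokens: int, num_cache_entries: int) -> np.ndarray:
    """Build an int64 causal mask of shape (num_input_tokens, total_positions).

    Row i is 1 for all cache entries and for input positions 0..i,
    and 0 for future input positions.
    """
    total_positions = num_cache_entries + num_input_tokens
    mask = np.zeros((num_input_tokens, total_positions), dtype=np.int64)
    for i in range(num_input_tokens):
        mask[i, : num_cache_entries + i + 1] = 1
    return mask
```

Using int64 instead of bool keeps the mask directly consumable by engines that expect integer attention-mask inputs.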
Enable running the pipeline in a mode where the prompt is processed (the prefill scenario) through multiple consecutive passes of the multitoken engine. The goal is to achieve optimal inference speed with the DeepSparse engine.
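A hedged sketch of the chunking step this description implies (the helper name is hypothetical and not from the PR): the prompt is split into full chunks for the multitoken engine, with any leftover tokens handled one at a time by the single-token engine.

```python
def split_prompt_for_prefill(prompt_tokens: list, chunk_size: int):
    """Split a prompt into full chunks for multitoken prefill.

    Returns (chunks, remainder): `chunks` is a list of full-size
    token chunks for the multitoken engine; `remainder` holds the
    trailing tokens left for the single-token engine.
    """
    num_full_chunks = len(prompt_tokens) // chunk_size
    chunks = [
        prompt_tokens[i * chunk_size : (i + 1) * chunk_size]
        for i in range(num_full_chunks)
    ]
    remainder = prompt_tokens[num_full_chunks * chunk_size :]
    return chunks, remainder
```

For a 10-token prompt and a chunk size of 4, this yields two full chunks of 4 tokens and a 2-token remainder.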
Manual Testing
Results: