
[Text Generation] Terminate the inference when kv cache is full #1446

Merged

Conversation

@dbogunowicz dbogunowicz commented Dec 1, 2023

Feature Description

Once the KV cache is full, instead of continuing inference by evicting the oldest cache entries to make room for new ones, we now terminate the inference with the finish reason "capacity".

Manual Testing

from deepsparse import Pipeline

prompt = "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?"
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned80_quantized"

# Use a small sequence_length so the KV cache fills up during generation
pipeline = Pipeline.create(task="text-generation", model_path=model_path, sequence_length=64)
out = pipeline(prompt=prompt)

Before:

Displaying out.generations[0].text and out.generations[0].finished_reason:

text='He runs 60*3=<<60*3=180>>180 meters in total per sprint,  Comays 5  \nen3 was 2= a en6ound'
finished=True, finished_reason='max_new_tokens'

Now:

text='He runs 60*3=<<60*3=180>>180 meters in total per s'
finished=True, finished_reason='capacity'
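
Downstream code can check the new finish reason on the returned generation. A minimal sketch (reusing the pipeline and prompt from the snippet above; the fallback behavior shown is only illustrative and not part of this PR):

out = pipeline(prompt=prompt)
generation = out.generations[0]
if generation.finished_reason == "capacity":
    # The KV cache filled up before generation completed, so the text is truncated.
    # A caller could, for example, retry with a larger sequence_length.
    print("Truncated output:", generation.text)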

Satrat previously approved these changes Dec 1, 2023
@dbogunowicz dbogunowicz merged commit 29e1356 into main Dec 6, 2023
1 of 13 checks passed
@dbogunowicz dbogunowicz deleted the feature/damian/terminate_inference_when_kv_cache_full branch December 6, 2023 15:46