
[Text Generation] Terminate the inference when kv cache is full #1446

Merged

Conversation

@dbogunowicz dbogunowicz commented Dec 1, 2023

Feature Description

Once the KV cache is full, instead of continuing inference by evicting the oldest cache entries to make room for new ones, we now terminate the inference with the finish reason "capacity".

Manual Testing

from deepsparse import Pipeline

prompt = "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?"
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned80_quantized"

# Use a small sequence_length so the KV cache fills up during generation
pipeline = Pipeline.create(task="text-generation", model_path=model_path, sequence_length=64)
out = pipeline(prompt=prompt)

Before:

Displaying out.generations[0].text and out.generations[0].finished_reason:

text='He runs 60*3=<<60*3=180>>180 meters in total per sprint,  Comays 5  \nen3 was 2= a en6ound'
finished=True, finished_reason='max_new_tokens'

Now:

text='He runs 60*3=<<60*3=180>>180 meters in total per s'
finished=True, finished_reason='capacity'
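
Downstream code can check the new finish reason on the returned generation. A minimal sketch (reusing the pipeline and prompt from the snippet above; the fallback behavior shown is only illustrative and not part of this PR):

out = pipeline(prompt=prompt)
generation = out.generations[0]
if generation.finished_reason == "capacity":
    # The KV cache filled up before generation completed, so the text is truncated.
    # A caller could, for example, retry with a larger sequence_length.
    print("Truncated output:", generation.text)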

Satrat previously approved these changes Dec 1, 2023
@dbogunowicz dbogunowicz merged commit 29e1356 into main Dec 6, 2023
1 of 13 checks passed
@dbogunowicz dbogunowicz deleted the feature/damian/terminate_inference_when_kv_cache_full branch December 6, 2023 15:46