
[Feature Branch] KV Cache Interface #1083

Merged: 109 commits merged into main from feature/damian/fb_kv_cache on Jul 12, 2023
Conversation

dbogunowicz (Contributor) commented Jun 21, 2023

Feature Preview

This feature branch aggregates all of the features that constitute the KV Cache Interface implementation. It includes:

No-cache inference

from deepsparse import Pipeline
import time
start = time.time()
opt = Pipeline.create(task="opt",
                      model_path="/home/ubuntu/damian/sparseml/deployment",
                      engine_type="onnxruntime",
                      max_generated_tokens=1)
prompt = "Who is the president of the United States?"
output = opt(sequences=prompt, return_logits=True)
print(output)
sequences=['\n'] logits=array([[[-12.644863 , -12.9746065,   2.577626 , ..., -13.5366125,
         -13.376596 , -14.587112 ]]], dtype=float32) session_id=None # same as in pytorch inference
Ground truth: [-12.6449, -12.9746,   2.5776,  ..., -13.5366, -13.3766, -14.5871]
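
For reference, the "ground truth" logits above can be reproduced with a plain PyTorch/Hugging Face baseline. A minimal sketch, assuming the deployment directory was exported from an OPT checkpoint on the Hugging Face hub (the model_id below is a placeholder for whatever checkpoint that actually was):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-1.3b"  # placeholder: substitute the checkpoint the ONNX export came from
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

prompt = "Who is the president of the United States?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: [batch, seq_len, vocab_size]
# the logits at the last prompt position predict the first generated token
# and should line up with the pipeline's return_logits output above
print(logits[:, -1, :])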

Single-token engine decoding only:

from deepsparse import Pipeline
opt = Pipeline.create(task="opt",
                      model_path="/home/ubuntu/damian/sparseml/deployment",
                      engine_type="onnxruntime",
                      max_generated_tokens=128)
prompt = "Who is the president of the United States?"
output = opt(sequences=prompt)
print(output.sequences)
2023-06-27 07:55:20 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:55:24 deepsparse.utils.onnx INFO     Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:56:37 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:56:40 deepsparse.utils.onnx INFO     Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
['\n\nThe president of the United States is the head of the executive branch of government. The president is the head of the executive branch of government, and the president is the head of the executive branch of government. The president is the head of the executive branch of government, and the president is the head of the executive branch of government.\n\nThe president is the head of the executive branch of government, and the president is the head of the executive branch of government. The president is the head of the executive branch of government, and the president is the head of the executive branch of government. The president is the head of the executive']
Ground truth: The president of the United States is the head of the executive branch of government. The president is the head of the executive branch of government, and the president is the head of the executive branch of government. The president is the head of the executive branch of government, and the president is the head of the executive branch of government.

The president is the head of the executive branch of government, and the president is the head of the executive branch of government. The president is the head of the executive branch of government, and the president is the head of the executive branch of government.

Single-token engine and multi-token engine decoding:

from deepsparse import Pipeline
opt = Pipeline.create(task="opt",
                      model_path="/home/ubuntu/damian/sparseml/deployment",
                      engine_type="onnxruntime",
                      max_generated_tokens=128)
prompt = "Who is the president of the United States?" * 20
output = opt(sequences=prompt)
print(output.sequences)
2023-06-27 07:57:53 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:58:47 deepsparse.utils.onnx INFO     Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:58:52 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:58:58 deepsparse.utils.onnx INFO     Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
['Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is']
Ground truth: Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?
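
For context on what the long prompt above exercises: a rough sketch of how a prompt can be split between a multi-token engine (which ingests the prompt in fixed-size windows) and the single-token engine (which handles the remainder and every generated token). This is an illustration of the scheduling idea only, not the pipeline's actual internals, and the window size of 4 is made up:

def split_prompt(tokens, multitoken_window=4):
    """Split prompt tokens into full windows for the multi-token engine
    plus a remainder for the single-token engine (illustrative only)."""
    n_full = (len(tokens) // multitoken_window) * multitoken_window
    windows = [tokens[i:i + multitoken_window] for i in range(0, n_full, multitoken_window)]
    remainder = tokens[n_full:]
    return windows, remainder

windows, remainder = split_prompt(list(range(10)))
print(windows)    # [[0, 1, 2, 3], [4, 5, 6, 7]] -> multi-token engine
print(remainder)  # [8, 9]                       -> single-token engine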

Testing Scope

Manual Tests

The script below terminates without raising an error:

from deepsparse import Pipeline



def test_pipeline(engine_type):
    opt = Pipeline.create(task="opt",
                          model_path="/home/ubuntu/damian/sparseml/deployment",
                          engine_type=engine_type,
                          max_generated_tokens=32)

    prompt1 = "Who is the president of the United States?"
    prompt2 = "Who is the president of the United States?" * 20

    # test correctness unbatched input for a single-token engine
    out = opt(sequences=prompt1)
    out_ = opt(sequences=[prompt1])
    for x in [out, out_]:
        assert x.sequences[0] == '\n\nThe president of the United States is the head of the executive branch of government. The president is the head of the executive branch of government, and the'

    # test correctness unbatched input for a multi-token engine (very long input)
    out = opt(sequences=prompt2)
    out_ = opt(sequences=[prompt2])
    for x in [out, out_]:
        assert x.sequences[0] == 'Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of'

    # test correctness batched input same input lengths
    out = opt(sequences=[prompt1, prompt1])
    for x in range(2):
        assert out.sequences[x] == '\n\nThe president of the United States is the head of the executive branch of government. The president is the head of the executive branch of government, and the'

    # test correctness batched input different input lengths
    out = opt(sequences=[prompt1, prompt2])
    assert out.sequences[0] == '\n\nThe president of the United States is the head of the executive branch of government. The president is the head of the executive branch of government, and the'
    assert out.sequences[1] == 'Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of'


test_pipeline(engine_type="onnxruntime")
test_pipeline(engine_type="deepsparse")

Testing with eval downstream

python src/deepsparse/transformers/eval_downstream.py /home/ubuntu/damian/sparseml/deployment --dataset openai_humaneval --engine onnxruntime --max-samples 4

HF baseline:

{'perplexities': [6.606388092041016, 9.294904708862305, 17.560449600219727, 13.867135047912598], 'mean_perplexity': 11.832219362258911}

Result with KV cache model:

openai_humaneval eval results: {'mean_perplexity': 11.834174752235413, 'perplexities': [6.607589244842529, 9.296476364135742, 17.56318473815918, 13.8694486618042]}
# results match the baseline

Result with non-KV cache model:

openai_humaneval eval results: {'mean_perplexity': 11.834173798561096, 'perplexities': [6.607589244842529, 9.296476364135742, 17.563180923461914, 13.8694486618042]}
# results match the baseline
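
For reference, a rough sketch of how the HF baseline numbers above could be reproduced. The model_id is a placeholder for whatever checkpoint the deployment was exported from, and the exact text construction in eval_downstream.py may differ slightly:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-1.3b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

dataset = load_dataset("openai_humaneval", split="test")
perplexities = []
for sample in list(dataset)[:4]:  # mirrors --max-samples 4
    text = sample["prompt"] + sample["canonical_solution"]
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss  # mean token cross-entropy
    perplexities.append(torch.exp(loss).item())

print({"perplexities": perplexities,
       "mean_perplexity": sum(perplexities) / len(perplexities)})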

Current Limitations

  • We are unable to run the "autoregressive multi-token" inference scenario; this will be added in the future.
  • Due to limited DeepSparse engine support for the KV cache, we are not using the LIB.kv_cache object for cache manipulation. We are also unable to run multi-token inference in the engine because of an issue with "zero-length" cache ingestion; in DeepSparse engine inference, whenever a sequence would normally be processed by the multi-token engine, the single-token engine takes over instead (see the sketch after this list).
  • The initialization time for the DeepSparse engine is long (a few minutes).
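
A hypothetical illustration of the fallback described in the second bullet (the function name, engine callables, and threshold are made up; the real wiring lives in the text-generation pipeline code):

def pick_engine(prompt_tokens, engine_type, single_token_engine, multi_token_engine,
                prompt_processing_min_length=16):
    """Return the engine that will ingest the prompt (illustrative only)."""
    long_enough = len(prompt_tokens) >= prompt_processing_min_length
    # current limitation: on the deepsparse engine the single-token engine
    # always takes over, even when the prompt is long enough for multi-token
    if long_enough and engine_type != "deepsparse":
        return multi_token_engine
    return single_token_engine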

dbogunowicz and others added 30 commits June 5, 2023 15:55
* initial commit

* coreys simplifications

* finishing the second model static

* ready, time for beautification

* ready for review

* moved the code to examples

* fix eos logic

* add argument num_tokens_to_generate
* initial commit

* coreys simplifications

* finishing the second model static

* ready, time for beautification

* ready for review

* moved the code to examples

* fix eos logic

* add argument num_tokens_to_generate

* initial commit

* change order

* Update examples/codegen/README.md

Co-authored-by: corey-nm <109536191+corey-nm@users.noreply.github.com>

---------

Co-authored-by: corey-nm <109536191+corey-nm@users.noreply.github.com>
bfineran (Member) previously approved these changes Jul 10, 2023, and left a comment:

LGTM - we 100% need a bit more testing, let's make a plan for that. Let's also include the deepsparse vs ort perplexities in the description

rahul-tuli previously approved these changes Jul 11, 2023
Review threads (resolved):
  • src/deepsparse/pipeline.py
  • src/deepsparse/tasks.py
  • src/deepsparse/tasks.py (outdated)
  • src/deepsparse/transformers/README.md
  • src/deepsparse/transformers/engines/nl_decoder_engine.py (outdated)
@dbogunowicz dismissed stale reviews from rahul-tuli and bfineran via 37e8a02 on July 11, 2023 14:57
bfineran previously approved these changes Jul 11, 2023
rahul-tuli previously approved these changes Jul 12, 2023
@dbogunowicz dismissed stale reviews from rahul-tuli and bfineran via 41e9306 on July 12, 2023 14:46
@bfineran merged commit c6aa08f into main on Jul 12, 2023
7 checks passed
@bfineran deleted the feature/damian/fb_kv_cache branch on July 12, 2023 15:15