[KV Cache] BLOOM support #1664

dbogunowicz · 2023-07-11T16:41:07Z

KV Cache injection support for a BLOOM model

Usage

A sample script that injects the KV cache: it creates model_kvcache.onnx from model.onnx (a model exported with sparseml.transformers.export).

import os

import click
import onnx
from sparseml.exporters.kv_cache_injector import KeyValueCacheInjector


@click.command()
@click.option('--input-file', required=True, help='Path to the input ONNX model file')
@click.option('--output-file', required=True, help='Output path for the modified model')
def modify_model(input_file, output_file):
    # Load only the graph; external weight data stays on disk
    model = onnx.load(input_file, load_external_data=False)
    # The injector reads config.json from the model directory (see the log below)
    model = KeyValueCacheInjector(model_path=os.path.dirname(input_file)).apply(model)
    onnx.save(model, output_file)
    print(f"Modified model saved to: {output_file}")


if __name__ == '__main__':
    modify_model()

python kv_cache_injector.py --input-file deployment/model.onnx --output-file deployment/model_kvcache.onnx

2023-07-12 09:49:25 sparseml.exporters.transforms.kv_cache.configs INFO     Loaded config file deployment/config.json for model: bloom
2023-07-12 09:49:25 sparseml.exporters.transforms.kv_cache.configs INFO     Properly configured arguments for KV Cache Transformation
Attempting to validate an in-memory ONNX model that has been loaded without external data. This is currently not supported by the ONNX checker. The validation will be skipped.
2023-07-12 09:49:26 sparseml.exporters.transforms.onnx_transform INFO     [CacheKeysAndValues] Transformed 48 matches
Attempting to validate an in-memory ONNX model that has been loaded without external data. This is currently not supported by the ONNX checker. The validation will be skipped.
Modified model saved to: deployment/model_kvcache.onnx
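
For readers unfamiliar with what the injected cache buys at inference time, here is a toy sketch in plain Python. It is not the sparseml or deepsparse implementation; every name in it is hypothetical, and it ignores multi-head layout, ALiBi biases, and batching. It computes single-head attention for the newest token in two ways: by re-projecting keys and values for the whole sequence at every step, and by appending one key and one value to a cache. The two outputs should agree, which is the property a KV cache model has to preserve.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def attend(query, keys, values):
    """Softmax attention of one query over all keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


random.seed(0)
DIM = 4  # toy hidden size, chosen arbitrarily


def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]


w_q, w_k, w_v = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]


def step_without_cache(seen):
    # Re-project every token seen so far: work grows with sequence length
    keys = [matvec(w_k, t) for t in seen]
    values = [matvec(w_v, t) for t in seen]
    return attend(matvec(w_q, seen[-1]), keys, values)


key_cache, value_cache = [], []


def step_with_cache(token):
    # Project only the new token and reuse the stored keys and values
    key_cache.append(matvec(w_k, token))
    value_cache.append(matvec(w_v, token))
    return attend(matvec(w_q, token), key_cache, value_cache)


max_diff = 0.0
for i, token in enumerate(tokens):
    full = step_without_cache(tokens[: i + 1])
    cached = step_with_cache(token)
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(full, cached)))

print(f"largest difference between cached and uncached outputs: {max_diff:.2e}")
```

The cached path does one key projection and one value projection per step instead of one per token seen so far, at the cost of keeping the cache in memory.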

Feature Preview

Using model_kvcache.onnx, we can run inference with a deepsparse pipeline. Manual tests run as expected (tested with the deepsparse branch neuralmagic/deepsparse#1083):

from deepsparse import Pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

opt = Pipeline.create(task="bloom",
                      model_path="/home/ubuntu/damian/sparseml/deployment",
                      engine_type="onnxruntime",
                      max_generated_tokens=128)

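def test_prompt(prompt, pipeline, pipeline_gt, tokenizer):
    # Generate with the KV cache pipeline and prepend the prompt
    out = pipeline(sequences=prompt, return_logits=True)
    predicted_str = prompt + out.sequences[0]
    # Generate the reference text with the Hugging Face model
    inputs = tokenizer(prompt, return_tensors="pt")
    out_gt = tokenizer.batch_decode(pipeline_gt.generate(**inputs, max_length=100))
    ground_truth_str = out_gt[0]
    print(predicted_str)
    print('-------------------')
    # The pipeline generates more tokens, so the reference must be a prefix
    assert predicted_str.startswith(ground_truth_str)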

test_prompt("Who is the president of the United States?", opt, model, tokenizer)
test_prompt("Who is the president of the United States?" * 20, opt, model, tokenizer)

2023-07-12 09:29:46 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-07-12 09:29:46 deepsparse.utils.onnx INFO     Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-07-12 09:29:47 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-07-12 09:29:47 deepsparse.utils.onnx INFO     Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
Who is the president of the United States?”
“Mr. President, I am the president of the United States.”
“Mr. President, I am the president of the United States.”
“Mr. President, I am the president of the United States.”
“Mr. President, I am the president of the United States.”
“Mr. President, I am the president of the United States.”
“Mr. President, I am the president of the United States.”
“Mr. President, I am the president of the United States.”
“Mr. President, I am the president of the United States.
-------------------
Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is
-------------------

Testing with perplexity values

Perplexity values do not match the ground truth. This has been true for both the CodeGen and BLOOM models (perhaps a problem related to the missing BOS token? OPT perplexity works fine). Looking into it now.
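
As a reminder of what is being compared, here is a minimal sketch of perplexity as the exponential of the mean negative log-likelihood per token. This is not the evaluation code that produced the numbers in this PR; the function name, the `skip_first` flag, and the toy log-probabilities are all assumptions made for illustration. It only shows that including or excluding the first scored position can shift the result, which is one way a BOS-related mismatch could arise.

```python
import math


def perplexity(token_log_probs, skip_first=False):
    """Return exp(mean negative log-likelihood) over the scored tokens.

    skip_first drops the first position, e.g. when it has no BOS context.
    """
    scored = token_log_probs[1:] if skip_first else token_log_probs
    if not scored:
        raise ValueError("no tokens to score")
    return math.exp(-sum(scored) / len(scored))


# Hypothetical log-probabilities: a poorly predicted first token, then three
# tokens that each received probability 0.25.
log_probs = [math.log(0.01), math.log(0.25), math.log(0.25), math.log(0.25)]

ppl_all = perplexity(log_probs)
ppl_skip = perplexity(log_probs, skip_first=True)
print(f"all positions: {ppl_all:.3f}, first position skipped: {ppl_skip:.3f}")
```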

Ground truth (perplexity ignoring BOS token)

{'mean_perplexity': 8.713947713375092, 'perplexities': [4.864218235015869, 7.714944362640381, 14.06081485748291, 12.43343448638916, 8.480101585388184, 6.525933265686035, 9.455163955688477, 6.176970958709717]}

Perplexity (KV Cache model)

openai_humaneval eval results: {'mean_perplexity': 8.33638221025467, 'perplexities': [4.608011722564697, 7.714968204498291, 13.768594741821289, 11.635414123535156, 8.102428436279297, 6.0731635093688965, 9.088720321655273, 5.699756622314453]}

Perplexity (Non KV Cache model)

openai_humaneval eval results: {'mean_perplexity': 8.336387276649475, 'perplexities': [4.608002185821533, 7.714962482452393, 13.768630981445312, 11.635422706604004, 8.102447509765625, 6.073184967041016, 9.088698387145996, 5.699748992919922]}
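
The three result sets above can be compared directly. The sketch below copies the reported per-sample perplexities and prints the largest per-sample gaps; the helper names are illustrative only. On these numbers, the KV cache and non KV cache models agree to within 1e-4 per sample, while both differ from the ground truth by up to roughly 0.8 on a sample and by about 0.38 in the mean.

```python
# Per-sample perplexities copied from the three result dictionaries above.
ground_truth = [4.864218235015869, 7.714944362640381, 14.06081485748291,
                12.43343448638916, 8.480101585388184, 6.525933265686035,
                9.455163955688477, 6.176970958709717]
kv_cache = [4.608011722564697, 7.714968204498291, 13.768594741821289,
            11.635414123535156, 8.102428436279297, 6.0731635093688965,
            9.088720321655273, 5.699756622314453]
no_kv_cache = [4.608002185821533, 7.714962482452393, 13.768630981445312,
               11.635422706604004, 8.102447509765625, 6.073184967041016,
               9.088698387145996, 5.699748992919922]


def max_abs_diff(a, b):
    """Largest absolute per-sample difference between two result lists."""
    return max(abs(x - y) for x, y in zip(a, b))


def mean(values):
    return sum(values) / len(values)


cache_gap = max_abs_diff(kv_cache, no_kv_cache)
truth_gap = max_abs_diff(ground_truth, kv_cache)
mean_gap = mean(ground_truth) - mean(kv_cache)

print(f"KV cache vs non KV cache, max |diff|: {cache_gap:.2e}")
print(f"ground truth vs KV cache, max |diff|: {truth_gap:.3f}")
print(f"ground truth mean minus KV cache mean: {mean_gap:.3f}")
```

So the cache injection itself does not appear to change the model's outputs; the open discrepancy is between both ONNX variants and the ground truth.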

dbogunowicz changed the title ~~[KV Cache] BLOOM~~ [KV Cache] BLOOM support Jul 12, 2023

dbogunowicz force-pushed the feature/damian/kv_cache_bloom branch from 9c5e27d to 20d1944 Compare July 12, 2023 09:55

fix erroneous rebase

8b1169e

dbogunowicz force-pushed the feature/damian/kv_cache_bloom branch from 20d1944 to 8b1169e Compare July 12, 2023 10:00

dbogunowicz marked this pull request as ready for review July 12, 2023 10:01

dbogunowicz requested a review from bfineran July 12, 2023 10:01

rahul-tuli approved these changes Jul 12, 2023

bfineran approved these changes Jul 12, 2023

Merge branch 'main' into feature/damian/kv_cache_bloom

7d7e112

dbogunowicz merged commit 3593b1a into main Jul 12, 2023
20 checks passed

dbogunowicz deleted the feature/damian/kv_cache_bloom branch July 12, 2023 22:14
