Perplexity Eval for Text Generation Models #1073

dbogunowicz · 2023-06-15T17:55:44Z

Feature description

Adds a pathway for computing the perplexity of a text generation pipeline on a given dataset.

Example:

python eval_downstream.py path/to/model.onnx --dataset openai_humaneval --engine onnxruntime --max-samples 16

> openai_humaneval eval results: {'mean_perplexity': 12.780163526535034, 'perplexities': [6.606388092041016, 9.294905662536621, 17.560449600219727, ...

vs

import evaluate
from datasets import load_dataset

dataset_name = "openai_humaneval"
dataset = load_dataset(dataset_name)["test"]

# Hugging Face reference implementation of the perplexity metric
perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = []
for idx, sample in enumerate(dataset):
    # score each prompt together with its reference solution
    input_texts.append(sample["prompt"] + sample["canonical_solution"])
    if idx == 16:
        break
results = perplexity.compute(model_id="facebook/opt-350m",
                             add_start_token=True,
                             predictions=input_texts)
print(results)

> {'perplexities': [6.606388092041016, 9.294907569885254, 17.560461044311523, 13.867135047912598, 17.79571533203125, 12.038525581359863], 'mean_perplexity': 12.780163526535034}
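To make the comparison concrete, here is a small sketch that checks the first three per-sample perplexities from the two outputs above against each other. The values are copied from those outputs; the helper name `max_relative_diff` and the tolerance are assumptions picked for illustration, not part of this PR.

```python
import math

# First three per-sample perplexities, copied from the two outputs above.
deepsparse_ppl = [6.606388092041016, 9.294905662536621, 17.560449600219727]
hf_ppl = [6.606388092041016, 9.294907569885254, 17.560461044311523]


def max_relative_diff(a, b):
    """Largest relative difference between paired values (hypothetical helper)."""
    return max(abs(x - y) / max(abs(x), abs(y)) for x, y in zip(a, b))


diff = max_relative_diff(deepsparse_ppl, hf_ppl)
print(f"max relative difference: {diff:.2e}")

# rel_tol=1e-5 is an assumed tolerance, loose enough for float32 round-off.
assert all(
    math.isclose(x, y, rel_tol=1e-5) for x, y in zip(deepsparse_ppl, hf_ppl)
)
```

The per-sample values differ only in the trailing digits, which is consistent with floating-point round-off between two runtimes rather than a difference in the metric itself.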

Notable Features

Implemented the deepsparse port of the HF perplexity metric (https://huggingface.co/spaces/evaluate-metric/perplexity)
Perplexity evaluation of models WITH and WITHOUT kv cache support (depending on what type of model model.onnx is)
Perplexity in both scenarios (cache/no cache, for both OPT and CodeGen models) agrees exactly with the benchmark, that is, the Hugging Face perplexity calculation
Since we directly reuse the pipeline's tokenizer for computing the perplexity, my port removes the need for the add_start_token argument; this is detected by the pipeline
Validated the batched inference of the text generation pipeline
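For readers unfamiliar with the metric, the following is a minimal NumPy sketch of the standard perplexity definition (the exponential of the mean negative log-likelihood of each token given its prefix). It is not the code from this PR or from the Hugging Face metric; the function names and array layout are assumptions for illustration.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def perplexity(logits, input_ids, attention_mask):
    """Per-sequence perplexity (hypothetical helper).

    logits: (B, T, V) float array, input_ids and attention_mask: (B, T) int arrays.
    """
    # The logits at position t score the token at position t + 1.
    log_probs = log_softmax(logits[:, :-1, :])
    targets = input_ids[:, 1:]
    mask = attention_mask[:, 1:].astype(log_probs.dtype)
    token_ll = np.take_along_axis(log_probs, targets[..., None], axis=-1).squeeze(-1)
    # Average the negative log-likelihood over non-padded target tokens only.
    nll = -(token_ll * mask).sum(axis=1) / mask.sum(axis=1)
    return np.exp(nll)


# Sanity check: uniform logits over V classes give a perplexity of V.
B, T, V = 2, 5, 7
logits = np.zeros((B, T, V))
input_ids = np.ones((B, T), dtype=np.int64)
attention_mask = np.array([[1, 1, 1, 1, 1], [0, 0, 1, 1, 1]])  # second row is left padded
ppl = perplexity(logits, input_ids, attention_mask)
assert np.allclose(ppl, V)
```

The uniform-logits check is a convenient sanity test because the result does not depend on the token ids or on how much padding each row carries.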

Note: For now, because of the limitation of the kv cache injection (being unable to run "multi-token autoregressive inference"), when kv cache is enabled, the tokens run exclusively through the single-token engine. Once we overcome this hurdle, I'll revisit the perplexity evaluation to ensure that the pipeline's final design is properly supported.

Additional testing

Make sure that a standard, generative inference with cache and short prompts (single-token engine only) still runs fine
Make sure that a standard, generative inference with cache and long prompts (single-token and multi-token engine) still runs fine

dbogunowicz · 2023-06-29T12:14:01Z

src/deepsparse/transformers/engines/nl_decoder_engine.py

-        logits = logits[:, -1, :].reshape(B, 1, V)  # only take the last token
-
-        token = self.generate_token(logits=logits)
+        token = self.generate_token(logits=logits[:, -1, :])

        return token, logits


We need all the logits, that is, the ones predicted from every prefix of the sequence: {}, {x1}, {x1, x2}, ..., {x1, x2, ..., x_n}.
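The effect of the change in the diff above can be illustrated with a toy NumPy sketch. The function `step` and the greedy `generate_token` body are assumptions for illustration (only the name `generate_token` comes from the diff); the point is the shape of what gets returned.

```python
import numpy as np


def generate_token(logits):
    """Greedy pick over the vocabulary; logits: (B, V). Assumed behaviour."""
    return logits.argmax(axis=-1)


def step(logits):
    """Hypothetical stand-in for the engine call; logits: (B, T, V).

    Only the last position is used to pick the next token, but the full
    logits are returned so every position can be scored afterwards.
    """
    token = generate_token(logits[:, -1, :])
    return token, logits


rng = np.random.default_rng(0)
B, T, V = 2, 4, 5
logits = rng.normal(size=(B, T, V))

token, returned = step(logits)
old_style = logits[:, -1, :].reshape(B, 1, V)  # what the removed line kept

print(returned.shape)   # all T positions survive
print(old_style.shape)  # only the last position survives
```

With the old reshape, the logits for the first T - 1 positions are discarded before they reach the caller, so a per-token metric such as perplexity cannot be computed from the return value.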

bfineran · 2023-07-03T14:58:30Z

src/deepsparse/transformers/pipelines/text_generation.py

@@ -89,6 +96,10 @@ class TextGenerationPipeline(TransformersPipeline):
        of tokens supplied even if the stop token is reached.
    :param use_deepsparse_cache: if True, the pipeline will use the deepsparse kv cache
        for caching the model outputs.
+    :param tokenizer_padding_side: the side to pad the input sequence to.


As discussed offline: running right-padded for eval will likely not work for the engine (single-token prefill), because internally it builds the KV cache assuming left padding and pops from the left side of the cache as it is built up. In the right-padded scenario, I believe this will delete the actual non-padded values from the cache too early.
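The concern can be illustrated with a deliberately simplified model of a fixed-capacity cache that drops its leftmost entry when a new one is appended. This is an assumption made for illustration only and does not reproduce the engine's actual cache logic; `prefill` and `PAD` are hypothetical names.

```python
from collections import deque

PAD = "<pad>"


def prefill(tokens, capacity):
    """Feed tokens one at a time into a toy cache (hypothetical helper).

    Appending to a full deque with maxlen set drops the leftmost entry.
    """
    cache = deque(maxlen=capacity)
    for tok in tokens:
        cache.append(tok)
    return list(cache)


real = ["a", "b", "c"]
left_padded = [PAD, PAD] + real
right_padded = real + [PAD, PAD]

left_result = prefill(left_padded, capacity=3)
right_result = prefill(right_padded, capacity=3)

print(left_result)   # ['a', 'b', 'c']
print(right_result)  # ['c', '<pad>', '<pad>']
```

In this toy model, left padding is evicted first and the real tokens remain, whereas right padding pushes the real tokens out of the cache.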

dbogunowicz and others added 30 commits June 5, 2023 15:55

initial commit

48ac0ac

Update src/deepsparse/license.py

cf7f2b9

Merge branch 'main' into feature/damian/do_not_save_to_tmp

832630a

Merge branch 'main' into feature/damian/do_not_save_to_tmp

9958c83

limit to 150mb

e6d2b03

ready to review

7f9935b

initial commit

b1cf01b

[Codegen][ORT][Static Seq Length] TextGenerationPipeline (#946)

0a3f48d

* initial commit * coreys simplifications * finishing the second model static * ready, time for beautification * ready for review * moved the code to examples * fix eos logic * add argument num_tokens_to_generate

reimplementation for generative pipelines

22d2746

restore text generation from examples

7f1651d

[CodeGen] ONNX model loading to support >2Gb models / two engines (#991)

b85746d

refactor sucessfull

aadc608

Pipeline fully refactored, time to test engine support. Note: Sliding window not yet implemented!

58bc2b0

First iteration with Sage

d538444

Apply suggestions from code review

e19676b

ORT agrees with the Engine. But they both give not entirely correct result. Hey, this is good news still

7908b74

dynamic ORT vs static DS

4bc3472

pipeline handles OPT multitoken pass

c07f7ed

fixes to get static pipeline a little further along

fb77838

adjust shapes and slicing to enable static autoregressive pass - ISSUE: tokens past the base seq len are repeated

2097463

migrate from cache_length to positions input

5eb10a9

got if working for multitoken + single token scenario

9213f29

cleanup the pipeline

d9af004

further cleanup post merge

476f25d

Pipeline working for single-token inference only

fab44e4

do not load the onnx model with external files twice

d454e2f

pipeline never redundantly saves the external data + more robust tokenizer

1613e25

Stop saving tmp files, otherwise the engine looks for external files in the wrong place

b61055c

Left pad support

6ee25fc

dbogunowicz added 2 commits June 16, 2023 12:00

Merge branch 'main' into feature/damian/codegen_pipeline_clean

06b5246

Merge branch 'feature/damian/codegen_pipeline_clean' into feature/damian/text_gen_perplexity

2cab681

dbogunowicz commented Jun 16, 2023

src/deepsparse/pipeline.py (Outdated)

Clean a typo in the pipeline code

63b116b

dbogunowicz requested review from markurtz, bfineran and eldarkurtic June 16, 2023 11:24

dbogunowicz changed the base branch from feature/damian/codegen_pipeline_clean to feature/damian/fb_kv_cache June 28, 2023 13:09

dbogunowicz added 2 commits June 29, 2023 05:51

working implementation, time to cleanup

7001a6e

cleanup the old files

79251e6

dbogunowicz commented Jun 29, 2023

src/deepsparse/transformers/engines/nl_decoder_engine.py (Outdated)

Update src/deepsparse/transformers/engines/nl_decoder_engine.py

9efbdb6

dbogunowicz commented Jun 29, 2023

dbogunowicz added 5 commits June 29, 2023 12:19

ready for review

da5e93e

ready for testing

a680dac

assert proper padding on pipeline init

f83dcab

now also supporting kv cache perplexity. time for cleanup

e659c33

ready for review

cf74ad7

bfineran requested changes Jul 3, 2023

dbogunowicz added 2 commits July 3, 2023 15:12

correctly print engine info

853f876

work with left padding of the tokenizer

e8da07e

dbogunowicz requested review from bfineran and shubhra July 3, 2023 17:00

quality

58b12c8

bfineran approved these changes Jul 3, 2023

fix the multitoken inference

eecd232

dbogunowicz merged commit 10c804a into feature/damian/fb_kv_cache Jul 5, 2023

dbogunowicz deleted the feature/damian/text_gen_perplexity branch July 5, 2023 10:04

dbogunowicz mentioned this pull request Jul 5, 2023

[Feature Branch] KV Cache Interface #1083

Merged

4 tasks
