
[Feature Branch][LLM Testing] LLM Testing Suite #1227

Merged

Conversation

Contributor

@dbogunowicz dbogunowicz commented Sep 6, 2023

Covers the testing outlined in the internal LLM testing doc. Note: not the full scope is covered here, only the scope required to reliably test the LLM pipelines.

Here is the outline of the introduced tests:

  1. test_freeze_first_position:

    • Tests that the first token is "frozen" (never evicted) once the KV cache is full.
    • Verifies that the pipeline correctly handles this freezing behavior.
  2. test_ort_model:

    • Asserts that the ONNX model with KV cache support runs in ONNX Runtime and produces correct results.
    • Compares the results of the ONNX model with those of the Torch model.
  3. test_ort_single_token_prefill:

    • Tests the scenario where prompt preprocessing is performed by a single-token engine.
    • The KV Cache is never filled up.
    • KV Cache can be managed externally or internally.
  4. test_ort_multi_token_prefill:

    • Tests the scenario where prompt preprocessing is performed by a multi-token engine.
    • The KV Cache is never filled up.
    • KV Cache can be managed externally or internally.
  5. test_ort_generation_after_kv_cache_has_been_filled:

    • Tests the scenario where prompt preprocessing is performed by a multi-token engine.
    • The KV Cache is filled up (old entries are removed).
    • KV Cache can be managed externally or internally.
  6. test_deepsparse_single_token_prefill:

    • Tests the pipeline that uses deepsparse engine with single-token prompt preprocessing.
    • The KV Cache is never filled up.
    • KV Cache can be managed externally or internally.
  7. test_deepsparse_multi_token_prefill:

    • Tests the pipeline that uses deepsparse engine with multi-token prompt preprocessing.
    • The KV Cache is never filled up.
    • KV Cache can be managed externally or internally.
  8. test_deepsparse_generation_after_kv_cache_has_been_filled:

    • Tests the pipeline that uses deepsparse engine with multi-token prompt preprocessing.
    • The KV Cache is filled up (old entries are removed).
    • KV Cache can be managed externally or internally.
  9. test_run_same_prompt_multiple_times:

    • Tests the scenario where the same prompt is run multiple times.
    • Ensures that each run produces the same output.
  10. _test_output (a helper function):

    • Compares the pipeline's output with target values (logits, generated text, and cache).
    • Checks if the generated logits match the target logits.
    • Verifies that the generated text is as expected.
    • Validates the state of the cache.
    • If we generate output from the pipeline after the KV cache buffer has been filled, we use the maximum absolute difference between logits as the comparison criterion.
  11. _test_kv_cache_state (a helper function):

    • Compares the pipeline's cache state with the target cache state.
    • Specifically focuses on prompt cache entries and their alignment.
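The looser comparison criterion used by the `_test_output` helper (item 10) for the filled-cache case can be sketched roughly as follows. This is a minimal sketch with illustrative names, not the actual helper's signature:

```python
import numpy as np

def logits_match(generated: np.ndarray, target: np.ndarray,
                 kv_cache_filled: bool, atol: float = 1e-4,
                 max_diff_threshold: float = 1e-1) -> bool:
    """Compare generated logits against target logits.

    Once the KV cache buffer has been filled, old entries are evicted and
    small numerical drift accumulates, so a looser criterion is used: the
    maximum absolute difference between the two sets of logits.
    """
    if kv_cache_filled:
        return float(np.max(np.abs(generated - target))) < max_diff_threshold
    return bool(np.allclose(generated, target, atol=atol))
```

The threshold values here are placeholders; the actual tolerances live in the test suite.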

Note: This PR assumes changes from #1198
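The "KV cache is filled up" scenarios above, together with the first-token freezing checked in item 1, rest on a simple invariant: once the cache buffer is full, the oldest entries are evicted, except the first position, which stays frozen. A toy sketch of that bookkeeping (illustrative only; the real pipelines manage K/V tensors, not position lists):

```python
class ToyKVCache:
    """Fixed-capacity cache of token positions. Position 0 is 'frozen'
    (never evicted); when full, the oldest non-frozen entry is dropped."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.positions: list[int] = []  # stand-ins for cached K/V entries

    def add(self, position: int) -> None:
        if len(self.positions) == self.capacity:
            # evict the oldest entry after the frozen first position
            del self.positions[1]
        self.positions.append(position)


cache = ToyKVCache(capacity=4)
for pos in range(6):
    cache.add(pos)
# the frozen first position survives; the oldest non-frozen ones were evicted
```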

@dbogunowicz changed the base branch from main to feature/damian/testing_sources_truth September 6, 2023 10:49
Member

@bfineran left a comment


LGTM - great job with the inline comments

@dbogunowicz merged commit 8985a9b into feature/damian/testing_sources_truth Sep 7, 2023
@dbogunowicz deleted the feature/damian/new_tests branch September 7, 2023 09:03
dbogunowicz added a commit that referenced this pull request Sep 7, 2023
* initial commit

* finish creation of helper objects

* Update tests/conftest.py

* small refactor

* [Feature Branch][LLM Testing] LLM Testing Suite (#1227)

* Update README.md

* Update src/deepsparse/yolov8/README.md

* Update text_generation.py

* quality

* readability

* all tests passing

* added some full kv cache tests

* initial commit

* ready for review

* Delete tests/deepsparse/transformers/pipelines/proposal_text_generation_tests.md
dbogunowicz added a commit that referenced this pull request Sep 13, 2023
* initial commit

* initial commit

* [Feature Branch][LLM Testing] Create GroundTruthSource objects (#1219)

* initial commit

* finish creation of helper objects

* Update tests/conftest.py

* small refactor

* [Feature Branch][LLM Testing] LLM Testing Suite (#1227)

* Update README.md

* Update src/deepsparse/yolov8/README.md

* Update text_generation.py

* quality

* readability

* all tests passing

* added some full kv cache tests

* initial commit

* ready for review

* Delete tests/deepsparse/transformers/pipelines/proposal_text_generation_tests.md

* fix tests

* Dipika's comments plus adjusting the script to renamed variables

* remove ORT ground truth

* add OPT tests

* rebase and disable tests in GHA

* quality