
[KV-Cache Injection][MPT] Update config #1801

Merged: 8 commits into main, Nov 3, 2023
Conversation

dbogunowicz (Contributor) commented Oct 31, 2023

Initially, the KV Cache injection was tested on models that predated this diff:
https://huggingface.co/mosaicml/mpt-7b/commit/68e1a8e0ebb9b30f3c45c1ef6195980f29063ae2

Once this diff was applied, the MPT models started assuming a different order of dimensions for the KV cache tensors.

This meant that even though the KV cache injection terminated without raising errors, the user would hit errors when initializing an engine with the resulting ONNX model. This diff fixes the issue.
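To make the failure mode concrete, here is a minimal sketch of why a layout change breaks a downstream engine. The concrete axis names and orders below are assumptions for illustration, not taken from the MPT source; the point is that a cache produced under one layout must be permuted before a consumer expecting the other layout can use it.

```python
# Assumed (hypothetical) layouts before and after the upstream MPT commit.
OLD_LAYOUT = ("batch", "heads", "seq_len", "head_dim")
NEW_LAYOUT = ("batch", "seq_len", "heads", "head_dim")

def permutation(src, dst):
    """Axis permutation that maps a tensor in `src` layout to `dst` layout."""
    return tuple(src.index(axis) for axis in dst)

# This permutation could, e.g., be fed to an ONNX Transpose node's `perm`
# attribute when adapting the injected cache graph to the new layout.
perm = permutation(OLD_LAYOUT, NEW_LAYOUT)
```

If the injection logic bakes in the old layout while the exported model uses the new one, the engine sees mismatched cache shapes at initialization time, which matches the error the user observed.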

This diff should resolve @rsnm2's bug described here: https://app.asana.com/0/1205229323407165/1205806737893104

To prevent these kinds of issues in the future, I will soon be working on a feature of the export pathway that validates the correctness of KV cache injection.
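A validation pass along these lines could, for example, check that every injected cache input has the expected rank before the model is handed to an engine. This is a sketch under assumptions: the `past_key_values` input-name prefix and the rank of 4 are illustrative, not confirmed details of the planned feature.

```python
# Hypothetical post-injection sanity check on a model's input signature.
EXPECTED_CACHE_RANK = 4  # assumption: cache tensors are 4-dimensional

def validate_cache_inputs(input_shapes):
    """Return a list of error strings for cache inputs with an unexpected rank.

    input_shapes: dict mapping input names to shape tuples, e.g. as read
    from an ONNX model's graph inputs.
    """
    errors = []
    for name, shape in input_shapes.items():
        if not name.startswith("past_key_values"):
            continue  # skip non-cache inputs such as input_ids
        if len(shape) != EXPECTED_CACHE_RANK:
            errors.append(
                f"{name}: expected rank {EXPECTED_CACHE_RANK}, got {len(shape)}"
            )
    return errors
```

Running such a check right after injection would surface layout problems at export time instead of at engine initialization.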

Testing

As a result, the following pathway works once again:

  1. Pull the training directory for the MPT model (tested stubs: zoo:mpt-7b-mpt_pretrain-base_quantized and zoo:mpt-7b-gsm8k_mpt_pretrain-pruned80_quantized)
  2. Export using sparseml.transformers.export
  3. Run inference using a DeepSparse Pipeline

mgoin (Member) commented Oct 31, 2023

@dbogunowicz is this going to break injection on older exports, such as the models on sparsezoo?

dbogunowicz (Contributor, Author)

@dsikka @mgoin Yeah, good point guys.
This PR is a reactive attempt to enable kv cache injection on the onnx models that were obtained by:

  1. Pulling a SparseZoo training directory (tested stubs: zoo:mpt-7b-mpt_pretrain-base_quantized and zoo:mpt-7b-gsm8k_mpt_pretrain-pruned80_quantized)
  2. Exporting .pth to .onnx using SparseML
  3. Injecting the KV cache into the .onnx model using SparseML
  4. Running the resulting .onnx model in Deepsparse

I am still trying to understand whether the issue comes from the fact that we are now using the original transformers version (which was not the case a while ago).

dbogunowicz (Contributor, Author) commented Nov 2, 2023

@dsikka @mgoin @rsnm2 feel free to re-review. I have updated the PR description with the real cause of why this fix was needed. This fix is compatible with SparseZoo models.

@dbogunowicz dbogunowicz dismissed dsikka’s stale review November 3, 2023 13:35

Not relevant anymore

@dbogunowicz dbogunowicz merged commit 4e59d69 into main Nov 3, 2023
11 checks passed
@dbogunowicz dbogunowicz deleted the dbogunowicz-patch-2 branch November 3, 2023 13:36
bfineran pushed a commit that referenced this pull request Nov 16, 2023
* Update export.py

* quality

* Update configs.py

* add comment regarding MPT version
4 participants