[KV Cache Interface] Text Generation & Decoder Engine Implementation #1089

Merged: 101 commits, Jun 28, 2023
Changes from 80 commits
Commits (101)
48ac0ac
initial commit
dbogunowicz Jun 5, 2023
cf7f2b9
Update src/deepsparse/license.py
dbogunowicz Jun 5, 2023
832630a
Merge branch 'main' into feature/damian/do_not_save_to_tmp
dbogunowicz Jun 6, 2023
9958c83
Merge branch 'main' into feature/damian/do_not_save_to_tmp
dbogunowicz Jun 7, 2023
e6d2b03
limit to 150mb
dbogunowicz Jun 7, 2023
7f9935b
ready to review
dbogunowicz Jun 7, 2023
b1cf01b
initial commit
dbogunowicz Mar 2, 2023
0a3f48d
[Codegen][ORT][Static Seq Length] TextGenerationPipeline (#946)
dbogunowicz Mar 16, 2023
add4625
[CodeGen][Documentation] (#956)
dbogunowicz Mar 23, 2023
22d2746
reimplementation for generative pipelines
markurtz May 8, 2023
7f1651d
restore text generation from examples
dbogunowicz May 8, 2023
b85746d
[CodeGen] ONNX model loading to support >2Gb models / two engines (#991)
dbogunowicz May 8, 2023
aadc608
refactor sucessfull
dbogunowicz May 10, 2023
58bc2b0
Pipeline fully refactored, time to test engine support. Note: Sliding…
dbogunowicz May 11, 2023
d538444
First iteration with Sage
dbogunowicz May 11, 2023
e19676b
Apply suggestions from code review
dbogunowicz May 11, 2023
7908b74
ORT agrees with the Engine. But they both give not entirely correct r…
dbogunowicz May 11, 2023
4bc3472
dynamic ORT vs static DS
dbogunowicz May 12, 2023
c07f7ed
pipeline handles OPT multitoken pass
dbogunowicz May 16, 2023
fb77838
fixes to get static pipeline a little further along
bfineran May 16, 2023
2097463
adjust shapes and slicing to enable static autoregressive pass - ISSU…
bfineran May 17, 2023
5eb10a9
migrate from cache_length to positions input
bfineran May 18, 2023
9213f29
got if working for multitoken + single token scenario
dbogunowicz May 18, 2023
d9af004
cleanup the pipeline
dbogunowicz May 19, 2023
476f25d
further cleanup post merge
dbogunowicz May 19, 2023
fab44e4
Pipeline working for single-token inference only
dbogunowicz May 19, 2023
d454e2f
do not load the onnx model with external files twice
dbogunowicz May 19, 2023
1613e25
pipeline never redundantly saves the external data + more robust toke…
dbogunowicz May 19, 2023
b61055c
Stop saving tmp files, otherwise the engine looks for external files …
dbogunowicz May 19, 2023
6ee25fc
Left pad support
bfineran May 19, 2023
5d3004b
cleanup
dbogunowicz May 22, 2023
ace6fa5
cleanup2
dbogunowicz May 22, 2023
388586d
Add in pipeline timing
markurtz May 24, 2023
afd0139
add in force tokens logic
markurtz May 24, 2023
30eeda7
remove input validation for text generation pipelines
markurtz May 24, 2023
5882b56
remove multitoken support for now
markurtz May 24, 2023
4bbe33d
remove kv cache engine and other fixes
markurtz May 25, 2023
afa5746
nest input shape override
markurtz May 25, 2023
e2bb78c
comment out input shape override
markurtz May 25, 2023
2299009
add non batch override for ORT
markurtz May 25, 2023
2935b77
clean up generation pipeline
markurtz Jun 9, 2023
b89b156
Merge branch 'main' into feature/damian/do_not_save_to_tmp
dbogunowicz Jun 11, 2023
dc3d61b
initial commit
dbogunowicz Jun 5, 2023
a294265
Update src/deepsparse/license.py
dbogunowicz Jun 5, 2023
af97f2b
limit to 150mb
dbogunowicz Jun 7, 2023
c117788
ready to review
dbogunowicz Jun 7, 2023
4ad5f49
fix the erronous Makefile
dbogunowicz Jun 13, 2023
9e816bb
Merge branch 'feature/damian/do_not_save_to_tmp' of https://github.co…
dbogunowicz Jun 13, 2023
f97467f
perhaps fixed GHA
dbogunowicz Jun 13, 2023
6be8d87
take into consideration that GHA creates four files
dbogunowicz Jun 13, 2023
e2f088d
initial commit
dbogunowicz Jun 13, 2023
9fc6c64
Merge remote-tracking branch 'origin/feature/damian/do_not_save_to_tm…
dbogunowicz Jun 13, 2023
a610faf
tested with actual model
dbogunowicz Jun 13, 2023
347d1fb
remove val_inp argument
dbogunowicz Jun 13, 2023
e11027c
Update README.md
dbogunowicz Jun 13, 2023
a950910
Apply suggestions from code review
dbogunowicz Jun 13, 2023
c1d02dc
Update README.md
dbogunowicz Jun 13, 2023
711cdfb
Merge branch 'main' into feature/damian/codegen_pipeline_clean
dbogunowicz Jun 13, 2023
e602662
Merge branch 'main' into feature/damian/codegen_pipeline_clean
dbogunowicz Jun 14, 2023
06b5246
Merge branch 'main' into feature/damian/codegen_pipeline_clean
dbogunowicz Jun 16, 2023
5d59d23
initial implementation
dbogunowicz Jun 21, 2023
765a5f7
initial implementation
dbogunowicz Jun 21, 2023
15586a4
Revert "initial implementation"
dbogunowicz Jun 21, 2023
4d35779
rebase
dbogunowicz Jun 21, 2023
775c648
add tests
dbogunowicz Jun 21, 2023
54aec69
Merge branch 'feature/damian/codegen_pipeline_clean' of https://githu…
dbogunowicz Jun 21, 2023
25cdd38
strip down complexity out of text generation pipeline
dbogunowicz Jun 21, 2023
830a85e
Merge branch 'feature/damian/fb_kv_cache' into feature/damian/kv_cach…
dbogunowicz Jun 22, 2023
388e7ab
Merge branch 'feature/damian/kv_cache_ort' into feature/damian/decode…
dbogunowicz Jun 22, 2023
3970a7a
initial implementation
dbogunowicz Jun 22, 2023
7cdf939
Merge branch 'feature/damian/decoder_kv_cache' into feature/damian/de…
dbogunowicz Jun 22, 2023
950c653
In a good state for the review on 22.06
dbogunowicz Jun 22, 2023
ea82e99
remove files to make review easier
dbogunowicz Jun 22, 2023
016cac1
Revert "remove files to make review easier"
dbogunowicz Jun 22, 2023
c6efccd
Merge DecoderKVCache with KVCacheORT (KVCacheORT will not exist, it i…
dbogunowicz Jun 22, 2023
a19cf2e
Delete decoder_kv_cache.py
dbogunowicz Jun 22, 2023
c59da37
Delete test_decoder_kv_cache.py
dbogunowicz Jun 22, 2023
6d40c03
DecoderKVCache that manipulates cache state and additionally passes i…
dbogunowicz Jun 22, 2023
741f452
merge the functionalities of the engine and the decoder
dbogunowicz Jun 23, 2023
7b27abe
fix formatting of the transformers/utils/__init__.py
dbogunowicz Jun 23, 2023
db6b54b
improvements after the sync with Mark
dbogunowicz Jun 26, 2023
b3fb3b8
Merge branch 'feature/damian/fb_kv_cache' into feature/damian/kv_cach…
dbogunowicz Jun 26, 2023
47c0c4b
Merge remote-tracking branch 'origin/feature/damian/kv_cache_ort' int…
dbogunowicz Jun 26, 2023
76e332d
All changes applied, time for testing
dbogunowicz Jun 26, 2023
4791ed3
Merge remote-tracking branch 'origin/feature/damian/kv_cache_ort' int…
dbogunowicz Jun 26, 2023
8c5734b
Scaffolding to also run multitoken
dbogunowicz Jun 26, 2023
6c5daab
add delay_overwriting_inputs
dbogunowicz Jun 26, 2023
812408c
multitoken is working (although in limited capacity)
dbogunowicz Jun 27, 2023
952abda
fix no kv cache inference
dbogunowicz Jun 27, 2023
2ff4987
Do not create engine if not needed
dbogunowicz Jun 27, 2023
725a210
remove the prefill option
dbogunowicz Jun 27, 2023
108596e
fix docstring
dbogunowicz Jun 27, 2023
f6a9baf
remove prefill
dbogunowicz Jun 27, 2023
b25886a
fix the computation of total cache capacity
dbogunowicz Jun 27, 2023
53d7b70
Merge remote-tracking branch 'origin/feature/damian/kv_cache_ort' int…
dbogunowicz Jun 27, 2023
0b0f74a
merge
dbogunowicz Jun 27, 2023
4d6860a
Merge branch 'feature/damian/fb_kv_cache' into feature/damian/kv_cach…
dbogunowicz Jun 27, 2023
d68f045
Merge branch 'feature/damian/kv_cache_ort' into feature/damian/decode…
dbogunowicz Jun 27, 2023
759dc93
addressed PR comments
dbogunowicz Jun 28, 2023
3e1d32f
merge
dbogunowicz Jun 28, 2023
4c39d7f
quality
dbogunowicz Jun 28, 2023
49 changes: 2 additions & 47 deletions src/deepsparse/engine.py
@@ -54,7 +54,6 @@
"Scheduler",
"Context",
"MultiModelEngine",
"KVCacheEngine",
"BaseEngine",
]

@@ -292,6 +291,7 @@ def __init__(
self._num_streams,
self._scheduler.value,
None,
self._kv_cache_input_idxs,
)
else:
self._eng_net = LIB.deepsparse_engine(
Expand All @@ -301,6 +301,7 @@ def __init__(
self._num_streams,
self._scheduler.value,
None,
self._kv_cache_input_idxs,
)

def __call__(
@@ -845,52 +846,6 @@ def __init__(
)


class KVCacheEngine(Engine):
"""
Engine that can do kv caching.
"""

def __init__(
self,
model: Union[str, "Model", "File"],
batch_size: int = 1,
num_cores: int = None,
num_streams: int = None,
scheduler: Scheduler = None,
input_shapes: List[List[int]] = None,
kv_cache_bools: List[bool] = None,
prev_cache_length: int = 0,
):
BaseEngine.construct(
self, model, batch_size, num_cores, num_streams, scheduler, input_shapes
)

if kv_cache_bools is None:
# If no list was provided, then we assume all outputs except for the first are KV caches
# Note: In the future we can look at the names of outputs to be more sure
#
# Create a boolean list of every output of the model
output_names = get_output_names(self._model_path)
kv_cache_bools = [True for i in range(len(output_names))]
# Assume the first output is logits, and logits ought not to be cached
kv_cache_bools[0] = False

num_streams = _validate_num_streams(num_streams, self._num_cores)
if self._input_shapes:
raise NotImplementedError("Don't do this yet :)")
else:
self._eng_net = LIB.deepsparse_engine(
self._model_path,
self._batch_size,
self._num_cores,
num_streams,
self._scheduler.value,
None,
kv_cache_bools,
prev_cache_length,
)


def compile_model(
model: Union[str, "Model", "File"],
batch_size: int = 1,
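The `_kv_cache_input_idxs` argument threaded into `LIB.deepsparse_engine` above replaces the deleted `KVCacheEngine` and its `kv_cache_bools`/`prev_cache_length` pair. As a minimal sketch of where such indices could come from, assuming the cache inputs of the ONNX graph are identifiable by a `past_key_values` name prefix (the helper name and prefix are illustrative, not part of this diff):

```python
# Illustrative sketch only: derive KV cache input indices from an ONNX graph.
# Assumes cache inputs share a "past_key_values" name prefix; the engine's
# actual convention may differ.
import onnx


def kv_cache_input_idxs(onnx_path: str, prefix: str = "past_key_values"):
    # Skip external data; only the graph's input names are needed here
    model = onnx.load(onnx_path, load_external_data=False)
    return [
        idx
        for idx, graph_input in enumerate(model.graph.input)
        if graph_input.name.startswith(prefix)
    ]
```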
54 changes: 33 additions & 21 deletions src/deepsparse/pipeline.py
@@ -263,7 +263,7 @@ def __call__(self, *args, **kwargs) -> BaseModel:
batches = self.split_engine_inputs(engine_inputs, self._batch_size)

# submit split batches to engine threadpool
batch_outputs = list(self.executor.map(self.engine_forward, batches))
batch_outputs = [self.engine_forward(x) for x in batches]

# join together the batches of size `self._batch_size`
engine_outputs = self.join_engine_outputs(batch_outputs)
@@ -567,6 +567,34 @@ def _register_pipeline_tasks_decorator(pipeline_class: Pipeline):

return _register_pipeline_tasks_decorator

@staticmethod
def create_engine(
onnx_file_path: str,
engine_type: str,
engine_args: Dict,
context: Optional[Context] = None,
) -> Union[Engine, MultiModelEngine, ORTEngine]:
engine_type = engine_type.lower()

if engine_type == DEEPSPARSE_ENGINE:
if context is not None and isinstance(context, Context):
engine_args.pop("num_cores", None)
engine_args.pop("scheduler", None)
engine_args["context"] = context
return MultiModelEngine(
model=onnx_file_path,
**engine_args,
)
return Engine(onnx_file_path, **engine_args)

if engine_type == ORT_ENGINE:
return ORTEngine(onnx_file_path, **engine_args)

raise ValueError(
f"Unknown engine_type {engine_type}. Supported values include: "
f"{SUPPORTED_PIPELINE_ENGINES}"
)

@classmethod
def from_config(
cls,
@@ -791,26 +819,10 @@ def engine_forward(self, engine_inputs: List[numpy.ndarray]) -> List[numpy.ndarray]:
"""
return self.engine(engine_inputs)

def _initialize_engine(self) -> Union[Engine, ORTEngine]:
engine_type = self.engine_type.lower()

if engine_type == DEEPSPARSE_ENGINE:
if self.context is not None and isinstance(self.context, Context):
self._engine_args.pop("num_cores", None)
self._engine_args.pop("scheduler", None)
self._engine_args["context"] = self.context
return MultiModelEngine(
model=self.onnx_file_path,
**self._engine_args,
)
return Engine(self.onnx_file_path, **self._engine_args)
elif engine_type == ORT_ENGINE:
return ORTEngine(self.onnx_file_path, **self._engine_args)
else:
raise ValueError(
f"Unknown engine_type {self.engine_type}. Supported values include: "
f"{SUPPORTED_PIPELINE_ENGINES}"
)
def _initialize_engine(self) -> Union[Engine, MultiModelEngine, ORTEngine]:
return Pipeline.create_engine(
self.onnx_file_path, self.engine_type, self._engine_args, self.context
)

def _identifier(self):
# get pipeline identifier; used in the context of logging
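The refactor above centralizes engine construction: `_initialize_engine` now delegates to the new `Pipeline.create_engine` static method. A hypothetical call, where the model path is a placeholder and the engine-type strings are assumed to mirror the `DEEPSPARSE_ENGINE`/`ORT_ENGINE` constants referenced in the diff:

```python
from deepsparse import Pipeline

# Hypothetical usage; "/path/to/model.onnx" is a placeholder path.
# Passing a Context instance would route construction to MultiModelEngine.
engine = Pipeline.create_engine(
    onnx_file_path="/path/to/model.onnx",
    engine_type="deepsparse",  # assumed value of DEEPSPARSE_ENGINE
    engine_args={"batch_size": 1},
    context=None,
)
```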
23 changes: 23 additions & 0 deletions src/deepsparse/tasks.py
@@ -95,6 +95,12 @@ class SupportedTasks:
),
)

text_generation = namedtuple("text_generation", ["opt", "codegen", "bloom"])(
codegen=AliasedTask("codegen", []),
opt=AliasedTask("opt", []),
bloom=AliasedTask("bloom", []),
)
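The three task names registered above become creatable through the usual `Pipeline.create` entry point; a hypothetical example (the deployment path is a placeholder):

```python
from deepsparse import Pipeline

# "opt", "codegen", and "bloom" are the task names registered above
pipe = Pipeline.create(task="opt", model_path="/path/to/deployment")
```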

image_classification = namedtuple("image_classification", ["image_classification"])(
image_classification=AliasedTask(
"image_classification",
@@ -150,6 +156,9 @@ def check_register_task(
# custom task, register the CustomPipeline
import deepsparse.pipelines.custom_pipeline # noqa: F401

elif cls.is_text_generation(task):
import deepsparse.transformers.pipelines.text_generation # noqa: F401

elif cls.is_nlp(task):
# trigger transformers pipelines to register with Pipeline.register
import deepsparse.transformers.pipelines # noqa: F401
@@ -193,6 +202,20 @@ def check_register_task(
f"{list(all_tasks)}"
)

@classmethod
def is_text_generation(cls, task: str) -> bool:
"""
:param task: the name of the task to check whether it is a text generation task
such as codegen
:return: True if it is a text generation task, False otherwise
"""
return any(
[
text_generation_task.matches(task)
for text_generation_task in cls.text_generation
]
)
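A brief usage sketch of the new helper, mirroring the registration path added to `check_register_task` above:

```python
from deepsparse.tasks import SupportedTasks

if SupportedTasks.is_text_generation("opt"):
    # same import that check_register_task triggers for these tasks
    import deepsparse.transformers.pipelines.text_generation  # noqa: F401
```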

@classmethod
def is_nlp(cls, task: str) -> bool:
"""
50 changes: 47 additions & 3 deletions src/deepsparse/transformers/README.md
@@ -10,6 +10,7 @@ methods such as [pruning](https://neuralmagic.com/blog/pruning-overview/) and [q
These techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics.

This integration currently supports several fundamental NLP tasks:
- **Text Generation** - given the input prompt, generate an output text sequence (e.g. to fill in incomplete text or paraphrase part of the prompt)
- **Question Answering** - posing questions about a document
- **Sentiment Analysis** - assigning a sentiment to a piece of text
- **Text Classification** - assigning a label or class to a piece of text (e.g. duplicate question pairing)
@@ -30,10 +31,12 @@ compatible with our [hardware requirements](https://docs.neuralmagic.com/deepspa
By default, deploying a transformer with the DeepSparse Engine requires supplying the model in the ONNX format along with the HuggingFace supporting files.
This grants the engine the flexibility to serve any model in a framework-agnostic environment.

The DeepSparse pipelines require the following files within a folder on the local server to properly load a Transformers model:
In general, the DeepSparse pipelines require the following files within a folder on the local server to properly load a Transformers model (an example layout is shown after this list):
- `model.onnx`: The exported Transformers model in the [ONNX format](https://github.com/onnx/onnx).
- `tokenizer.json`: The [HuggingFace compatible tokenizer configuration](https://huggingface.co/docs/transformers/fast_tokenizers) used with the model.
- `model_kvcache.onnx` (optional): the ONNX model with KV Cache support (akin to the Transformers model exported with `use_cache = True`). Specific to the `text-generation` integration.
- `config.json`: The [HuggingFace compatible configuration file](https://huggingface.co/docs/transformers/main_classes/configuration) used with the model.
- `tokenizer_config.json`: The [HuggingFace compatible tokenizer configuration](https://huggingface.co/docs/transformers/fast_tokenizers) used with the model.
- `tokenizer.json`, `special_tokens_map.json`, `vocab.json`, `merges.txt` (optional): other files that may be required by the tokenizer.
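A hypothetical deployment folder containing the files above might look like this (`model_kvcache.onnx` applies only to `text-generation`):

```
trained_model/
├── model.onnx
├── model_kvcache.onnx      # optional, text-generation only
├── config.json
├── tokenizer_config.json
├── tokenizer.json
├── special_tokens_map.json
├── vocab.json
└── merges.txt
```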

Below we describe two ways to obtain the required structure.

Expand All @@ -48,7 +51,7 @@ sparseml.transformers.export_onnx --task question-answering --model_path model_p
```

This creates a `model.onnx` file in the directory of your `model_path` (e.g. `/trained_model/model.onnx`).
The `tokenizer.json` and `config.json` are stored under the `model_path` folder as well, so a DeepSparse pipeline ca be directly instantiated by using that folder after export (e.g. `/trained_model/`).
Any additional required files, such as `tokenizer.json` or `config.json`, are stored under the `model_path` folder as well, so a DeepSparse pipeline can be directly instantiated by using that folder after export (e.g. `/trained_model/`).

#### SparseZoo Stub
Alternatively, you can skip the process of the ONNX model export by using Neural Magic's [SparseZoo](https://sparsezoo.neuralmagic.com/). The SparseZoo contains pre-sparsified models and SparseZoo stubs enable you to reference any model on the SparseZoo in a convenient and predictable way.
@@ -137,6 +140,47 @@ response.text

>> '{"score":0.9534820914268494,"start":8,"end":14,"answer":"batman"}'
```
### Text Generation
The text generation task generates a sequence of tokens given a prompt. Popular text generation LLMs (Large Language Models) are used
for chat (the instruction-tuned models), code generation, text summarization, and filling in missing text. The following example uses a sparsified text generation
OPT model to complete the prompt.

[List of available SparseZoo Text Generation Models](https://sparsezoo.neuralmagic.com/?useCase=text_generation)

#### Python Pipeline
```python
from deepsparse import Pipeline

opt_pipeline = Pipeline.create(task="opt")

inference = opt_pipeline("Who is the president of the United States?")

>> 'The president of the United States is the head of the executive branch of government...'
```

#### HTTP Server
Spinning up:
```bash
deepsparse.server \
task text-generation \
--model_path # TODO: Pending until text generation models get uploaded to SparseZoo
```

Making a request:
```python
import requests

url = "http://localhost:5543/predict" # Server's port default to 5543

obj = {"sequence": "Who is the president of the United States?"}

response = requests.post(url, json=obj)
response.text

>> 'The president of the United States is the head of the executive branch of government...'
```

### Sentiment Analysis
The sentiment analysis task takes in a sentence and classifies its sentiment. The following example
15 changes: 15 additions & 0 deletions src/deepsparse/transformers/engines/__init__.py
@@ -0,0 +1,15 @@
# Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# flake8: noqa
from .nl_decoder_engine import *
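The wildcard import exposes whatever the new `nl_decoder_engine` module defines; hypothetically, assuming it exports an `NLDecoderEngine` class (the class name is an assumption, as this diff does not show the module's contents):

```python
# Assumed export; the actual symbol names live in nl_decoder_engine.py
from deepsparse.transformers.engines import NLDecoderEngine  # noqa: F401
```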