Merge pull request #131 from ELS-RD/feat/add-t5-support
Feat/add t5 support
ayoub-louati authored Jun 1, 2023
2 parents f6dde83 + 9eaa451 commit 6b88e24
Showing 23 changed files with 2,120 additions and 167 deletions.
29 changes: 16 additions & 13 deletions .github/workflows/python-app.yml
@@ -12,24 +12,27 @@ jobs:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v2

- name: Set up Python
uses: actions/setup-python@v2
- uses: actions/checkout@v3

- uses: actions/setup-python@v3
with:
python-version: "3.9"
python-version: '3.9'
cache: 'pip' # caching pip dependencies

- name: Install dependencies
run: |
pip install -U pip
python -m pip install --upgrade pip
pip install ".[CPU]" -f https://download.pytorch.org/whl/cpu/torch_stable.html
pip install sentence-transformers
- name: update pip
run: python -m pip install --upgrade pip

- name: linter and tests
- name: install test dependencies
run: |
make source_code_check_format
make test_ci
pip3 install sentence-transformers
pip3 install nvidia-pyindex
- name: install package
run: pip3 install ".[CPU]" --extra-index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.ngc.nvidia.c

- name: test
run: make test_ci

- name: read VERSION file
id: getversion
39 changes: 34 additions & 5 deletions Dockerfile
@@ -1,9 +1,38 @@
FROM nvcr.io/nvidia/tritonserver:22.07-py3

# see .dockerignore to check what is transferred
COPY . ./

RUN pip3 install -U pip && \
pip3 install nvidia-pyindex && \
pip3 install ".[GPU]" -f https://download.pytorch.org/whl/cu116/torch_stable.html --extra-index-url https://pypi.ngc.nvidia.com --no-cache-dir && \
pip3 install sentence-transformers notebook pytorch-quantization ipywidgets
RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
python3-dev \
python3-distutils \
python3-venv \
python3-pip && \
apt-get clean

ARG UID=1000
ARG GID=1000
RUN addgroup --gid $GID ubuntu && \
useradd -d /home/ubuntu -ms /bin/bash -g ubuntu -G sudo -u $UID ubuntu
## Switch to ubuntu user by default.
USER ubuntu

WORKDIR /build
RUN pip3 install -U pip --no-cache-dir && \
pip3 install --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu117 --no-cache-dir && \
pip3 install sentence-transformers notebook pytorch-quantization ipywidgets --no-cache-dir

RUN mkdir /syncback
WORKDIR /transformer_deploy

COPY ./setup.py ./setup.py
COPY ./requirements.txt ./requirements.txt
COPY ./requirements_gpu.txt ./requirements_gpu.txt
COPY ./src/__init__.py ./src/__init__.py
COPY ./src/transformer_deploy/__init__.py ./src/transformer_deploy/__init__.py

RUN pip3 install -r requirements.txt && \
pip3 install nvidia-pyindex --no-cache-dir && \
pip3 install -r requirements_gpu.txt

COPY ./ ./
69 changes: 59 additions & 10 deletions README.md
@@ -63,8 +63,8 @@ First, clone the repo as some commands below expect to find the `demo` folder:
git clone git@github.com:ELS-RD/transformer-deploy.git
cd transformer-deploy
# docker image may take a few minutes
docker pull ghcr.io/els-rd/transformer-deploy:0.5.4
docker pull ghcr.io/els-rd/transformer-deploy:0.6.0
```


### Classification/reranking (encoder model)

@@ -77,7 +77,7 @@ This will optimize models, generate Triton configuration and Triton folder layout

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.4 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
bash -c "cd /project && \
convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
--backend tensorrt onnx \
@@ -147,7 +147,7 @@ This will optimize models, generate Triton configuration and Triton folder layout
```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.4 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
bash -c "cd /project && \
convert_model -m \"kamalkraj/bert-base-cased-ner-conll2003\" \
--backend tensorrt onnx \
@@ -212,7 +212,7 @@ This will optimize models, generate Triton configuration and Triton folder layout

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.4 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
bash -c "cd /project && \
convert_model -m \"distilbert-base-cased-distilled-squad\" \
--backend tensorrt onnx \
@@ -280,7 +280,7 @@ a version >= V2.2.0 of sentence-transformers library.
```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.4 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
bash -c "cd /project && \
convert_model -m \"sentence-transformers/msmarco-distilbert-cos-v5\" \
--backend tensorrt onnx \
@@ -330,6 +330,9 @@ curl -X POST http://localhost:8000/v2/models/transformer_onnx_inference/version
Text generation seems to be the way to go for NLP.
Unfortunately, generative models are slow to run; below we will accelerate the most famous of them: GPT-2.
#### GPT example
We will start with a GPT-2 example, then move to the T5 model in the next section.
#### Optimize existing model
Like before, the command below will prepare the Triton inference server artifacts.
@@ -341,7 +344,7 @@ One point to have in mind is that Triton run:
```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.4 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
bash -c "cd /project && \
convert_model -m gpt2 \
--backend tensorrt onnx \
@@ -371,7 +374,7 @@ To optimize models which typically don't fit twice onto a single GPU, run the script

```shell
docker run -it --rm --shm-size=24g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.4 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
bash -c "cd /project && \
convert_model -m gpt2-medium \
--backend tensorrt onnx \
@@ -425,10 +428,56 @@ You may want to tweak it regarding your needs (default is set for greedy search
You may be interested in running optimized text generation in Python directly, without using any inference server:

```shell
docker run -p 8888:8888 -v $PWD/demo/generative-model:/project ghcr.io/els-rd/transformer-deploy:0.5.4 \
docker run -p 8888:8888 -v $PWD/demo/generative-model:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
bash -c "cd /project && jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root"
```

#### T5-small example
In this section, we present the conversion of the t5-small model.

#### Optimize existing large model

To optimize the model, run the script as follows:

```shell
docker run -it --rm --shm-size=24g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
bash -c "cd /project && \
convert_model -m t5-small \
--backend onnx \
--seq-len 16 256 256 \
--task text-generation \
--nb-measures 100 \
--generative-model t5 \
--output triton_models"
```
#### Run Nvidia Triton inference server

To run the decoding algorithm server side, we need to install `PyTorch` in the `Triton` Docker image.

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 8g \
-v $PWD/triton_models/:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
bash -c "pip install onnx onnxruntime-gpu transformers==4.21.3 git+https://github.com/ELS-RD/transformer-deploy torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html onnx onnxruntime-gpu && \
tritonserver --model-repository=/models"
```
To test text generation, you can try this request:
```shell
curl -X POST http://localhost:8000/v2/models/t5_model_generate/versions/1/infer --data-binary "@demo/generative-model/t5_query_body.bin" --header "Inference-Header-Content-Length: 181"
# output:
# {"model_name":"t5_model_generate","model_version":"1","outputs":[{"name":"OUTPUT_TEXT","datatype":"BYTES","shape":[],"data":["Mein Name mein Wolfgang Wolfgang und ich wohne in Berlin."]}]}
```
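The same request can also be sent from Python. A minimal sketch, assuming the server above is running locally and `t5_query_body.bin` has been generated by `demo/generative-model/t5_query_gen.py`:

```python
# Minimal sketch: send the prebuilt binary request body from Python instead of curl.
import requests

with open("demo/generative-model/t5_query_body.bin", "rb") as f:
    body = f.read()

response = requests.post(
    url="http://localhost:8000/v2/models/t5_model_generate/versions/1/infer",
    data=body,
    # length of the JSON part of the body; 181 matches the curl example above
    headers={"Inference-Header-Content-Length": "181"},
)
print(response.json())
```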
#### Query inference

Replace `transformer_onnx_generate` with `transformer_tensorrt_generate` to query the `TensorRT` engine.

```shell
curl -X POST http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \
--data-binary "@demo/infinity/seq2seq_query_body.bin" \
--header "Inference-Header-Content-Length: 176"
```

### Model quantization on GPU

Quantization is a generic method to get a 2X speedup on top of other inference optimizations.
@@ -440,7 +489,7 @@ It makes it easy to use.
To play with it, open this notebook:

```shell
docker run -p 8888:8888 -v $PWD/demo/quantization:/project ghcr.io/els-rd/transformer-deploy:0.5.4 \
docker run -p 8888:8888 -v $PWD/demo/quantization:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
bash -c "cd /project && jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root"
```

2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.5.4
0.6.0
Binary file added demo/generative-model/t5_query_body.bin
50 changes: 50 additions & 0 deletions demo/generative-model/t5_query_gen.py
@@ -0,0 +1,50 @@
import json
import struct

import requests


text: str = "My name is Wolfgang and I live in Berlin"

context_text: bytes = text.encode("UTF-8")

context_text_struct: bytes = struct.pack("<I", len(context_text)) + context_text
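# Triton binary protocol for BYTES tensors: each element is a 4-byte little-endian
# length prefix followed by the raw bytes (built just above with struct.pack).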

len_context_text_struct = len(context_text_struct)

data_struct = context_text_struct

request_data = {
"inputs": [
{
"name": "TEXT",
"shape": [1],
"datatype": "BYTES",
"parameters": {"binary_data_size": len_context_text_struct},
},
],
"outputs": [{"name": "OUTPUT_TEXT", "parameters": {"binary_data": False}}],
}

data = json.dumps(request_data).encode() + data_struct

print(data)


with open("t5_query_body.bin", "wb") as f:
f.write(data)


curl = f"""
curl -X POST http://localhost:8000/v2/models/t5-dec-if-node_onnx_generate/versions/1/infer \
--data-binary "@demo/generative-model/t5_query_body.bin" \
--header "Inference-Header-Content-Length: {len(json.dumps(request_data).encode())}"
"""
print(curl)


res = requests.post(
url="http://localhost:8000/v2/models/t5-dec-if-node_onnx_generate/versions/1/infer",
data=data,  # raw bytes: JSON header followed by the binary tensor payload
headers={"Inference-Header-Content-Length": str(len(json.dumps(request_data).encode()))},
)
print(res.text)
7 changes: 4 additions & 3 deletions requirements.txt
@@ -1,8 +1,8 @@
torch
onnx
onnx==1.13.1
tritonclient[all]
nvidia-pyindex
numpy
numpy==1.23.5
requests
transformers
sentencepiece
@@ -15,4 +15,5 @@ black[jupyter]
isort
flake8
onnxoptimizer
packaging
packaging
protobuf==3.20.3
2 changes: 1 addition & 1 deletion requirements_cpu.txt
@@ -1 +1 @@
onnxruntime==1.12.0
onnxruntime==1.13.1
2 changes: 1 addition & 1 deletion requirements_gpu.txt
@@ -1,4 +1,4 @@
onnxruntime-gpu==1.12.0
onnxruntime-gpu==1.13.1
nvidia-tensorrt==8.4.1.5
onnx_graphsurgeon
polygraphy
12 changes: 9 additions & 3 deletions src/transformer_deploy/backends/onnx_utils.py
@@ -168,20 +168,26 @@ def merge_autoregressive_model_graphs(model_cache_path: str, model_no_cache_path

# a new input to decide if we use past state or not
enable_cache_input = onnx.helper.make_tensor_value_info(
name="enable_cache", elem_type=onnx.TensorProto.BOOL, shape=[1]
name="enable_cache", elem_type=onnx.TensorProto.INT32, shape=[1]
)

cast_node = onnx.helper.make_node(
"Cast",
inputs=["enable_cache"],
outputs=["bool_enable_cache"],
to=getattr(onnx.TensorProto, "BOOL"),
)
if_node = onnx.helper.make_node(
op_type="If",
inputs=["enable_cache"],
inputs=["bool_enable_cache"],
outputs=[o.name for o in list(model_no_cache.graph.output)],
then_branch=graph_cache,
else_branch=graph_no_cache,
)

# final model which can disable its cache
if_graph_def: onnx.GraphProto = onnx.helper.make_graph(
nodes=[if_node],
nodes=[cast_node, if_node],
name="if-model",
inputs=list(model_cache.graph.input) + [enable_cache_input],
outputs=list(model_no_cache.graph.output),
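
For context, a minimal standalone sketch of the Cast → If pattern introduced above, with placeholder branches standing in for the actual cache / no-cache decoder graphs (all names below are illustrative):

```python
# Sketch: expose "enable_cache" as INT32 (easier to feed from Triton clients),
# cast it to BOOL, and use it to drive an If node. The two branches here are
# trivial placeholders, not the real cache / no-cache decoder subgraphs.
import onnx
from onnx import TensorProto, helper

x = helper.make_tensor_value_info("x", TensorProto.FLOAT, [1])
y = helper.make_tensor_value_info("y", TensorProto.FLOAT, [1])
enable_cache = helper.make_tensor_value_info("enable_cache", TensorProto.INT32, [1])

# Placeholder branches; both read the outer-scope input "x".
graph_cache = helper.make_graph(
    nodes=[helper.make_node("Identity", inputs=["x"], outputs=["y"])],
    name="with-cache", inputs=[], outputs=[y],
)
graph_no_cache = helper.make_graph(
    nodes=[helper.make_node("Neg", inputs=["x"], outputs=["y"])],
    name="without-cache", inputs=[], outputs=[y],
)

cast_node = helper.make_node(
    "Cast", inputs=["enable_cache"], outputs=["bool_enable_cache"], to=TensorProto.BOOL
)
if_node = helper.make_node(
    "If", inputs=["bool_enable_cache"], outputs=["y"],
    then_branch=graph_cache, else_branch=graph_no_cache,
)

if_graph = helper.make_graph(
    nodes=[cast_node, if_node], name="if-model", inputs=[x, enable_cache], outputs=[y]
)
model = helper.make_model(if_graph)
onnx.checker.check_model(model)
```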
24 changes: 24 additions & 0 deletions src/transformer_deploy/backends/pytorch_utils.py
@@ -50,6 +50,30 @@ def infer(inputs: Dict[str, torch.Tensor]) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
return infer


def infer_text_generation(
model: PreTrainedModel, run_on_cuda: bool, min_length: int, max_length: int, num_beams: int
) -> Callable[[Dict[str, torch.Tensor]], torch.Tensor]:
"""
Perform Pytorch inference for T5 text generation task
:param model: Text generation model
:param run_on_cuda: True if model should run on GPU
:param min_length: minimum text length to be generated
:param max_length: maximum text length to be generated
:param num_beams: number of beams used for text generation
:return: a function to perform inference
"""

def infer(inputs: Dict[str, torch.Tensor]) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
model_outputs = model.generate(
inputs=inputs["input_ids"], min_length=min_length, max_length=max_length, num_beams=num_beams
)
if run_on_cuda:
torch.cuda.synchronize()
return model_outputs

return infer


def infer_feature_extraction_pytorch(
model: PreTrainedModel, run_on_cuda: bool
) -> Callable[[Dict[str, torch.Tensor]], torch.Tensor]:
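
For reference, a hypothetical usage sketch of the `infer_text_generation` helper added above; the checkpoint name and generation parameters below are illustrative, not taken from the repository:

```python
# Hypothetical usage of infer_text_generation (illustrative parameters).
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

from transformer_deploy.backends.pytorch_utils import infer_text_generation

run_on_cuda = torch.cuda.is_available()
model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")
if run_on_cuda:
    model = model.cuda()

# Build the inference closure, then call it with tokenized inputs.
infer = infer_text_generation(model=model, run_on_cuda=run_on_cuda, min_length=2, max_length=32, num_beams=2)
encoded = tokenizer("translate English to German: My name is Wolfgang and I live in Berlin", return_tensors="pt")
inputs = {k: (v.cuda() if run_on_cuda else v) for k, v in encoded.items()}
output_ids = infer(inputs)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```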
