Updating examples for security tags (#3224)
* updating examples

* examples added

* spellcheck addition
udaij12 committed Jul 3, 2024
1 parent 55bcf9d commit 589b351
Showing 58 changed files with 201 additions and 174 deletions.
22 changes: 11 additions & 11 deletions examples/FasterTransformer_HuggingFace_Bert/README.md
@@ -1,16 +1,16 @@
## Faster Transformer

Batch inferencing with Transformers faces two challenges:

- Large batch sizes suffer from higher latency, while small and medium batch sizes become bound by kernel-launch latency.
- Padding wastes a lot of compute: an input of shape (batchsize, seq_length) must be padded to (batchsize, max_length), and the gap between avg_length and max_length results in considerable wasted computation; increasing the batch size worsens this situation.

[Faster Transformers](https://github.com/NVIDIA/FasterTransformer/blob/main/examples/pytorch/bert/run_glue.py) (FT) from Nvidia, along with [Efficient Transformers](https://github.com/bytedance/effective_transformer) (EFFT) built on top of FT, addresses the above two challenges by fusing the CUDA kernels and dynamically removing padding during computation. The current implementation of [Faster Transformers](https://github.com/NVIDIA/FasterTransformer/blob/main/examples/pytorch/bert/run_glue.py) supports BERT-like encoder and decoder layers. In this example, we show how to get a Torchscripted (traced) EFFT variant of a Bert model from HuggingFace (HF) for sequence classification and question answering, and how to serve it.
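As a rough illustration of the tracing step, the sketch below shows what tracing a HuggingFace BERT classifier to TorchScript looks like with the plain `transformers`/`torch` APIs. It is illustrative only; the walkthrough below uses FasterTransformer's `Bert_FT_trace.py` script, and exact tokenizer calls may differ across `transformers` versions.

```python
# Illustrative sketch only: trace a HuggingFace BERT classifier to TorchScript.
# The actual example below relies on FasterTransformer's Bert_FT_trace.py instead.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# torchscript=True makes the model return plain tuples, which jit.trace expects
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", torchscript=True)
model.eval()

encoded = tokenizer.encode_plus("A sample sentence to trace with.", return_tensors="pt")
traced = torch.jit.trace(model, (encoded["input_ids"], encoded["attention_mask"]))
traced.save("traced_bert_seq_classification.pt")
```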


### How to get a Torchscripted (traced) EFFT variant of an HF Bert model and serve it

**Requirements**

Running Faster Transformer is currently recommended through the [NVIDIA docker and NGC container](https://github.com/NVIDIA/FasterTransformer#requirements); it also requires a [Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/), [Turing](https://www.nvidia.com/en-us/geforce/turing/), or [Ampere](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/) based GPU. For this example we used a **g4dn.2xlarge** EC2 instance, which has a T4 GPU.

@@ -34,9 +34,9 @@ mkdir -p build

cd build

cmake -DSM=75 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON .. # -DSM values: 60 (P40), 61 (P4), 70 (V100), 75 (T4), 80 (A100); 75 is used here for the T4

make

pip install transformers==2.5.1

@@ -45,8 +45,8 @@ cd /workspace
# clone Torchserve to access examples
git clone https://github.com/pytorch/serve.git

# install torchserve
cd serve

pip install -r requirements/common.txt

@@ -99,7 +99,7 @@ mkdir model_store

mv BERTSeqClassification.mar model_store/

-torchserve --start --model-store model_store --models my_tc=BERTSeqClassification.mar --ncs
+torchserve --start --model-store model_store --models my_tc=BERTSeqClassification.mar --ncs --disable-token-auth --enable-model-api

curl -X POST http://127.0.0.1:8080/predictions/my_tc -T ../Huggingface_Transformers/Seq_classification_artifacts/sample_text_captum_input.txt

@@ -132,7 +132,7 @@ cd /workspace/FasterTransformer/build/
# --data_type can be fp16 or fp32
python pytorch/Bert_FT_trace.py --mode question_answering --model_name_or_path "/workspace/serve/Transformer_model" --tokenizer_name "bert-base-uncased" --batch_size 1 --data_type fp16 --model_type thsext

cd -

# make sure to change the ../Huggingface_Transformers/setup_config.json "save_mode":"torchscript"

@@ -142,10 +142,10 @@ mkdir model_store

mv BERTQA.mar model_store/

-torchserve --start --model-store model_store --models my_tc=BERTQA.mar --ncs
+torchserve --start --model-store model_store --models my_tc=BERTQA.mar --ncs --disable-token-auth --enable-model-api

curl -X POST http://127.0.0.1:8080/predictions/my_tc -T ../Huggingface_Transformers/QA_artifacts/sample_text_captum_input.txt

```
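For reference, the same inference request can be sent from Python instead of curl. A minimal sketch with the `requests` library, assuming the server is running on the default inference port 8080:

```python
# Minimal sketch: POST the sample input file to the TorchServe inference API,
# equivalent to the curl command above.
import requests

url = "http://127.0.0.1:8080/predictions/my_tc"
with open("../Huggingface_Transformers/QA_artifacts/sample_text_captum_input.txt", "rb") as f:
    response = requests.post(url, data=f)

print(response.status_code)
print(response.text)
```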

####
14 changes: 7 additions & 7 deletions examples/Huggingface_Transformers/README.md
@@ -114,7 +114,7 @@ To register the model on TorchServe using the above model archive file, we run t
```
mkdir model_store
mv BERTSeqClassification.mar model_store/
-torchserve --start --model-store model_store --models my_tc=BERTSeqClassification.mar --disable-token --ncs
+torchserve --start --model-store model_store --models my_tc=BERTSeqClassification.mar --disable-token --ncs --disable-token-auth --enable-model-api
```

@@ -164,7 +164,7 @@ torch-model-archiver --model-name BERTTokenClassification --version 1.0 --serial
```
mkdir model_store
mv BERTTokenClassification.mar model_store
-torchserve --start --model-store model_store --models my_tc=BERTTokenClassification.mar --disable-token --ncs
+torchserve --start --model-store model_store --models my_tc=BERTTokenClassification.mar --disable-token --ncs --disable-token-auth --enable-model-api
```

### Run an inference
@@ -208,7 +208,7 @@ torch-model-archiver --model-name BERTQA --version 1.0 --serialized-file Transfo
```
mkdir model_store
mv BERTQA.mar model_store
-torchserve --start --model-store model_store --models my_tc=BERTQA.mar --disable-token --ncs
+torchserve --start --model-store model_store --models my_tc=BERTQA.mar --disable-token --ncs --disable-token-auth --enable-model-api
```
### Run an inference
To run an inference: `curl -X POST http://127.0.0.1:8080/predictions/my_tc -T QA_artifacts/sample_text_captum_input.txt`
@@ -255,7 +255,7 @@ To register the model on TorchServe using the above model archive file, we run t
```
mkdir model_store
mv Textgeneration.mar model_store/
-torchserve --start --model-store model_store --models my_tc=Textgeneration.mar --disable-token --ncs
+torchserve --start --model-store model_store --models my_tc=Textgeneration.mar --disable-token --ncs --disable-token-auth --enable-model-api
```

### Run an inference
@@ -272,7 +272,7 @@ For batch inference the main difference is that you need set the batch size whil
```
mkdir model_store
mv BERTSeqClassification.mar model_store/
-torchserve --start --model-store model_store --disable-token --ncs
+torchserve --start --model-store model_store --disable-token --ncs --disable-token-auth --enable-model-api
curl -X POST "localhost:8081/models?model_name=BERTSeqClassification&url=BERTSeqClassification.mar&batch_size=4&max_batch_delay=5000&initial_workers=3&synchronous=true"
```
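The same registration can also be done from Python against the management API on port 8081. A minimal sketch using the identical parameters as the curl command above:

```python
# Minimal sketch: register BERTSeqClassification with batching enabled via the
# TorchServe management API (mirrors the curl command above).
import requests

params = {
    "model_name": "BERTSeqClassification",
    "url": "BERTSeqClassification.mar",
    "batch_size": 4,
    "max_batch_delay": 5000,
    "initial_workers": 3,
    "synchronous": "true",
}
response = requests.post("http://localhost:8081/models", params=params)
print(response.status_code, response.text)
```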
@@ -297,7 +297,7 @@ For batch inference the main difference is that you need set the batch size whil
```
mkdir model_store
mv BERTSeqClassification.mar model_store/
-torchserve --start --model-store model_store --ts-config config.properties --models BERTSeqClassification=BERTSeqClassification.mar
+torchserve --start --model-store model_store --ts-config config.properties --models BERTSeqClassification=BERTSeqClassification.mar --disable-token-auth --enable-model-api
```
Now, to run batch inference, the following command can be used:
@@ -377,7 +377,7 @@ To register the model on TorchServe using the above model archive file, we run t
```
mkdir model_store
mv Textgeneration.mar model_store/
-torchserve --start --model-store model_store --disable-token
+torchserve --start --model-store model_store --disable-token --disable-token-auth --enable-model-api
curl -X POST "localhost:8081/models?model_name=Textgeneration&url=Textgeneration.mar&batch_size=1&max_batch_delay=5000&initial_workers=1&synchronous=true"
```
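After registering, you can verify the worker and batch configuration through the management API's describe endpoint. A short sketch:

```python
# Minimal sketch: describe the registered Textgeneration model via the
# TorchServe management API to confirm workers and batch settings.
import requests

response = requests.get("http://localhost:8081/models/Textgeneration")
print(response.json())
```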

4 changes: 3 additions & 1 deletion examples/LLM/llama/chat_app/torchserve_server_app.py
@@ -10,7 +10,9 @@


 def start_server():
-    os.system("torchserve --start --model-store model_store --ncs")
+    os.system(
+        "torchserve --start --model-store model_store --ncs --disable-token-auth --enable-model-api"
+    )
     st.session_state.started = True
     st.session_state.stopped = False
     st.session_state.registered = False
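
For symmetry with `start_server`, a stop helper could look like the sketch below. It is not part of this commit; it simply mirrors the session-state flags used above.

```python
# Hypothetical companion helper, shown for illustration only.
def stop_server():
    os.system("torchserve --stop")
    st.session_state.stopped = True
    st.session_state.started = False
    st.session_state.registered = False
```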