Add single command LLM deployment (#3209)
* Move start_torchserve from test_utils into ts.launcher

* Move register model into launcher

* Re-add imports to register_model in test_utils

* Move vllm_handler into ts/torch_handler and add vllm to dependencies

* Register vllm_handler in model_archiver

* Remove gen_mars from launcher

* Add llm_launcher script + llm docker

* Use model_path as model id if path does not exist

* Add arguments to llm_launcher

* Wait for load command to finish

* Optionally skip waiting in launcher.stop

* Remove custom loading of model archiver

* Move llm_launcher to ts

* Set model load timeout to 10 min

* Finalize dockerfile.llm

* Adjust default value of ts launcher for token auth and model api

* Updated llm_launcher.py

* Add llm deployment to readme.md

* Added documentation for llm launcher

* Added section on supported models

* Enable tensor parallelism in llm launcher

* Add reference to go beyond quickstart

* Fix spellcheck lint

* HPC->HPU

* doc

* Move marsgen import below path changes

* Fix java formatting

* Remove gen_mar kw

* Fix error if model_store is used as positional argument

* Remove .queue
mreso committed Jun 28, 2024
1 parent affdcdd commit 160bee7
Showing 17 changed files with 427 additions and 104 deletions.
13 changes: 13 additions & 0 deletions README.md
@@ -56,6 +56,19 @@ docker pull pytorch/torchserve-nightly

Refer to [torchserve docker](docker/README.md) for details.

### 🤖 Quick Start LLM Deployment

```bash
#export token=<HUGGINGFACE_HUB_TOKEN>
docker build . -f docker/Dockerfile.llm -t ts/llm

docker run --rm -ti --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/llm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token

curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"
```

Refer to [LLM deployment](docs/llm_deployment.md) for details and other methods.

## ⚡ Why TorchServe
* Write once, run anywhere, on-prem, on-cloud, supports inference on CPUs, GPUs, AWS Inf1/Inf2/Trn1, Google Cloud TPUs, [Nvidia MPS](docs/nvidia_mps.md)
* [Model Management API](docs/management_api.md): multi model management with optimized worker to model allocation
9 changes: 9 additions & 0 deletions docker/Dockerfile.llm
@@ -0,0 +1,9 @@
FROM pytorch/torchserve-nightly:latest-gpu as server

USER root

RUN mkdir /data && chown -R model-server /data

USER model-server

ENTRYPOINT [ "python", "-m", "ts.llm_launcher", "--vllm_engine.download_dir", "/data" ]
1 change: 1 addition & 0 deletions docs/README.md
@@ -32,6 +32,7 @@ TorchServe is a performant, flexible and easy to use tool for serving PyTorch ea

## Examples

* [Deploying LLMs](./llm_deployment.md) - How to easily deploy LLMs using TorchServe
* [HuggingFace Language Model](https://github.com/pytorch/serve/blob/master/examples/Huggingface_Transformers/Transformer_handler_generalized.py) - This handler takes an input sentence and can return sequence classifications, token classifications or Q&A answers
* [Multi Modal Framework](https://github.com/pytorch/serve/blob/master/examples/MMF-activity-recognition/handler.py) - Build and deploy a classifier that combines text, audio and video input data
* [Dual Translation Workflow](https://github.com/pytorch/serve/tree/master/examples/Workflows/nmt_transformers_pipeline) -
73 changes: 73 additions & 0 deletions docs/llm_deployment.md
@@ -0,0 +1,73 @@
# LLM Deployment with TorchServe

This document describes how to easily serve large language models (LLMs) like Meta-Llama3 with TorchServe.
Besides a quick start guide using our vLLM integration, we also provide a list of examples that describe other methods of deploying LLMs with TorchServe.

## Quickstart LLM Deployment

TorchServe offers easy LLM deployment through its vLLM integration.
With our [LLM launcher script](https://github.com/pytorch/serve/blob/7a9b145204b4d7cfbb114fe737cf980221e6181e/ts/llm_launcher.py), users can deploy any model supported by vLLM with a single command.
The launcher can be used either standalone or in combination with our provided TorchServe GPU Docker image.

To launch the Docker container, we first need to build the image:
```bash
docker build . -f docker/Dockerfile.llm -t ts/llm
```

Models are usually loaded from the Hugging Face Hub and are cached in a [docker volume](https://docs.docker.com/storage/volumes/) for faster reloading.
If you want to access gated models like Meta-Llama3, you need to provide a Hugging Face Hub token:
```bash
export token=<HUGGINGFACE_HUB_TOKEN>
```

You can then launch a TorchServe instance serving your selected model:
```bash
docker run --rm -ti --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/llm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token
```

To change the model, you just need to exchange the identifier given to the `--model_id` parameter.
You can test the model with:
```bash
curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"
```
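
For example, to serve one of the Mistral models from the supported list below instead of Llama 3, only the identifier changes (a sketch; all other flags stay as in the quick start):
```bash
docker run --rm -ti --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/llm --model_id mistralai/Mistral-7B-Instruct-v0.1 --disable_token
```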

You can change any of the sampling arguments for the request by using the [vLLM SamplingParams keywords](https://docs.vllm.ai/en/stable/dev/sampling_params.html#vllm.SamplingParams).
E.g., to set the sampling temperature to 0 we can do:
```bash
curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50, "temperature": 0}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"
```

TorchServe's LLM launcher script offers some customization options as well.
To rename the model endpoint from `predictions/model` to something else, you can add `--model_name <SOME_NAME>` to the `docker run` command.
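
For example, to expose the model under `predictions/llama` instead (a sketch; `llama` is just an illustrative name):
```bash
docker run --rm -ti --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/llm --model_id meta-llama/Meta-Llama-3-8B-Instruct --model_name llama --disable_token

curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50}' --header "Content-Type: application/json" "http://localhost:8080/predictions/llama"
```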

The launcher script can also be used outside a Docker container. After installing TorchServe following the [installation instructions](https://github.com/pytorch/serve/blob/feature/single_cmd_llm_deployment/README.md#-quick-start-with-torchserve), run:
```bash
python -m ts.llm_launcher --disable_token
```

Please note that the launcher script as well as the Docker command will automatically run on all available GPUs, so make sure to restrict the number of visible devices by setting `CUDA_VISIBLE_DEVICES`.
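
For example, to restrict serving to the first GPU (a sketch; the Docker variant relies on the standard `--gpus` device selection syntax):
```bash
# Standalone launcher on GPU 0 only
CUDA_VISIBLE_DEVICES=0 python -m ts.llm_launcher --disable_token

# Docker: expose only GPU 0 to the container
docker run --rm -ti --gpus '"device=0"' -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/llm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token
```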

For further customization of the handler and for adding third-party dependencies, have a look at our [vLLM example](https://github.com/pytorch/serve/tree/master/examples/large_models/vllm).

## Supported models
The quickstart launcher should let you launch any model that is [supported by vLLM](https://docs.vllm.ai/en/latest/models/supported_models.html).
Here is a list of model identifiers tested by the TorchServe team:

* meta-llama/Meta-Llama-3-8B
* meta-llama/Meta-Llama-3-8B-Instruct
* meta-llama/Llama-2-7b-hf
* meta-llama/Llama-2-7b-chat-hf
* mistralai/Mistral-7B-v0.1
* mistralai/Mistral-7B-Instruct-v0.1
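
Any of these identifiers can be handed to the Docker command or the standalone launcher via `--model_id`; for example (a sketch, assuming the launcher accepts the same flags as in the quickstart):
```bash
python -m ts.llm_launcher --model_id meta-llama/Llama-2-7b-chat-hf --disable_token
```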

## Other ways to deploy LLMs with TorchServe

TorchServe offers a variety of examples of how to deploy large models.
Here is a list of the current examples:

* [Llama 2/3 chat bot](https://github.com/pytorch/serve/tree/master/examples/LLM/llama)
* [GPT-fast](https://github.com/pytorch/serve/tree/master/examples/large_models/gpt_fast)
* [Inferentia2](https://github.com/pytorch/serve/tree/master/examples/large_models/inferentia2)
* [IPEX optimized](https://github.com/pytorch/serve/tree/master/examples/large_models/ipex_llm_int8)
* [Tensor Parallel Llama](https://github.com/pytorch/serve/tree/master/examples/large_models/tp_llama)
* [VLLM Integration](https://github.com/pytorch/serve/tree/master/examples/large_models/vllm)
1 change: 0 additions & 1 deletion examples/large_models/vllm/config.properties
@@ -2,4 +2,3 @@ inference_address=http://127.0.0.1:8080
management_address=http://127.0.0.1:8081
metrics_address=http://127.0.0.1:8082
enable_envvars_config=true
install_py_dep_per_model=true
2 changes: 1 addition & 1 deletion examples/large_models/vllm/llama3/Readme.md
@@ -21,7 +21,7 @@ python ../../utils/Download_model.py --model_path model --model_name meta-llama/
Add the downloaded path to "model_path:" in `model-config.yaml` and run the following.

```bash
torch-model-archiver --model-name llama3-8b --version 1.0 --handler ../base_vllm_handler.py --config-file model-config.yaml -r ../requirements.txt --archive-format no-archive
torch-model-archiver --model-name llama3-8b --version 1.0 --handler vllm_handler --config-file model-config.yaml --archive-format no-archive
mv model llama3-8b
```

2 changes: 1 addition & 1 deletion examples/large_models/vllm/lora/Readme.md
@@ -24,7 +24,7 @@ cd ..
Add the downloaded path to "model_path:" and "adapter_1:" in `model-config.yaml` and run the following.

```bash
torch-model-archiver --model-name llama-7b-lora --version 1.0 --handler ../base_vllm_handler.py --config-file model-config.yaml -r ../requirements.txt --archive-format no-archive
torch-model-archiver --model-name llama-7b-lora --version 1.0 --handler vllm_handler --config-file model-config.yaml --archive-format no-archive
mv model llama-7b-lora
mv adapters llama-7b-lora
```
2 changes: 1 addition & 1 deletion examples/large_models/vllm/mistral/Readme.md
@@ -21,7 +21,7 @@ python ../../utils/Download_model.py --model_path model --model_name mistralai/M
Add the downloaded path to "model_path:" in `model-config.yaml` and run the following.

```bash
torch-model-archiver --model-name mistral --version 1.0 --handler ../base_vllm_handler.py --config-file model-config.yaml -r ../requirements.txt --archive-format no-archive
torch-model-archiver --model-name mistral --version 1.0 --handler vllm_handler --config-file model-config.yaml --archive-format no-archive
mv model mistral
```

1 change: 0 additions & 1 deletion examples/large_models/vllm/requirements.txt

This file was deleted.

@@ -33,8 +33,10 @@
public class AsyncWorkerThread extends WorkerThread {
// protected ConcurrentHashMap requestsInBackend;
protected static final Logger logger = LoggerFactory.getLogger(AsyncWorkerThread.class);
protected static final long MODEL_LOAD_TIMEOUT = 10L;

protected boolean loadingFinished;
protected CountDownLatch latch;

public AsyncWorkerThread(
ConfigManager configManager,
@@ -75,6 +77,17 @@ public void run() {
try {
backendChannel.get(0).writeAndFlush(req).sync();
logger.debug("Successfully flushed req");

if (loadingFinished == false) {
latch = new CountDownLatch(1);
if (!latch.await(MODEL_LOAD_TIMEOUT, TimeUnit.MINUTES)) {
throw new WorkerInitializationException(
"Worker did not load the model within"
+ MODEL_LOAD_TIMEOUT
+ " mins");
}
}

} catch (InterruptedException e) {
logger.error("Failed to send request to backend", e);
}
@@ -240,6 +253,7 @@ public void channelRead0(ChannelHandlerContext ctx, ModelWorkerResponse msg) {
setState(WorkerState.WORKER_MODEL_LOADED, HttpURLConnection.HTTP_OK);
backoffIdx = 0;
loadingFinished = true;
latch.countDown();
} else {
setState(WorkerState.WORKER_ERROR, msg.getCode());
}
1 change: 1 addition & 0 deletions model-archiver/model_archiver/model_packaging_utils.py
@@ -34,6 +34,7 @@
"object_detector": "vision",
"image_segmenter": "vision",
"dali_image_classifier": "vision",
"vllm_handler": "text",
}

MODEL_SERVER_VERSION = "1.0"
1 change: 1 addition & 0 deletions requirements/torch_linux.txt
@@ -5,3 +5,4 @@ torch==2.3.0+cpu; sys_platform == 'linux'
torchvision==0.18.0+cpu; sys_platform == 'linux'
torchtext==0.18.0; sys_platform == 'linux'
torchaudio==2.3.0+cpu; sys_platform == 'linux'
vllm==0.5.0; sys_platform == 'linux'
115 changes: 18 additions & 97 deletions test/pytest/test_utils.py
@@ -5,103 +5,25 @@
import subprocess
import sys
import tempfile
import threading
from io import TextIOWrapper
from os import path
from pathlib import Path
from queue import Queue
from subprocess import PIPE, STDOUT, Popen

import orjson
import requests

# To help discover marsgen modules
REPO_ROOT = os.path.join(os.path.dirname(os.path.abspath(__file__)), "../../")
sys.path.append(REPO_ROOT)

from ts.launcher import register_model, register_model_with_params, start # noqa
from ts.launcher import stop as stop_torchserve
from ts_scripts import marsgen as mg

ROOT_DIR = os.path.join(tempfile.gettempdir(), "workspace")
MODEL_STORE = path.join(ROOT_DIR, "model_store/")
CODEBUILD_WD = path.abspath(path.join(__file__, "../../.."))


class PrintTillTheEnd(threading.Thread):
def __init__(self, queue):
super().__init__()
self._queue = queue

def run(self):
while True:
line = self._queue.get()
if not line:
break
print(line.strip())


class Tee(threading.Thread):
def __init__(self, reader):
super().__init__()
self.reader = reader
self.queue1 = Queue()
self.queue2 = Queue()

def run(self):
for line in self.reader:
self.queue1.put(line)
self.queue2.put(line)
self.queue1.put(None)
self.queue2.put(None)


def start_torchserve(
model_store=None,
snapshot_file=None,
no_config_snapshots=False,
gen_mar=True,
plugin_folder=None,
disable_token=True,
models=None,
model_api_enabled=True,
):
stop_torchserve()
crate_mar_file_table()
cmd = ["torchserve", "--start"]
model_store = model_store if model_store else MODEL_STORE
if gen_mar:
mg.gen_mar(model_store)
cmd.extend(["--model-store", model_store])
if plugin_folder:
cmd.extend(["--plugins-path", plugin_folder])
if snapshot_file:
cmd.extend(["--ts-config", snapshot_file])
if no_config_snapshots:
cmd.extend(["--no-config-snapshots"])
if disable_token:
cmd.append("--disable-token")
if models:
cmd.extend(["--models", models])
if model_api_enabled:
cmd.extend(["--model-api-enabled"])
print(cmd)

p = Popen(cmd, stdin=PIPE, stdout=PIPE, stderr=STDOUT)
for line in p.stdout:
print(line.decode("utf8").strip())
if "Model server started" in str(line).strip():
break

splitter = Tee(TextIOWrapper(p.stdout))
splitter.start()
print_thread = PrintTillTheEnd(splitter.queue1)
print_thread.start()

return splitter.queue2


def stop_torchserve():
subprocess.run(["torchserve", "--stop", "--foreground"])


def delete_all_snapshots():
for f in glob.glob("logs/config/*"):
os.remove(f)
@@ -115,27 +37,26 @@ def delete_model_store(model_store=None):
os.remove(f)


def start_torchserve(*args, **kwargs):
create_mar_file_table()
# In case someone uses model_store as positional argument
if len(args) == 0:
kwargs.update({"model_store": kwargs.get("model_store", MODEL_STORE)})
if kwargs.get("gen_mar", True):
mg.gen_mar(kwargs.get("model_store"))
if "gen_mar" in kwargs:
del kwargs["gen_mar"]
kwargs.update({"disable_token": kwargs.get("disable_token", True)})
kwargs.update({"model_api_enabled": kwargs.get("model_api_enabled", True)})
return start(*args, **kwargs)


def torchserve_cleanup():
stop_torchserve()
delete_model_store()
delete_all_snapshots()


def register_model(model_name, url):
params = (
("model_name", model_name),
("url", url),
("initial_workers", "1"),
("synchronous", "true"),
)
return register_model_with_params(params)


def register_model_with_params(params):
response = requests.post("http://localhost:8081/models", params=params)
return response


def unregister_model(model_name):
response = requests.delete("http://localhost:8081/models/{}".format(model_name))
return response
@@ -163,7 +84,7 @@ def delete_mar_file_from_model_store(model_store=None, model_mar=None):
mar_file_table = {}


def crate_mar_file_table():
def create_mar_file_table():
if not mar_file_table:
with open(
os.path.join(os.path.dirname(__file__), *environment_json.split("/")), "rb"