Add single command LLM deployment (#3209)
* Move start_torchserve from test_utils into ts.launcher

* Move register model into launcher

* Re-add imports to register_model in test_utils

* Move vllm_handler into ts/torch_handler and add vllm to dependencies

* Register vllm_handler in model_archiver

* Remove gen_mars from launcher

* Add llm_launcher script + llm docker

* Use model_path as model id if path does not exist

* Add arguments to llm_launcher

* Wait for load command to finish

* Optionally skip waiting in launcher.stop

* Remove custom loading of model archiver

* Move llm_launcher to ts

* Set model load timeout to 10 min

* Finalize dockerfile.llm

* Adjust default value of ts launcher for token auth and model api

* Updated llm_launcher.py

* Add llm deployment to readme.md

* Added documentation for llm launcher

* Added section on supported models

* Enable tensor parallelism in llm launcher

* Add reference to go beyond quickstart

* Fix spellcheck lint

* HPC->HPU

* doc

* Move marsgen import below path changes

* Fix java formatting

* Remove gen_mar kw

* Fix error if model_store is used as positional argument

* Remove .queue
mreso committed Jun 28, 2024
1 parent affdcdd commit 160bee7
Showing 17 changed files with 427 additions and 104 deletions.
13 changes: 13 additions & 0 deletions README.md
@@ -56,6 +56,19 @@ docker pull pytorch/torchserve-nightly

Refer to [torchserve docker](docker/README.md) for details.

### 🤖 Quick Start LLM Deployment

```bash
#export token=<HUGGINGFACE_HUB_TOKEN>
docker build . -f docker/Dockerfile.llm -t ts/llm

docker run --rm -ti --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/llm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token

curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"
```

Refer to [LLM deployment](docs/llm_deployment.md) for details and other methods.

## ⚡ Why TorchServe
* Write once, run anywhere, on-prem, on-cloud, supports inference on CPUs, GPUs, AWS Inf1/Inf2/Trn1, Google Cloud TPUs, [Nvidia MPS](docs/nvidia_mps.md)
* [Model Management API](docs/management_api.md): multi model management with optimized worker to model allocation
9 changes: 9 additions & 0 deletions docker/Dockerfile.llm
@@ -0,0 +1,9 @@
FROM pytorch/torchserve-nightly:latest-gpu as server

USER root

RUN mkdir /data && chown -R model-server /data

USER model-server

ENTRYPOINT [ "python", "-m", "ts.llm_launcher", "--vllm_engine.download_dir", "/data" ]
1 change: 1 addition & 0 deletions docs/README.md
@@ -32,6 +32,7 @@ TorchServe is a performant, flexible and easy to use tool for serving PyTorch ea

## Examples

* [Deploying LLMs](./llm_deployment.md) - How to easily deploy LLMs using TorchServe
* [HuggingFace Language Model](https://github.com/pytorch/serve/blob/master/examples/Huggingface_Transformers/Transformer_handler_generalized.py) - This handler takes an input sentence and can return sequence classifications, token classifications or Q&A answers
* [Multi Modal Framework](https://github.com/pytorch/serve/blob/master/examples/MMF-activity-recognition/handler.py) - Build and deploy a classifier that combines text, audio and video input data
* [Dual Translation Workflow](https://github.com/pytorch/serve/tree/master/examples/Workflows/nmt_transformers_pipeline) -
73 changes: 73 additions & 0 deletions docs/llm_deployment.md
@@ -0,0 +1,73 @@
# LLM Deployment with TorchServe

This document describes how to easily serve large language models (LLMs) like Meta-Llama3 with TorchServe.
Besides a quick start guide using our vLLM integration, we also provide a list of examples that describe other methods of deploying LLMs with TorchServe.

## Quickstart LLM Deployment

TorchServe offers easy LLM deployment through its vLLM integration.
With our [LLM launcher script](https://github.com/pytorch/serve/blob/7a9b145204b4d7cfbb114fe737cf980221e6181e/ts/llm_launcher.py), users can deploy any model supported by vLLM with a single command.
The launcher can be used either standalone or in combination with our provided TorchServe GPU Docker image.

To launch the Docker container, we first need to build the image:
```bash
docker build . -f docker/Dockerfile.llm -t ts/llm
```

Models are usually loaded from the Hugging Face Hub and are cached in a [docker volume](https://docs.docker.com/storage/volumes/) for faster reloading.
If you want to access gated models like Meta-Llama3, you need to provide a Hugging Face Hub token:
```bash
export token=<HUGGINGFACE_HUB_TOKEN>
```

You can then launch a TorchServe instance serving your selected model:
```bash
docker run --rm -ti --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/llm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token
```

To change the model, you just need to exchange the identifier given to the `--model_id` parameter.
You can test the model with:
```bash
curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"
```
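
For example, to serve one of the Mistral models from the supported list below instead of Llama 3, only the identifier changes (a sketch; all other flags stay as in the quick start):
```bash
docker run --rm -ti --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/llm --model_id mistralai/Mistral-7B-Instruct-v0.1 --disable_token
```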

You can change any of the sampling arguments for the request by using the [vLLM SamplingParams keywords](https://docs.vllm.ai/en/stable/dev/sampling_params.html#vllm.SamplingParams).
E.g., to set the sampling temperature to 0 we can do:
```bash
curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50, "temperature": 0}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"
```

TorchServe's LLM launcher script offers some customization options as well.
To rename the model endpoint from `predictions/model` to something else, you can add `--model_name <SOME_NAME>` to the `docker run` command.
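
For example, to expose the model under `predictions/llama` instead (a sketch; `llama` is just an illustrative name):
```bash
docker run --rm -ti --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/llm --model_id meta-llama/Meta-Llama-3-8B-Instruct --model_name llama --disable_token

curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50}' --header "Content-Type: application/json" "http://localhost:8080/predictions/llama"
```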

The launcher script can also be used outside a Docker container. After installing TorchServe following the [installation instructions](https://github.com/pytorch/serve/blob/feature/single_cmd_llm_deployment/README.md#-quick-start-with-torchserve), run:
```bash
python -m ts.llm_launcher --disable_token
```

Please note that the launcher script as well as the Docker command will automatically run on all available GPUs, so make sure to restrict the number of visible devices by setting `CUDA_VISIBLE_DEVICES`.
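
For example, to restrict serving to the first GPU (a sketch; the Docker variant relies on the standard `--gpus` device selection syntax):
```bash
# Standalone launcher on GPU 0 only
CUDA_VISIBLE_DEVICES=0 python -m ts.llm_launcher --disable_token

# Docker: expose only GPU 0 to the container
docker run --rm -ti --gpus '"device=0"' -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/llm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token
```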

For further customization of the handler and for adding third-party dependencies, have a look at our [vLLM example](https://github.com/pytorch/serve/tree/master/examples/large_models/vllm).

## Supported models
The quickstart launcher should let you launch any model that is [supported by vLLM](https://docs.vllm.ai/en/latest/models/supported_models.html).
Here is a list of model identifiers tested by the TorchServe team:

* meta-llama/Meta-Llama-3-8B
* meta-llama/Meta-Llama-3-8B-Instruct
* meta-llama/Llama-2-7b-hf
* meta-llama/Llama-2-7b-chat-hf
* mistralai/Mistral-7B-v0.1
* mistralai/Mistral-7B-Instruct-v0.1
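
Any of these identifiers can be handed to the Docker command or the standalone launcher via `--model_id`; for example (a sketch, assuming the launcher accepts the same flags as in the quickstart):
```bash
python -m ts.llm_launcher --model_id meta-llama/Llama-2-7b-chat-hf --disable_token
```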

## Other ways to deploy LLMs with TorchServe

TorchServe offers a variety of examples of how to deploy large models.
Here is a list of the current examples:

* [Llama 2/3 chat bot](https://github.com/pytorch/serve/tree/master/examples/LLM/llama)
* [GPT-fast](https://github.com/pytorch/serve/tree/master/examples/large_models/gpt_fast)
* [Inferentia2](https://github.com/pytorch/serve/tree/master/examples/large_models/inferentia2)
* [IPEX optimized](https://github.com/pytorch/serve/tree/master/examples/large_models/ipex_llm_int8)
* [Tensor Parallel Llama](https://github.com/pytorch/serve/tree/master/examples/large_models/tp_llama)
* [VLLM Integration](https://github.com/pytorch/serve/tree/master/examples/large_models/vllm)
1 change: 0 additions & 1 deletion examples/large_models/vllm/config.properties
@@ -2,4 +2,3 @@ inference_address=http://127.0.0.1:8080
management_address=http://127.0.0.1:8081
metrics_address=http://127.0.0.1:8082
enable_envvars_config=true
install_py_dep_per_model=true
2 changes: 1 addition & 1 deletion examples/large_models/vllm/llama3/Readme.md
@@ -21,7 +21,7 @@ python ../../utils/Download_model.py --model_path model --model_name meta-llama/
Add the downloaded path to "model_path:" in `model-config.yaml` and run the following.

```bash
torch-model-archiver --model-name llama3-8b --version 1.0 --handler ../base_vllm_handler.py --config-file model-config.yaml -r ../requirements.txt --archive-format no-archive
torch-model-archiver --model-name llama3-8b --version 1.0 --handler vllm_handler --config-file model-config.yaml --archive-format no-archive
mv model llama3-8b
```

2 changes: 1 addition & 1 deletion examples/large_models/vllm/lora/Readme.md
@@ -24,7 +24,7 @@ cd ..
Add the downloaded path to "model_path:" and "adapter_1:" in `model-config.yaml` and run the following.

```bash
torch-model-archiver --model-name llama-7b-lora --version 1.0 --handler ../base_vllm_handler.py --config-file model-config.yaml -r ../requirements.txt --archive-format no-archive
torch-model-archiver --model-name llama-7b-lora --version 1.0 --handler vllm_handler --config-file model-config.yaml --archive-format no-archive
mv model llama-7b-lora
mv adapters llama-7b-lora
```
2 changes: 1 addition & 1 deletion examples/large_models/vllm/mistral/Readme.md
@@ -21,7 +21,7 @@ python ../../utils/Download_model.py --model_path model --model_name mistralai/M
Add the downloaded path to "model_path:" in `model-config.yaml` and run the following.

```bash
torch-model-archiver --model-name mistral --version 1.0 --handler ../base_vllm_handler.py --config-file model-config.yaml -r ../requirements.txt --archive-format no-archive
torch-model-archiver --model-name mistral --version 1.0 --handler vllm_handler --config-file model-config.yaml --archive-format no-archive
mv model mistral
```

1 change: 0 additions & 1 deletion examples/large_models/vllm/requirements.txt

This file was deleted.

@@ -33,8 +33,10 @@
public class AsyncWorkerThread extends WorkerThread {
// protected ConcurrentHashMap requestsInBackend;
protected static final Logger logger = LoggerFactory.getLogger(AsyncWorkerThread.class);
protected static final long MODEL_LOAD_TIMEOUT = 10L;

protected boolean loadingFinished;
protected CountDownLatch latch;

public AsyncWorkerThread(
ConfigManager configManager,
@@ -75,6 +77,17 @@ public void run() {
try {
backendChannel.get(0).writeAndFlush(req).sync();
logger.debug("Successfully flushed req");

if (loadingFinished == false) {
latch = new CountDownLatch(1);
if (!latch.await(MODEL_LOAD_TIMEOUT, TimeUnit.MINUTES)) {
throw new WorkerInitializationException(
"Worker did not load the model within"
+ MODEL_LOAD_TIMEOUT
+ " mins");
}
}

} catch (InterruptedException e) {
logger.error("Failed to send request to backend", e);
}
@@ -240,6 +253,7 @@ public void channelRead0(ChannelHandlerContext ctx, ModelWorkerResponse msg) {
setState(WorkerState.WORKER_MODEL_LOADED, HttpURLConnection.HTTP_OK);
backoffIdx = 0;
loadingFinished = true;
latch.countDown();
} else {
setState(WorkerState.WORKER_ERROR, msg.getCode());
}
1 change: 1 addition & 0 deletions model-archiver/model_archiver/model_packaging_utils.py
@@ -34,6 +34,7 @@
"object_detector": "vision",
"image_segmenter": "vision",
"dali_image_classifier": "vision",
"vllm_handler": "text",
}

MODEL_SERVER_VERSION = "1.0"
1 change: 1 addition & 0 deletions requirements/torch_linux.txt
@@ -5,3 +5,4 @@ torch==2.3.0+cpu; sys_platform == 'linux'
torchvision==0.18.0+cpu; sys_platform == 'linux'
torchtext==0.18.0; sys_platform == 'linux'
torchaudio==2.3.0+cpu; sys_platform == 'linux'
vllm==0.5.0; sys_platform == 'linux'
115 changes: 18 additions & 97 deletions test/pytest/test_utils.py
@@ -5,103 +5,25 @@
import subprocess
import sys
import tempfile
import threading
from io import TextIOWrapper
from os import path
from pathlib import Path
from queue import Queue
from subprocess import PIPE, STDOUT, Popen

import orjson
import requests

# To help discover marsgen modules
REPO_ROOT = os.path.join(os.path.dirname(os.path.abspath(__file__)), "../../")
sys.path.append(REPO_ROOT)

from ts.launcher import register_model, register_model_with_params, start # noqa
from ts.launcher import stop as stop_torchserve
from ts_scripts import marsgen as mg

ROOT_DIR = os.path.join(tempfile.gettempdir(), "workspace")
MODEL_STORE = path.join(ROOT_DIR, "model_store/")
CODEBUILD_WD = path.abspath(path.join(__file__, "../../.."))


class PrintTillTheEnd(threading.Thread):
def __init__(self, queue):
super().__init__()
self._queue = queue

def run(self):
while True:
line = self._queue.get()
if not line:
break
print(line.strip())


class Tee(threading.Thread):
def __init__(self, reader):
super().__init__()
self.reader = reader
self.queue1 = Queue()
self.queue2 = Queue()

def run(self):
for line in self.reader:
self.queue1.put(line)
self.queue2.put(line)
self.queue1.put(None)
self.queue2.put(None)


def start_torchserve(
model_store=None,
snapshot_file=None,
no_config_snapshots=False,
gen_mar=True,
plugin_folder=None,
disable_token=True,
models=None,
model_api_enabled=True,
):
stop_torchserve()
crate_mar_file_table()
cmd = ["torchserve", "--start"]
model_store = model_store if model_store else MODEL_STORE
if gen_mar:
mg.gen_mar(model_store)
cmd.extend(["--model-store", model_store])
if plugin_folder:
cmd.extend(["--plugins-path", plugin_folder])
if snapshot_file:
cmd.extend(["--ts-config", snapshot_file])
if no_config_snapshots:
cmd.extend(["--no-config-snapshots"])
if disable_token:
cmd.append("--disable-token")
if models:
cmd.extend(["--models", models])
if model_api_enabled:
cmd.extend(["--model-api-enabled"])
print(cmd)

p = Popen(cmd, stdin=PIPE, stdout=PIPE, stderr=STDOUT)
for line in p.stdout:
print(line.decode("utf8").strip())
if "Model server started" in str(line).strip():
break

splitter = Tee(TextIOWrapper(p.stdout))
splitter.start()
print_thread = PrintTillTheEnd(splitter.queue1)
print_thread.start()

return splitter.queue2


def stop_torchserve():
subprocess.run(["torchserve", "--stop", "--foreground"])


def delete_all_snapshots():
for f in glob.glob("logs/config/*"):
os.remove(f)
@@ -115,27 +37,26 @@ def delete_model_store(model_store=None):
os.remove(f)


def start_torchserve(*args, **kwargs):
create_mar_file_table()
# In case someone uses model_store as positional argument
if len(args) == 0:
kwargs.update({"model_store": kwargs.get("model_store", MODEL_STORE)})
if kwargs.get("gen_mar", True):
mg.gen_mar(kwargs.get("model_store"))
if "gen_mar" in kwargs:
del kwargs["gen_mar"]
kwargs.update({"disable_token": kwargs.get("disable_token", True)})
kwargs.update({"model_api_enabled": kwargs.get("model_api_enabled", True)})
return start(*args, **kwargs)


def torchserve_cleanup():
stop_torchserve()
delete_model_store()
delete_all_snapshots()


def register_model(model_name, url):
params = (
("model_name", model_name),
("url", url),
("initial_workers", "1"),
("synchronous", "true"),
)
return register_model_with_params(params)


def register_model_with_params(params):
response = requests.post("http://localhost:8081/models", params=params)
return response


def unregister_model(model_name):
response = requests.delete("http://localhost:8081/models/{}".format(model_name))
return response
@@ -163,7 +84,7 @@ def delete_mar_file_from_model_store(model_store=None, model_mar=None):
mar_file_table = {}


def crate_mar_file_table():
def create_mar_file_table():
if not mar_file_table:
with open(
os.path.join(os.path.dirname(__file__), *environment_json.split("/")), "rb"