SeldonIO · RobertSamoilescu · Jun 5, 2024 · Jun 3, 2024 · Jun 3, 2024 · Jun 3, 2024
diff --git a/docs/examples/index.md b/docs/examples/index.md
@@ -49,6 +49,7 @@ serving), check out the examples below.
 - [Custom Conda environment](./conda/README.md)
 - [Serving custom models requiring JSON inputs or outputs](./custom-json/README.md)
 - [Serving models through Kafka](./kafka/README.md)
+- [Streaming inference](./streaming/README.md)
 
 ```{toctree}
 :caption: MLServer Features
@@ -61,6 +62,7 @@ serving), check out the examples below.
 ./conda/README.md
 ./custom-json/README.md
 ./kafka/README.md
+./streaming/README.md
 ```
 
 ## Tutorials

diff --git a/docs/examples/streaming/README.ipynb b/docs/examples/streaming/README.ipynb
@@ -0,0 +1,369 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Streaming support\n",
+    "\n",
+    "The `mlserver` package comes with built-in support for streaming data. This allows you to process data in real-time, without having to wait for the entire response to be available. It supports both REST and gRPC APIs."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Overview\n",
+    "\n",
+    "In this example, we create a simple `Identity Text Model` which simply splits the input text into words and returns them one by one. We will use this model to demonstrate how to stream the response from the server to the client. This particular example can provide a good starting point for building more complex streaming models such as the ones based on Large Language Models (LLMs) where streaming is an essential feature to hide the latency of the model."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Serving\n",
+    "\n",
+    "The next step will be to serve our model using `mlserver`. For that, we will first implement an extension that serves as the runtime to perform inference using our custom `TextModel`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Custom inference runtime\n",
+    "\n",
+    "This is a trivial model to demonstrate streaming support. The model simply splits the input text into words and returns them one by one. In this example we do the following:\n",
+    "\n",
+    "- split the text into words using the white space as the delimiter.\n",
+    "- wait 0.5 seconds between each word to simulate a slow model.\n",
+    "- return each word one by one."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Overwriting text_model.py\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%writefile text_model.py\n",
+    "\n",
+    "import asyncio\n",
+    "from typing import AsyncIterator\n",
+    "from mlserver import MLModel\n",
+    "from mlserver.types import InferenceRequest, InferenceResponse\n",
+    "from mlserver.codecs import StringCodec\n",
+    "\n",
+    "\n",
+    "class TextModel(MLModel):\n",
+    "\n",
+    "    async def predict_stream(\n",
+    "        self, payloads: AsyncIterator[InferenceRequest]\n",
+    "    ) -> AsyncIterator[InferenceResponse]:\n",
+    "        payload = [_ async for _ in payloads][0]\n",
+    "        text = StringCodec.decode_input(payload.inputs[0])[0]\n",
+    "        words = text.split(\" \")\n",
+    "\n",
+    "        split_text = []\n",
+    "        for i, word in enumerate(words):\n",
+    "            split_text.append(word if i == 0 else \" \" + word)\n",
+    "\n",
+    "        for word in split_text:\n",
+    "            await asyncio.sleep(0.5)\n",
+    "            yield InferenceResponse(\n",
+    "                model_name=self._settings.name,\n",
+    "                outputs=[\n",
+    "                    StringCodec.encode_output(\n",
+    "                        name=\"output\",\n",
+    "                        payload=[word],\n",
+    "                        use_bytes=True,\n",
+    "                    ),\n",
+    "                ],\n",
+    "            )\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As it can be seen, the `predict_stream` method receives as an input an `AsyncIterator` of `InferenceRequest` and returns an `AsyncIterator` of `InferenceResponse`. This definition covers all types of possible input-output combinations for streaming: unary-stream, stream-unary, stream-stream. It is up to the client and server to send/receive the appropriate number of requests/responses which should be known apriori.\n",
+    "\n",
+    "Note that although unary-unary can be covered by `predict_stream` method as well, `mlserver` already covers that through the `predict` method.\n",
+    "\n",
+    "One important limitation to keep in mind is that for the REST API, the client will not be able to send a stream of requests. The client will have to send a single request with the entire input text. The server will then stream the response back to the client. gRPC API, on the other hand, supports all types of streaming listed above."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Settings file\n",
+    "\n",
+    "The next step will be to create 2 configuration files:\n",
+    "- `settings.json`: holds the configuration of our server (e.g. ports, log level, etc.).\n",
+    "- `model-settings.json`: holds the configuration of our model (e.g. input type, runtime to use, etc.)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### settings.json"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Overwriting settings.json\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%writefile settings.json\n",
+    "\n",
+    "{\n",
+    "  \"debug\": false,\n",
+    "  \"parallel_workers\": 0,\n",
+    "  \"gzip_enabled\": false,\n",
+    "  \"metrics_endpoint\": null\n",
+    "}\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note the currently there are three main limitations of the streaming support in MLServer:\n",
+    "\n",
+    "- distributed workers are not supported (i.e., the `parallel_workers` setting should be set to `0`)\n",
+    "- `gzip` middleware is not supported for REST (i.e., `gzip_enabled` setting should be set to `false`)\n",
+    "- metrics endpoint is not available (i.e. `metrics_endpoint` is also disabled for streaming for gRPC)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### model-settings.json"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Overwriting model-settings.json\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%writefile model-settings.json\n",
+    "\n",
+    "{\n",
+    "  \"name\": \"text-model\",\n",
+    "\n",
+    "  \"implementation\": \"text_model.TextModel\",\n",
+    "  \n",
+    "  \"versions\": [\"text-model/v1.2.3\"],\n",
+    "  \"platform\": \"mlserver\",\n",
+    "  \"inputs\": [\n",
+    "    {\n",
+    "      \"datatype\": \"BYTES\",\n",
+    "      \"name\": \"prompt\",\n",
+    "      \"shape\": [1]\n",
+    "    }\n",
+    "  ],\n",
+    "  \"outputs\": [\n",
+    "    {\n",
+    "      \"datatype\": \"BYTES\",\n",
+    "      \"name\": \"output\",\n",
+    "      \"shape\": [1]\n",
+    "    }\n",
+    "  ]\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Start serving the model\n",
+    "\n",
+    "Now that we have our config in-place, we can start the server by running `mlserver start .`. This needs to either be run from the same directory where our config files are or point to the folder where they are.\n",
+    "\n",
+    "```bash\n",
+    "mlserver start .\n",
+    "```\n",
+    "\n",
+    "Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Inference request\n",
+    "\n",
+    "To test our model, we will use the following inference request:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Writing generate-request.json\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%writefile generate-request.json\n",
+    "\n",
+    "{\n",
+    "    \"inputs\": [\n",
+    "        {\n",
+    "            \"name\": \"prompt\",\n",
+    "            \"shape\": [1],\n",
+    "            \"datatype\": \"BYTES\",\n",
+    "            \"data\": [\"What is the capital of France?\"],\n",
+    "            \"parameters\": {\n",
+    "            \"content_type\": \"str\"\n",
+    "            }\n",
+    "        }\n",
+    "    ],\n",
+    "    \"outputs\": [\n",
+    "        {\n",
+    "          \"name\": \"output\"\n",
+    "        }\n",
+    "    ]\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Send test generate stream request (REST)\n",
+    "\n",
+    "To send a REST streaming request to the server, we will use the following Python code:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import httpx\n",
+    "from httpx_sse import connect_sse\n",
+    "from mlserver import types\n",
+    "from mlserver.codecs import StringCodec\n",
+    "\n",
+    "inference_request = types.InferenceRequest.parse_file(\"./generate-request.json\")\n",
+    "\n",
+    "with httpx.Client() as client:\n",
+    "    with connect_sse(client, \"POST\", \"http://localhost:8080/v2/models/text-model/generate_stream\", json=inference_request.dict()) as event_source:\n",
+    "        for sse in event_source.iter_sse():\n",
+    "            response = types.InferenceResponse.parse_raw(sse.data)\n",
+    "            print(StringCodec.decode_output(response.outputs[0]))\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Send test generate stream request (gRPC)\n",
+    "\n",
+    "To send a gRPC streaming request to the server, we will use the following Python code:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import grpc\n",
+    "import mlserver.types as types\n",
+    "from mlserver.codecs import StringCodec\n",
+    "from mlserver.grpc.converters import ModelInferResponseConverter\n",
+    "import mlserver.grpc.converters as converters\n",
+    "import mlserver.grpc.dataplane_pb2_grpc as dataplane\n",
+    "\n",
+    "inference_request = types.InferenceRequest.parse_file(\"./generate-request.json\")\n",
+    "\n",
+    "# need to convert from string to bytes for grpc\n",
+    "inference_request.inputs[0] = StringCodec.encode_input(\"prompt\", inference_request.inputs[0].data.__root__)\n",
+    "inference_request_g = converters.ModelInferRequestConverter.from_types(\n",
+    "    inference_request, model_name=\"text-model\", model_version=None\n",
+    ")\n",
+    "\n",
+    "async def get_inference_request_stream(inference_request):\n",
+    "    yield inference_request\n",
+    "\n",
+    "async with grpc.aio.insecure_channel(\"localhost:8081\") as grpc_channel:\n",
+    "    grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)\n",
+    "    inference_request_stream = get_inference_request_stream(inference_request_g)\n",
+    "    \n",
+    "    async for response in grpc_stub.ModelStreamInfer(inference_request_stream):\n",
+    "        response = ModelInferResponseConverter.to_types(response)\n",
+    "        print(StringCodec.decode_output(response.outputs[0]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that for gRPC, the request is transformed into an async generator which is then passed to the `ModelStreamInfer` method. The response is also an async generator which can be iterated over to get the response."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "python3.10",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.14"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}