diff --git a/docs/examples/index.md b/docs/examples/index.md index ea4d75000..a818ff76f 100644 --- a/docs/examples/index.md +++ b/docs/examples/index.md @@ -49,6 +49,7 @@ serving), check out the examples below. - [Custom Conda environment](./conda/README.md) - [Serving custom models requiring JSON inputs or outputs](./custom-json/README.md) - [Serving models through Kafka](./kafka/README.md) +- [Streaming inference](./streaming/README.md) ```{toctree} :caption: MLServer Features @@ -61,6 +62,7 @@ serving), check out the examples below. ./conda/README.md ./custom-json/README.md ./kafka/README.md +./streaming/README.md ``` ## Tutorials diff --git a/docs/examples/streaming/README.ipynb b/docs/examples/streaming/README.ipynb new file mode 100644 index 000000000..025246237 --- /dev/null +++ b/docs/examples/streaming/README.ipynb @@ -0,0 +1,369 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Streaming support\n", + "\n", + "The `mlserver` package comes with built-in support for streaming data. This allows you to process data in real-time, without having to wait for the entire response to be available. It supports both REST and gRPC APIs." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Overview\n", + "\n", + "In this example, we create a simple `Identity Text Model` which simply splits the input text into words and returns them one by one. We will use this model to demonstrate how to stream the response from the server to the client. This particular example can provide a good starting point for building more complex streaming models such as the ones based on Large Language Models (LLMs) where streaming is an essential feature to hide the latency of the model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Serving\n", + "\n", + "The next step will be to serve our model using `mlserver`. For that, we will first implement an extension that serves as the runtime to perform inference using our custom `TextModel`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Custom inference runtime\n", + "\n", + "This is a trivial model to demonstrate streaming support. The model simply splits the input text into words and returns them one by one. In this example we do the following:\n", + "\n", + "- split the text into words using the white space as the delimiter.\n", + "- wait 0.5 seconds between each word to simulate a slow model.\n", + "- return each word one by one." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Overwriting text_model.py\n" + ] + } + ], + "source": [ + "%%writefile text_model.py\n", + "\n", + "import asyncio\n", + "from typing import AsyncIterator\n", + "from mlserver import MLModel\n", + "from mlserver.types import InferenceRequest, InferenceResponse\n", + "from mlserver.codecs import StringCodec\n", + "\n", + "\n", + "class TextModel(MLModel):\n", + "\n", + " async def predict_stream(\n", + " self, payloads: AsyncIterator[InferenceRequest]\n", + " ) -> AsyncIterator[InferenceResponse]:\n", + " payload = [_ async for _ in payloads][0]\n", + " text = StringCodec.decode_input(payload.inputs[0])[0]\n", + " words = text.split(\" \")\n", + "\n", + " split_text = []\n", + " for i, word in enumerate(words):\n", + " split_text.append(word if i == 0 else \" \" + word)\n", + "\n", + " for word in split_text:\n", + " await asyncio.sleep(0.5)\n", + " yield InferenceResponse(\n", + " model_name=self._settings.name,\n", + " outputs=[\n", + " StringCodec.encode_output(\n", + " name=\"output\",\n", + " payload=[word],\n", + " use_bytes=True,\n", + " ),\n", + " ],\n", + " )\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As it can be seen, the `predict_stream` method receives as an input an `AsyncIterator` of `InferenceRequest` and returns an `AsyncIterator` of `InferenceResponse`. This definition covers all types of possible input-output combinations for streaming: unary-stream, stream-unary, stream-stream. It is up to the client and server to send/receive the appropriate number of requests/responses which should be known apriori.\n", + "\n", + "Note that although unary-unary can be covered by `predict_stream` method as well, `mlserver` already covers that through the `predict` method.\n", + "\n", + "One important limitation to keep in mind is that for the REST API, the client will not be able to send a stream of requests. The client will have to send a single request with the entire input text. The server will then stream the response back to the client. gRPC API, on the other hand, supports all types of streaming listed above." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Settings file\n", + "\n", + "The next step will be to create 2 configuration files:\n", + "- `settings.json`: holds the configuration of our server (e.g. ports, log level, etc.).\n", + "- `model-settings.json`: holds the configuration of our model (e.g. input type, runtime to use, etc.)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### settings.json" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Overwriting settings.json\n" + ] + } + ], + "source": [ + "%%writefile settings.json\n", + "\n", + "{\n", + " \"debug\": false,\n", + " \"parallel_workers\": 0,\n", + " \"gzip_enabled\": false,\n", + " \"metrics_endpoint\": null\n", + "}\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note the currently there are three main limitations of the streaming support in MLServer:\n", + "\n", + "- distributed workers are not supported (i.e., the `parallel_workers` setting should be set to `0`)\n", + "- `gzip` middleware is not supported for REST (i.e., `gzip_enabled` setting should be set to `false`)\n", + "- metrics endpoint is not available (i.e. 
`metrics_endpoint` is also disabled for streaming for gRPC)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### model-settings.json" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Overwriting model-settings.json\n" + ] + } + ], + "source": [ + "%%writefile model-settings.json\n", + "\n", + "{\n", + " \"name\": \"text-model\",\n", + "\n", + " \"implementation\": \"text_model.TextModel\",\n", + " \n", + " \"versions\": [\"text-model/v1.2.3\"],\n", + " \"platform\": \"mlserver\",\n", + " \"inputs\": [\n", + " {\n", + " \"datatype\": \"BYTES\",\n", + " \"name\": \"prompt\",\n", + " \"shape\": [1]\n", + " }\n", + " ],\n", + " \"outputs\": [\n", + " {\n", + " \"datatype\": \"BYTES\",\n", + " \"name\": \"output\",\n", + " \"shape\": [1]\n", + " }\n", + " ]\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Start serving the model\n", + "\n", + "Now that we have our config in-place, we can start the server by running `mlserver start .`. This needs to either be run from the same directory where our config files are or point to the folder where they are.\n", + "\n", + "```bash\n", + "mlserver start .\n", + "```\n", + "\n", + "Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Inference request\n", + "\n", + "To test our model, we will use the following inference request:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Writing generate-request.json\n" + ] + } + ], + "source": [ + "%%writefile generate-request.json\n", + "\n", + "{\n", + " \"inputs\": [\n", + " {\n", + " \"name\": \"prompt\",\n", + " \"shape\": [1],\n", + " \"datatype\": \"BYTES\",\n", + " \"data\": [\"What is the capital of France?\"],\n", + " \"parameters\": {\n", + " \"content_type\": \"str\"\n", + " }\n", + " }\n", + " ],\n", + " \"outputs\": [\n", + " {\n", + " \"name\": \"output\"\n", + " }\n", + " ]\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Send test generate stream request (REST)\n", + "\n", + "To send a REST streaming request to the server, we will use the following Python code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import httpx\n", + "from httpx_sse import connect_sse\n", + "from mlserver import types\n", + "from mlserver.codecs import StringCodec\n", + "\n", + "inference_request = types.InferenceRequest.parse_file(\"./generate-request.json\")\n", + "\n", + "with httpx.Client() as client:\n", + " with connect_sse(client, \"POST\", \"http://localhost:8080/v2/models/text-model/generate_stream\", json=inference_request.dict()) as event_source:\n", + " for sse in event_source.iter_sse():\n", + " response = types.InferenceResponse.parse_raw(sse.data)\n", + " print(StringCodec.decode_output(response.outputs[0]))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Send test generate stream request (gRPC)\n", + "\n", + "To send a gRPC streaming request to the server, we will use the following Python code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": 
[ + "import grpc\n", + "import mlserver.types as types\n", + "from mlserver.codecs import StringCodec\n", + "from mlserver.grpc.converters import ModelInferResponseConverter\n", + "import mlserver.grpc.converters as converters\n", + "import mlserver.grpc.dataplane_pb2_grpc as dataplane\n", + "\n", + "inference_request = types.InferenceRequest.parse_file(\"./generate-request.json\")\n", + "\n", + "# need to convert from string to bytes for grpc\n", + "inference_request.inputs[0] = StringCodec.encode_input(\"prompt\", inference_request.inputs[0].data.__root__)\n", + "inference_request_g = converters.ModelInferRequestConverter.from_types(\n", + " inference_request, model_name=\"text-model\", model_version=None\n", + ")\n", + "\n", + "async def get_inference_request_stream(inference_request):\n", + " yield inference_request\n", + "\n", + "async with grpc.aio.insecure_channel(\"localhost:8081\") as grpc_channel:\n", + " grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)\n", + " inference_request_stream = get_inference_request_stream(inference_request_g)\n", + " \n", + " async for response in grpc_stub.ModelStreamInfer(inference_request_stream):\n", + " response = ModelInferResponseConverter.to_types(response)\n", + " print(StringCodec.decode_output(response.outputs[0]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that for gRPC, the request is transformed into an async generator which is then passed to the `ModelStreamInfer` method. The response is also an async generator which can be iterated over to get the response." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "python3.10", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.14" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs/examples/streaming/README.md b/docs/examples/streaming/README.md new file mode 100644 index 000000000..7acdf2090 --- /dev/null +++ b/docs/examples/streaming/README.md @@ -0,0 +1,217 @@ +# Streaming support + +The `mlserver` package comes with built-in support for streaming data. This allows you to process data in real-time, without having to wait for the entire response to be available. It supports both REST and gRPC APIs. + +## Overview + +In this example, we create a simple `Identity Text Model` which simply splits the input text into words and returns them one by one. We will use this model to demonstrate how to stream the response from the server to the client. This particular example can provide a good starting point for building more complex streaming models such as the ones based on Large Language Models (LLMs) where streaming is an essential feature to hide the latency of the model. + +## Serving + +The next step will be to serve our model using `mlserver`. For that, we will first implement an extension that serves as the runtime to perform inference using our custom `TextModel`. + +### Custom inference runtime + +This is a trivial model to demonstrate streaming support. The model simply splits the input text into words and returns them one by one. In this example we do the following: + +- split the text into words using the white space as the delimiter. +- wait 0.5 seconds between each word to simulate a slow model. 
+- return each word one by one.
+
+
+```python
+%%writefile text_model.py
+
+import asyncio
+from typing import AsyncIterator
+from mlserver import MLModel
+from mlserver.types import InferenceRequest, InferenceResponse
+from mlserver.codecs import StringCodec
+
+
+class TextModel(MLModel):
+
+    async def predict_stream(
+        self, payloads: AsyncIterator[InferenceRequest]
+    ) -> AsyncIterator[InferenceResponse]:
+        payload = [_ async for _ in payloads][0]
+        text = StringCodec.decode_input(payload.inputs[0])[0]
+        words = text.split(" ")
+
+        split_text = []
+        for i, word in enumerate(words):
+            split_text.append(word if i == 0 else " " + word)
+
+        for word in split_text:
+            await asyncio.sleep(0.5)
+            yield InferenceResponse(
+                model_name=self._settings.name,
+                outputs=[
+                    StringCodec.encode_output(
+                        name="output",
+                        payload=[word],
+                        use_bytes=True,
+                    ),
+                ],
+            )
+
+```
+
+As can be seen, the `predict_stream` method receives an `AsyncIterator` of `InferenceRequest` as input and returns an `AsyncIterator` of `InferenceResponse`. This definition covers all possible input-output combinations for streaming: unary-stream, stream-unary, and stream-stream. It is up to the client and server to send/receive the appropriate number of requests/responses, which should be known a priori.
+
+Note that although unary-unary can be covered by the `predict_stream` method as well, `mlserver` already covers that through the `predict` method.
+
+One important limitation to keep in mind is that for the REST API, the client will not be able to send a stream of requests. The client will have to send a single request with the entire input text, and the server will then stream the response back to the client. The gRPC API, on the other hand, supports all the types of streaming listed above.
+
+### Settings file
+
+The next step will be to create two configuration files:
+- `settings.json`: holds the configuration of our server (e.g. ports, log level, etc.).
+- `model-settings.json`: holds the configuration of our model (e.g. input type, runtime to use, etc.).
+
+#### settings.json
+
+
+```python
+%%writefile settings.json
+
+{
+    "debug": false,
+    "parallel_workers": 0,
+    "gzip_enabled": false,
+    "metrics_endpoint": null
+}
+
+```
+
+Note that there are currently three main limitations of the streaming support in MLServer:
+
+- distributed workers are not supported (i.e., the `parallel_workers` setting should be set to `0`)
+- `gzip` middleware is not supported for REST (i.e., the `gzip_enabled` setting should be set to `false`)
+- the metrics endpoint is not available for streaming (i.e., the `metrics_endpoint` setting should be set to `null`)
+
+#### model-settings.json
+
+
+```python
+%%writefile model-settings.json
+
+{
+    "name": "text-model",
+
+    "implementation": "text_model.TextModel",
+
+    "versions": ["text-model/v1.2.3"],
+    "platform": "mlserver",
+    "inputs": [
+        {
+            "datatype": "BYTES",
+            "name": "prompt",
+            "shape": [1]
+        }
+    ],
+    "outputs": [
+        {
+            "datatype": "BYTES",
+            "name": "output",
+            "shape": [1]
+        }
+    ]
+}
+```
+
+#### Start serving the model
+
+Now that we have our config in place, we can start the server by running `mlserver start .`. This command needs to be run either from the same directory where our config files are, or pointing to the folder where they are.
+
+```bash
+mlserver start .
+```
+
+Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background or in a separate terminal.
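+
+For example, a minimal sketch of launching it in the background from a single terminal (the `mlserver.log` file name here is just an arbitrary choice):
+
+```bash
+# start MLServer in the background and redirect its output to a log file
+mlserver start . > mlserver.log 2>&1 &
+```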
+ +#### Inference request + +To test our model, we will use the following inference request: + + +```python +%%writefile generate-request.json + +{ + "inputs": [ + { + "name": "prompt", + "shape": [1], + "datatype": "BYTES", + "data": ["What is the capital of France?"], + "parameters": { + "content_type": "str" + } + } + ], + "outputs": [ + { + "name": "output" + } + ] +} +``` + +### Send test generate stream request (REST) + +To send a REST streaming request to the server, we will use the following Python code: + + +```python +import httpx +from httpx_sse import connect_sse +from mlserver import types +from mlserver.codecs import StringCodec + +inference_request = types.InferenceRequest.parse_file("./generate-request.json") + +with httpx.Client() as client: + with connect_sse(client, "POST", "http://localhost:8080/v2/models/text-model/generate_stream", json=inference_request.dict()) as event_source: + for sse in event_source.iter_sse(): + response = types.InferenceResponse.parse_raw(sse.data) + print(StringCodec.decode_output(response.outputs[0])) + +``` + +### Send test generate stream request (gRPC) + +To send a gRPC streaming request to the server, we will use the following Python code: + + +```python +import grpc +import mlserver.types as types +from mlserver.codecs import StringCodec +from mlserver.grpc.converters import ModelInferResponseConverter +import mlserver.grpc.converters as converters +import mlserver.grpc.dataplane_pb2_grpc as dataplane + +inference_request = types.InferenceRequest.parse_file("./generate-request.json") + +# need to convert from string to bytes for grpc +inference_request.inputs[0] = StringCodec.encode_input("prompt", inference_request.inputs[0].data.__root__) +inference_request_g = converters.ModelInferRequestConverter.from_types( + inference_request, model_name="text-model", model_version=None +) + +async def get_inference_request_stream(inference_request): + yield inference_request + +async with grpc.aio.insecure_channel("localhost:8081") as grpc_channel: + grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel) + inference_request_stream = get_inference_request_stream(inference_request_g) + + async for response in grpc_stub.ModelStreamInfer(inference_request_stream): + response = ModelInferResponseConverter.to_types(response) + print(StringCodec.decode_output(response.outputs[0])) +``` + +Note that for gRPC, the request is transformed into an async generator which is then passed to the `ModelStreamInfer` method. The response is also an async generator which can be iterated over to get the response. 
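+
+Also note that the snippet above relies on top-level `await` and `async for`, which works when executed from a notebook. To run the same logic as a standalone script, one option (a minimal sketch under the same assumptions: the model is served over gRPC on `localhost:8081` and `generate-request.json` is present) is to wrap the calls in a coroutine and drive it with `asyncio.run`:
+
+```python
+import asyncio
+
+import grpc
+import mlserver.grpc.converters as converters
+import mlserver.grpc.dataplane_pb2_grpc as dataplane
+import mlserver.types as types
+from mlserver.codecs import StringCodec
+from mlserver.grpc.converters import ModelInferResponseConverter
+
+
+async def main():
+    inference_request = types.InferenceRequest.parse_file("./generate-request.json")
+
+    # need to convert from string to bytes for grpc
+    inference_request.inputs[0] = StringCodec.encode_input(
+        "prompt", inference_request.inputs[0].data.__root__
+    )
+    inference_request_g = converters.ModelInferRequestConverter.from_types(
+        inference_request, model_name="text-model", model_version=None
+    )
+
+    async def get_inference_request_stream(inference_request):
+        # the gRPC stub expects an (async) iterator of requests
+        yield inference_request
+
+    async with grpc.aio.insecure_channel("localhost:8081") as grpc_channel:
+        grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)
+        inference_request_stream = get_inference_request_stream(inference_request_g)
+
+        async for response in grpc_stub.ModelStreamInfer(inference_request_stream):
+            response = ModelInferResponseConverter.to_types(response)
+            print(StringCodec.decode_output(response.outputs[0]))
+
+
+asyncio.run(main())
+```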
+ + diff --git a/docs/examples/streaming/generate-request.json b/docs/examples/streaming/generate-request.json new file mode 100644 index 000000000..a61935f08 --- /dev/null +++ b/docs/examples/streaming/generate-request.json @@ -0,0 +1,19 @@ + +{ + "inputs": [ + { + "name": "prompt", + "shape": [1], + "datatype": "BYTES", + "data": ["What is the capital of France?"], + "parameters": { + "content_type": "str" + } + } + ], + "outputs": [ + { + "name": "output" + } + ] +} diff --git a/docs/examples/streaming/model-settings.json b/docs/examples/streaming/model-settings.json new file mode 100644 index 000000000..caf8a6ad3 --- /dev/null +++ b/docs/examples/streaming/model-settings.json @@ -0,0 +1,23 @@ + +{ + "name": "text-model", + + "implementation": "text_model.TextModel", + + "versions": ["text-model/v1.2.3"], + "platform": "mlserver", + "inputs": [ + { + "datatype": "BYTES", + "name": "prompt", + "shape": [1] + } + ], + "outputs": [ + { + "datatype": "BYTES", + "name": "output", + "shape": [1] + } + ] +} diff --git a/docs/examples/streaming/settings.json b/docs/examples/streaming/settings.json new file mode 100644 index 000000000..ec853b3ba --- /dev/null +++ b/docs/examples/streaming/settings.json @@ -0,0 +1,7 @@ + +{ + "debug": false, + "parallel_workers": 0, + "gzip_enabled": false, + "metrics_endpoint": null +} diff --git a/docs/examples/streaming/text_model.py b/docs/examples/streaming/text_model.py new file mode 100644 index 000000000..4475b3c92 --- /dev/null +++ b/docs/examples/streaming/text_model.py @@ -0,0 +1,45 @@ +import asyncio +from typing import AsyncIterator +from mlserver import MLModel +from mlserver.types import InferenceRequest, InferenceResponse +from mlserver.codecs import StringCodec + + +class TextModel(MLModel): + + async def predict(self, payload: InferenceRequest) -> InferenceResponse: + text = StringCodec.decode_input(payload.inputs[0])[0] + return InferenceResponse( + model_name=self._settings.name, + outputs=[ + StringCodec.encode_output( + name="output", + payload=[text], + use_bytes=True, + ), + ], + ) + + async def predict_stream( + self, payloads: AsyncIterator[InferenceRequest] + ) -> AsyncIterator[InferenceResponse]: + payload = [_ async for _ in payloads][0] + text = StringCodec.decode_input(payload.inputs[0])[0] + words = text.split(" ") + + split_text = [] + for i, word in enumerate(words): + split_text.append(word if i == 0 else " " + word) + + for word in split_text: + await asyncio.sleep(0.5) + yield InferenceResponse( + model_name=self._settings.name, + outputs=[ + StringCodec.encode_output( + name="output", + payload=[word], + use_bytes=True, + ), + ], + ) diff --git a/docs/user-guide/index.md b/docs/user-guide/index.md index c9e1b65cb..cf5dee500 100644 --- a/docs/user-guide/index.md +++ b/docs/user-guide/index.md @@ -13,4 +13,5 @@ how to use them. ./custom ./metrics ./deployment/index +./streaming ``` diff --git a/docs/user-guide/streaming.md b/docs/user-guide/streaming.md new file mode 100644 index 000000000..41dec0b03 --- /dev/null +++ b/docs/user-guide/streaming.md @@ -0,0 +1,35 @@ +# Streaming + +Out of the box, MLServer includes support for streaming data to your models. Streaming support is available for both the REST and gRPC servers. + + +## REST Server + +Streaming support for the REST server is limited only to server streaming. This means that the client sends a single request to the server, and the server responds with a stream of data. 
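+
+In practice, the response stream is delivered as Server-Sent Events (SSE), so it can be consumed with any SSE-capable HTTP client. As a minimal sketch (borrowing the client pattern from the streaming example, and assuming a model named `text-model` served locally with a request payload stored in `generate-request.json`), the stream returned by the `generate_stream` endpoint listed below can be read with `httpx-sse`:
+
+```python
+import httpx
+from httpx_sse import connect_sse
+from mlserver import types
+from mlserver.codecs import StringCodec
+
+inference_request = types.InferenceRequest.parse_file("./generate-request.json")
+
+with httpx.Client() as client:
+    # each SSE event carries one InferenceResponse from the stream
+    with connect_sse(
+        client,
+        "POST",
+        "http://localhost:8080/v2/models/text-model/generate_stream",
+        json=inference_request.dict(),
+    ) as event_source:
+        for sse in event_source.iter_sse():
+            response = types.InferenceResponse.parse_raw(sse.data)
+            print(StringCodec.decode_output(response.outputs[0]))
+```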
+ +The streaming endpoints are available for both the `infer` and `generate` methods through the following endpoints: + +- `/v2/models/{model_name}/versions/{model_version}/infer_stream` +- `/v2/models/{model_name}/infer_stream` +- `/v2/models/{model_name}/versions/{model_version}/generate_stream` +- `/v2/models/{model_name}/generate_stream` + +Note that for REST, the `generate` and `generate_stream` endpoints are aliases for the `infer` and `infer_stream` endpoints, respectively. Those names are used to better reflect the nature of the operation for Large Language Models (LLMs). + + +## gRPC Server + +Streaming support for the gRPC server is available for both client and server streaming. This means that the client sends a stream of data to the server, and the server responds with a stream of data. + +The two streams operate independently, so the client and the server can read and write data however they want (e.g., the server could either wait to receive all the client messages before sending a response or it can send a response after each message). Note that bi-directional streaming covers all the possible combinations of client and server streaming: unary-stream, stream-unary, and stream-stream. The unary-unary case can be covered as well by the bi-directional streaming, but `mlserver` already has the `predict` method dedicated to this use case. The logic for how the requests are received, and processed, and the responses are sent back should be built into the runtime logic. + +The stub method for streaming to be used by the client is `ModelStreamInfer`. + + +## Limitation + +There are three main limitations of the streaming support in MLServer: + +- the `parallel_workers` setting should be set to `0` to disable distributed workers (to be addressed in future releases) +- for REST, the `gzip_enabled` setting should be set to `false` to disable GZIP compression, as streaming is not compatible with GZIP compression (see issue [here]( https://github.com/encode/starlette/issues/20#issuecomment-704106436)) +- `metrics_endpoint` is also disabled for streaming for gRPC (to be addressed in future releases) \ No newline at end of file diff --git a/mlserver/rest/openapi/dataplane.json b/mlserver/rest/openapi/dataplane.json index 2f02f0938..268e36ff4 100644 --- a/mlserver/rest/openapi/dataplane.json +++ b/mlserver/rest/openapi/dataplane.json @@ -385,9 +385,485 @@ "model" ] } + }, + "/v2/models/{model_name}/versions/{model_version}/infer_stream": { + "parameters": [ + { + "schema": { + "type": "string" + }, + "name": "model_name", + "in": "path", + "required": true + }, + { + "schema": { + "type": "string" + }, + "name": "model_version", + "in": "path", + "required": true + } + ], + "post": { + "summary": "Model Inference Stream", + "operationId": "model-version-inference-stream", + "responses": { + "200": { + "description": "OK", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceResponse" + } + } + } + }, + "400": { + "description": "Bad Request", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + }, + "404": { + "description": "Not Found", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + }, + "500": { + "description": "Internal Server Error", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + } + }, + "requestBody": { + "content": { + 
"application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceRequest" + } + } + } + }, + "description": "An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.", + "tags": [ + "inference", + "model" + ] + } + }, + "/v2/models/{model_name}/infer_stream": { + "parameters": [ + { + "schema": { + "type": "string" + }, + "name": "model_name", + "in": "path", + "required": true + } + ], + "post": { + "summary": "Model Inference Stream", + "operationId": "model-inference-stream", + "responses": { + "200": { + "description": "OK", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceResponse" + } + } + } + }, + "400": { + "description": "Bad Request", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + }, + "404": { + "description": "Not Found", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + }, + "500": { + "description": "Internal Server Error", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + } + }, + "requestBody": { + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceRequest" + } + } + } + }, + "description": "An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.", + "tags": [ + "inference", + "model" + ] + } + }, + "/v2/models/{model_name}/versions/{model_version}/generate": { + "parameters": [ + { + "schema": { + "type": "string" + }, + "name": "model_name", + "in": "path", + "required": true + }, + { + "schema": { + "type": "string" + }, + "name": "model_version", + "in": "path", + "required": true + } + ], + "post": { + "summary": "Model Generate", + "operationId": "model-version-generate", + "responses": { + "200": { + "description": "OK", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceResponse" + } + } + } + }, + "400": { + "description": "Bad Request", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + }, + "404": { + "description": "Not Found", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + }, + "500": { + "description": "Internal Server Error", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + } + }, + "requestBody": { + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceRequest" + } + } + } + }, + "description": "An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. 
If a version is not provided the server may choose a version based on its own policies or return an error.", + "tags": [ + "inference", + "model" + ] + } + }, + "/v2/models/{model_name}/generate": { + "parameters": [ + { + "schema": { + "type": "string" + }, + "name": "model_name", + "in": "path", + "required": true + } + ], + "post": { + "summary": "Model Generate", + "operationId": "model-generate", + "responses": { + "200": { + "description": "OK", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceResponse" + } + } + } + }, + "400": { + "description": "Bad Request", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + }, + "404": { + "description": "Not Found", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + }, + "500": { + "description": "Internal Server Error", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + } + }, + "requestBody": { + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceRequest" + } + } + } + }, + "description": "An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.", + "tags": [ + "inference", + "model" + ] + } + }, + "/v2/models/{model_name}/versions/{model_version}/generate_stream": { + "parameters": [ + { + "schema": { + "type": "string" + }, + "name": "model_name", + "in": "path", + "required": true + }, + { + "schema": { + "type": "string" + }, + "name": "model_version", + "in": "path", + "required": true + } + ], + "post": { + "summary": "Model Generate Stream", + "operationId": "model-version-generate-stream", + "responses": { + "200": { + "description": "OK", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceResponse" + } + } + } + }, + "400": { + "description": "Bad Request", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + }, + "404": { + "description": "Not Found", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + }, + "500": { + "description": "Internal Server Error", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + } + }, + "requestBody": { + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceRequest" + } + } + } + }, + "description": "An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. 
If a version is not provided the server may choose a version based on its own policies or return an error.", + "tags": [ + "inference", + "model" + ] + } + }, + "/v2/models/{model_name}/generate_stream": { + "parameters": [ + { + "schema": { + "type": "string" + }, + "name": "model_name", + "in": "path", + "required": true + } + ], + "post": { + "summary": "Model Generate Stream", + "operationId": "model-generate-stream", + "responses": { + "200": { + "description": "OK", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceResponse" + } + } + } + }, + "400": { + "description": "Bad Request", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + }, + "404": { + "description": "Not Found", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + }, + "500": { + "description": "Internal Server Error", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceErrorResponse" + } + } + } + } + }, + "requestBody": { + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/InferenceRequest" + } + } + } + }, + "description": "An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.", + "tags": [ + "inference", + "model" + ] + } } }, "components": { + "enums": { + "Datatype": { + "type": "string", + "enum": [ + "BOOL", + "UINT8", + "UINT16", + "UINT32", + "UINT64", + "INT8", + "INT16", + "INT32", + "INT64", + "FP16", + "FP32", + "FP64", + "BYTES" + ] + } + }, "schemas": { "MetadataServerResponse": { "title": "MetadataServerResponse", @@ -471,7 +947,7 @@ "type": "string" }, "datatype": { - "type": "string" + "$ref": "#/components/enums/Datatype" }, "shape": { "type": "array", @@ -613,7 +1089,7 @@ } }, "datatype": { - "type": "string" + "$ref": "#/components/enums/Datatype" }, "parameters": { "$ref": "#/components/schemas/Parameters" @@ -630,7 +1106,15 @@ ] }, "TensorData": { - "title": "TensorData" + "title": "TensorData", + "oneOf": [ + { + "type": "array" + }, + { + "type": "bytes" + } + ] }, "RequestOutput": { "title": "RequestOutput", @@ -661,7 +1145,7 @@ } }, "datatype": { - "type": "string" + "$ref": "#/components/enums/Datatype" }, "parameters": { "$ref": "#/components/schemas/Parameters" @@ -733,4 +1217,4 @@ "name": "server" } ] -} \ No newline at end of file +} diff --git a/openapi/dataplane.yaml b/openapi/dataplane.yaml index 8ce22b597..9417af0b8 100644 --- a/openapi/dataplane.yaml +++ b/openapi/dataplane.yaml @@ -245,6 +245,285 @@ paths: tags: - inference - model + '/v2/models/{model_name}/versions/{model_version}/infer_stream': + parameters: + - schema: + type: string + name: model_name + in: path + required: true + - schema: + type: string + name: model_version + in: path + required: true + post: + summary: Model Inference Stream + operationId: model-version-inference-stream + responses: + '200': + description: OK + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceResponse' + '400': + description: Bad Request + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + '404': + description: Not Found + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + 
'500': + description: Internal Server Error + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + requestBody: + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceRequest' + description: An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error. + tags: + - inference + - model + '/v2/models/{model_name}/infer_stream': + parameters: + - schema: + type: string + name: model_name + in: path + required: true + post: + summary: Model Inference Stream + operationId: model-inference-stream + responses: + '200': + description: OK + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceResponse' + '400': + description: Bad Request + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + '404': + description: Not Found + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + '500': + description: Internal Server Error + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + requestBody: + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceRequest' + description: An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error. + tags: + - inference + - model + '/v2/models/{model_name}/versions/{model_version}/generate': + parameters: + - schema: + type: string + name: model_name + in: path + required: true + - schema: + type: string + name: model_version + in: path + required: true + post: + summary: Model Generate + operationId: model-version-generate + responses: + '200': + description: OK + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceResponse' + '400': + description: Bad Request + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + '404': + description: Not Found + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + '500': + description: Internal Server Error + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + requestBody: + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceRequest' + description: An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error. 
+ tags: + - inference + - model + '/v2/models/{model_name}/generate': + parameters: + - schema: + type: string + name: model_name + in: path + required: true + post: + summary: Model Generate + operationId: model-generate + responses: + '200': + description: OK + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceResponse' + '400': + description: Bad Request + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + '404': + description: Not Found + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + '500': + description: Internal Server Error + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + requestBody: + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceRequest' + description: An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error. + tags: + - inference + - model + '/v2/models/{model_name}/versions/{model_version}/generate_stream': + parameters: + - schema: + type: string + name: model_name + in: path + required: true + - schema: + type: string + name: model_version + in: path + required: true + post: + summary: Model Generate Stream + operationId: model-version-generate-stream + responses: + '200': + description: OK + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceResponse' + '400': + description: Bad Request + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + '404': + description: Not Found + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + '500': + description: Internal Server Error + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + requestBody: + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceRequest' + description: An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error. + tags: + - inference + - model + '/v2/models/{model_name}/generate_stream': + parameters: + - schema: + type: string + name: model_name + in: path + required: true + post: + summary: Model Generate Stream + operationId: model-generate-stream + responses: + '200': + description: OK + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceResponse' + '400': + description: Bad Request + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + '404': + description: Not Found + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + '500': + description: Internal Server Error + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceErrorResponse' + requestBody: + content: + application/json: + schema: + $ref: '#/components/schemas/InferenceRequest' + description: An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error. 
+ tags: + - inference + - model components: enums: Datatype: