[server] Refactor + OpenAI Chat Completion Support #1288
Conversation
refactor server for different integrations; additional functionality for chat completion streaming and non streaming
looks great - especially feeling good that we were able to migrate the existing tests
* update server
* allow users to send requests with new models
* use v1; move around baseroutes
* add openai path
* PR comments
Nice, clean code! One small nit is that there are places where Google docstring format is used, whereas we prefer rst in the rest of our codebase; but it's not a blocker from my side!
@dsikka towards the point of OpenAI not having a `do_sample` argument, we should set …
```python
raise ValueError(
    f"{integration} is not a supported integration. Must be "
    "one of local, sagemkaer or openai."
)
```
nit: typo; also, maybe worth defining a list of supported integrations as a global constant? Then we can just print out the list of supported integrations instead of hardcoding it in the message.
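A minimal sketch of that suggestion (the constant and helper names here are assumed, not from the PR):

```python
# Hypothetical module-level constant listing the supported integrations.
SUPPORTED_INTEGRATIONS = ["local", "sagemaker", "openai"]

def validate_integration(integration: str) -> str:
    if integration not in SUPPORTED_INTEGRATIONS:
        raise ValueError(
            f"{integration} is not a supported integration. Must be "
            f"one of {SUPPORTED_INTEGRATIONS}."
        )
    return integration
```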
```python
OPENAI_TO_DEEPSPARSE_MAPPINGS = {
    "max_tokens": "max_length",
    "frequency_penalty": "repetition_penalty",
}
```
nit: is there a reason we don't just use the OpenAI namings?
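For context, a sketch of how such a mapping could be applied when translating OpenAI-style kwargs into pipeline kwargs (the helper name is assumed, not from the PR):

```python
OPENAI_TO_DEEPSPARSE_MAPPINGS = {
    "max_tokens": "max_length",
    "frequency_penalty": "repetition_penalty",
}

def to_pipeline_kwargs(request_kwargs: dict) -> dict:
    # Rename OpenAI parameters that have a deepsparse equivalent;
    # pass the rest through unchanged.
    return {
        OPENAI_TO_DEEPSPARSE_MAPPINGS.get(key, key): value
        for key, value in request_kwargs.items()
    }

assert to_pipeline_kwargs({"max_tokens": 64, "temperature": 0.7}) == {
    "max_length": 64,
    "temperature": 0.7,
}
```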
"""The output data of one completion output of a request. | ||
|
||
Args: | ||
index: The index of the output in the request. | ||
text: The generated output text. | ||
token_ids: The token IDs of the generated output text. | ||
cumulative_logprob: The cumulative log probability of the generated | ||
output text. | ||
logprobs: The log probabilities of the top probability words at each | ||
position if the logprobs are requested. | ||
finish_reason: The reason why the sequence is finished. |
Another nit, but this is a different docstring format than I've seen in other parts of our codebase. Would expect something like this:

```python
"""
blah blah blah

:param param1:
:param param2:
"""
```
```python
# For deepsparse endpoints, we bind the `predict` and `predict_from_files` functions
# to each of the added routes. As we require access to the pipeline to run inference
# on each request, instead of binding `predict`, we bind `partial(predict, pipeline, ...)`
# so that there is access to the pipeline when handling each request. However, fastapi
# has trouble with validating the pipeline type. As a workaround, we can wrap each
# pipeline as a `ProxyPipeline` which can be resolved by fastapi.
class ProxyPipeline:
    def __init__(self, pipeline: Pipeline):
        self.pipeline = pipeline
```
nice :)
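For context, a minimal sketch of the binding pattern that comment describes; the route, handler, and model stub are illustrative assumptions, not the PR's actual code (it reuses the `ProxyPipeline` class from the hunk above):

```python
from functools import partial

from deepsparse import Pipeline
from fastapi import FastAPI

app = FastAPI()

def predict(proxy_pipeline: ProxyPipeline, request: dict):
    # `proxy_pipeline` is pre-bound via functools.partial below, so the route
    # signature that fastapi validates only contains `request`.
    return proxy_pipeline.pipeline(**request)

pipeline = Pipeline.create(task="text_generation", model_path="zoo:...")  # stub elided
app.add_api_route("/predict", partial(predict, ProxyPipeline(pipeline)), methods=["POST"])
```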
* update/clean-up server to match mlserver docs
* update server tests
* add back ping
* [server] Refactor + OpenAI Chat Completion Support (#1288)
* refactor server for different integrations; additional functionality for chat completion streaming and non streaming
* further refactor server
* add support such that openai can host multiple models
* update all tests
* fix output for n > 1
* add inline comment explaining ProxyPipeline
* [server] Update OpenAI Model Support (#1300)
* update server
* allow users to send requests with new models
* use v1; move around baseroutes
* add openai path
* PR comments
* clean-up output classes to be dataclasses, add docstrings, cleanup generation kwargs
* update readme, update route cleaning, update docstring
* fix README for QA
* add openai doc
* update docs
* Update src/deepsparse/server/openai.md

Co-authored-by: Domenic Barbuzzi <domenic@neuralmagic.com>
Summary

Couple of notes about the current integration:

* The `/v1/chat/completion` endpoint is supported (and also now the `/v1/models` endpoint, as a result of merging in: [server] Update OpenAI Model Support #1300)
* `ChatCompletionRequest` is what is expected by the endpoints. Not all the properties supported by the pipeline are supported by the request and vice versa. For example, the request does not seem to have a way to set `do_sample`, whereas our pipeline does not currently support `"logit_bias"`, `"best_of"`, `"ignore_eos"`, `"use_beam_search"`.
* We map attributes from the `ChatCompletionRequest` to the text generation pipeline using the `OPENAI_TO_DEEPSPARSE_MAPPINGS` shown above.
* … the `/v1/chat/completions` endpoint

Unit Testing
Testing and Examples - Existing Server
Sample Config:
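A minimal sketch of what such a config could look like, assuming the standard `deepsparse.server` YAML schema (model stub elided):

```yaml
# Hypothetical sample_config.yaml; field names assume the deepsparse.server schema.
num_workers: 2
endpoints:
  - task: text_generation
    model: zoo:...  # model stub elided
```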
Launch a server:

```bash
deepsparse.server --config_file sample_config.yaml
```
Requests can be sent the same way they were sent previously:
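For example, a sketch using the `requests` library; the port, route, and input field are assumptions based on server defaults, not taken from this PR:

```python
import requests

# Port 5543 is the assumed deepsparse.server default; route and field name assumed.
url = "http://localhost:5543/v2/models/text_generation/infer"
response = requests.post(url, json={"sequences": "Write a haiku about sparsity."})
print(response.json())
```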
Testing and Examples - OpenAI

* `OpenAI` can be used through the integration input, similar to how `sagemaker` was used before: `deepsparse.server --config_file sample_config.yaml --integration openai`. There is also a dedicated workflow, through the `deepsparse.openai sample_config.yaml` command.
* `task_generation` (and its aliases) are supported. An example server config is shown below.
* Requests can be sent using the `requests` library, curl commands, or through the OpenAI API.
* The expected input is the `ChatCompletionRequest` given by OpenAI, and all the parameters are mapped to our `text_generation` pipeline.
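A minimal sketch of such a config, under the same assumed schema as above:

```yaml
# Hypothetical config for the OpenAI integration.
endpoints:
  - task: text_generation  # or one of its supported aliases
    model: zoo:...  # model stub elided
```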
Curl Commands:
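A plausible shape for such a request, assuming the default port and the OpenAI chat-completions schema (model name elided):

```bash
curl http://localhost:5543/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "zoo:...",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```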
Output:
OpenAI API (streaming can be enabled by setting the `stream` flag to `True`):
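A sketch of driving the server through the OpenAI Python client, assuming pre-1.0 `openai` package conventions; the base URL, key handling, and model name are assumptions:

```python
import openai

openai.api_base = "http://localhost:5543/v1"  # point the client at the local server (assumed port)
openai.api_key = "unused"  # assumption: the local server does not validate the key

response = openai.ChatCompletion.create(
    model="zoo:...",  # model stub elided
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,  # stream chunks back as they are generated
)
for chunk in response:
    print(chunk)
```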