feat: accept legacy request format and response #1527
Conversation
To be most useful for code completions, this API would need to support FIM (fill-in-the-middle) models, which I believe means adding the `suffix` parameter (https://platform.openai.com/docs/api-reference/fine-tuning/create#fine-tuning-create-suffix). Here's a way I could see this working:
Hey @spew, thanks for the details! I've reviewed the docs a bit more, and it seems like the `suffix` works as in the examples below:
for reference:

example non-streaming request

curl -s localhost:3000/v1/completions \
-X POST \
-d '{
"model": "tgi",
"prompt": "What color is the sky?",
"max_tokens": 10,
"seed": 0,
"stream": false,
"decoder_input_details": false,
"suffix": " hello world"
}' \
-H 'Content-Type: application/json' | jq .

response

{
"id": "",
"object": "text_completion",
"created": 1707233606,
"model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
"system_fingerprint": "1.4.0-native",
"choices": [
{
"index": 0,
"text": " Blue.\" Well, let's break that down logically hello world",
"logprobs": null,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 6,
"completion_tokens": 10,
"total_tokens": 16
}
}

example streaming request

curl -s localhost:3000/v1/completions \
-X POST \
-d '{
"model": "tgi",
"prompt": "What color is the sky?",
"max_tokens": 10,
"seed": 0,
"stream": true,
"decoder_input_details": false,
"suffix": " hello world"
}' \
-H 'Content-Type: application/json'

responses
hi @drbh, the use case would be code completion extensions for IDEs, like GitHub Copilot. They do "fill in the middle" (FIM) completion, which improves results: when a developer stops typing in their editor, the extension sends the code before the cursor as the prefix and the code after the cursor as the suffix. I'm 99% sure that on the OpenAI side they are not simply joining the strings, but are instead intelligently feeding them into the model with the appropriate tokens marking the prefix and suffix. For example, given a completion request with a prompt and a suffix, if we were to feed it into a CodeLlama model, which has the special tokens <PRE>, <SUF>, and <MID>, those tokens would need to be interleaved with the two strings rather than the strings simply being concatenated.
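As an illustration of that interleaving (my own sketch, not code from this PR or from OpenAI), the assembly could look like this, using the same sentinel tokens that appear in the completion template later in this thread:

```rust
/// Minimal sketch: fold the code before and after the cursor into a single
/// fill-in-the-middle prompt using CodeLlama-style sentinel tokens.
/// The exact spacing and token placement here are assumptions.
fn build_fim_prompt(prefix: &str, suffix: &str) -> String {
    format!("<PRE> {prefix} <SUF> {suffix} <MID>")
}
```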
**updates**: this PR now reads the `completion_template` from the tokenizer config. For example, with a `tokenizer_config.json` ending in:

...
"spaces_between_special_tokens": false,
"tokenizer_class": "LlamaTokenizer",
"unk_token": "<unk>",
"use_default_system_prompt": false,
"completion_template": "<PRE> {{ prompt }} <SUF> {{ suffix }} <MID>"
}

example request

curl -s localhost:3000/v1/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "tgi",
"prompt": "What color is the sky?",
"max_tokens": 10,
"seed": 0,
"stream": false,
"decoder_input_details": false,
"suffix": "before night falls"
}' | jq .

response

{
"id": "",
"object": "text_completion",
"created": 1708443875,
"model": "HuggingFaceH4/zephyr-7b-beta",
"system_fingerprint": "1.4.1-native",
"choices": [
{
"index": 0,
"text": " red, orange and pink like waves of fire,",
"logprobs": null,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 21,
"completion_tokens": 10,
"total_tokens": 31
}
}

**note**: this PR is still a work in progress, and it's unclear if support for the `completion_template` will be kept.
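To see what the template does mechanically, here is a minimal sketch (my illustration, not this PR's code) of rendering that `completion_template` with minijinja, the templating crate the TGI router already uses for chat templates:

```rust
use minijinja::{context, Environment};

fn main() {
    // The completion template from the tokenizer config excerpt above.
    let mut env = Environment::new();
    env.add_template("completion", "<PRE> {{ prompt }} <SUF> {{ suffix }} <MID>")
        .unwrap();

    let rendered = env
        .get_template("completion")
        .unwrap()
        .render(context! {
            prompt => "What color is the sky?",
            suffix => "before night falls"
        })
        .unwrap();

    // Prints: <PRE> What color is the sky? <SUF> before night falls <MID>
    println!("{rendered}");
}
```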
router/src/infer.rs (outdated)

@@ -88,6 +90,10 @@ impl Infer {
            .chat_template
            .map(|t| ChatTemplate::new(t, tokenizer_config.bos_token, tokenizer_config.eos_token));

        let completion_template = tokenizer_config
            .completion_template
I don't think that's a standard amongst models:
https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/blob/main/tokenizer_config.json
AFAIK there's no need for a template...
There is if we want the API to support Fill in the Middle: #1527 (comment)
Except nothing in the HF ecosystem supports this. Therefore it doesn't exist.
And if we added such a thing we'd be committing to it, and this is not the appropriate place or project for that (we usually start implementing things like this in transformers first, since it will have to be checked for much wider usability).
The primary use case across HF would probably be this API.
Another, potentially larger, use case: a utility method that enables "easy" token encoding, letting developers send in a prefix and suffix and get back a list of tokens properly constructed with FIM tokens (as a bonus it could also add the BOS and EOS tokens); a sketch of such a helper follows this comment.
Alternatively, if we want to support FIM models with this API (we sort of do, if we want it to be most useful), we could store the per-model completion template only in this project, not in tokenizer_config.json, and worry about moving it out of the project later. This would mean having a configuration file in this project that maps model -> template. I think there is some precedent for this, since this project already has a lot of per-model loading code.
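A hedged sketch of that utility method idea (hypothetical; no such helper exists today, as noted above), built on the `tokenizers` crate and assuming the model's tokenizer has the <PRE>/<SUF>/<MID> sentinels registered as special tokens:

```rust
use tokenizers::Tokenizer;

/// Hypothetical helper: encode a prefix/suffix pair into token ids with
/// CodeLlama-style FIM sentinel tokens. This only produces correct ids if
/// the tokenizer knows <PRE>/<SUF>/<MID> as added special tokens.
fn encode_fim(tok: &Tokenizer, prefix: &str, suffix: &str) -> tokenizers::Result<Vec<u32>> {
    let text = format!("<PRE> {prefix} <SUF> {suffix} <MID>");
    // `true` lets the tokenizer add its own special tokens (e.g. BOS).
    let encoding = tok.encode(text, true)?;
    Ok(encoding.get_ids().to_vec())
}
```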
Hi @spew, thanks for the insights and feedback. We've given it some thought as a team and decided we're going to avoid supporting the completion template.
The tokenizer_config.json is defined/maintained outside of TGI, and adding TGI-specific/custom values there does not seem like the best path forward. Additionally, the template would only have defined how two strings, the prompt and the suffix, are concatenated, and this can be done by the caller with simple preprocessing (see the sketch after this comment). While we aim to align with existing tools/APIs, it does not make sense to add this complexity/overhead for a legacy API.
Really appreciate you bringing this up, though! Always good to see different angles and possibilities, thank you!
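To make "simple preprocessing by the caller" concrete, here is one possible sketch (my own, assuming a CodeLlama-style model and the sentinel tokens discussed earlier): the caller folds the suffix into the prompt itself and sends an ordinary completions request with no `suffix` field:

```rust
use serde_json::{json, Value};

/// Hypothetical caller-side helper: fold the suffix into the prompt with
/// FIM sentinel tokens so the plain /v1/completions endpoint needs no
/// template support. Token placement is an assumption for illustration.
fn completion_request_body(prefix: &str, suffix: &str) -> Value {
    json!({
        "model": "tgi",
        "prompt": format!("<PRE> {prefix} <SUF> {suffix} <MID>"),
        "max_tokens": 10,
        "stream": false
    })
}
```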
Understood! It does mean this API will be less useful for FIM models, and therefore not a drop-in replacement for some (possibly many?) tools built on the legacy OpenAI completions API. That is cumbersome when you don't control the source of those tools but want to move off of OpenAI to your own hosting.
IMO this defeats most of the value proposition of the API, but I don't have insight into all the internal reasoning.
Just do the prefix/suffix assembly before sending the prompts. It's much more generic anyway.

> but I don't have insight into all the internal reasoning.

You're more than welcome to fork this repo and maintain this yourself; we just don't have the bandwidth for that. If you want to contribute such a thing to the HF ecosystem, try implementing it in transformers first. As said, that is the place where the normalization is created and maintained across our ecosystem for the configs.
Understood. It does seem like transformers-level support for FIM models would make things a lot easier and less error-prone, so maybe there's an opportunity to do something there.
LGTM
This WIP PR (will) add support for the legacy OpenAI `v1/completions` API. This should allow TGI to be a drop-in replacement for OpenAI when using tools that rely on the completions API.

Should fix: #1468