feat: accept legacy request format and response #1527
Conversation
To be most useful for code completions, this API would need to support FIM (fill-in-the-middle) models, which I believe means adding the `suffix` parameter (https://platform.openai.com/docs/api-reference/fine-tuning/create#fine-tuning-create-suffix). Here's a way I could see this working:
Hey @spew, thanks for the details! I've reviewed the docs a bit more, and it seems like the `suffix` works as in the examples below:
for reference:

example non-streaming request

curl -s localhost:3000/v1/completions \
-X POST \
-d '{
"model": "tgi",
"prompt": "What color is the sky?",
"max_tokens": 10,
"seed": 0,
"stream": false,
"decoder_input_details": false,
"suffix": " hello world"
}' \
-H 'Content-Type: application/json' | jq .

response

{
"id": "",
"object": "text_completion",
"created": 1707233606,
"model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
"system_fingerprint": "1.4.0-native",
"choices": [
{
"index": 0,
"text": " Blue.\" Well, let's break that down logically hello world",
"logprobs": null,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 6,
"completion_tokens": 10,
"total_tokens": 16
}
}

example streaming request

curl -s localhost:3000/v1/completions \
-X POST \
-d '{
"model": "tgi",
"prompt": "What color is the sky?",
"max_tokens": 10,
"seed": 0,
"stream": true,
"decoder_input_details": false,
"suffix": " hello world"
}' \
-H 'Content-Type: application/json'

responses
hi @drbh, the use case would be code completion extensions for IDEs, like GitHub Copilot. They do "fill in the middle" (FIM) completion, which improves results: when a developer stops typing in their editor, the extension sends the code before the cursor as the prefix and the code after the cursor as the suffix. I'm 99% sure that on the OpenAI side they are not simply joining the strings, but are instead intelligently feeding them into the model with the appropriate tokens marking the prefix and suffix. For example, given a completion request with a prompt and a suffix, if we were to feed it into a CodeLlama model, which has the special tokens <PRE>, <SUF>, and <MID>, those tokens would need to be interleaved with the two strings rather than the strings simply being concatenated.
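As an illustration of that interleaving (my own sketch, not code from this PR or from OpenAI), the assembly could look like this, using the same sentinel tokens that appear in the completion template later in this thread:

```rust
/// Minimal sketch: fold the code before and after the cursor into a single
/// fill-in-the-middle prompt using CodeLlama-style sentinel tokens.
/// The exact spacing and token placement here are assumptions.
fn build_fim_prompt(prefix: &str, suffix: &str) -> String {
    format!("<PRE> {prefix} <SUF> {suffix} <MID>")
}
```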
**updates**: this PR now reads the `completion_template` from the tokenizer config. For example, with a `tokenizer_config.json` ending in:

...
"spaces_between_special_tokens": false,
"tokenizer_class": "LlamaTokenizer",
"unk_token": "<unk>",
"use_default_system_prompt": false,
"completion_template": "<PRE> {{ prompt }} <SUF> {{ suffix }} <MID>"
}

example request

curl -s localhost:3000/v1/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "tgi",
"prompt": "What color is the sky?",
"max_tokens": 10,
"seed": 0,
"stream": false,
"decoder_input_details": false,
"suffix": "before night falls"
}' | jq .

response

{
"id": "",
"object": "text_completion",
"created": 1708443875,
"model": "HuggingFaceH4/zephyr-7b-beta",
"system_fingerprint": "1.4.1-native",
"choices": [
{
"index": 0,
"text": " red, orange and pink like waves of fire,",
"logprobs": null,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 21,
"completion_tokens": 10,
"total_tokens": 31
}
}

**note**: this PR is still a work in progress, and it's unclear if support for the `completion_template` will be kept.
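To see what the template does mechanically, here is a minimal sketch (my illustration, not this PR's code) of rendering that `completion_template` with minijinja, the templating crate the TGI router already uses for chat templates:

```rust
use minijinja::{context, Environment};

fn main() {
    // The completion template from the tokenizer config excerpt above.
    let mut env = Environment::new();
    env.add_template("completion", "<PRE> {{ prompt }} <SUF> {{ suffix }} <MID>")
        .unwrap();

    let rendered = env
        .get_template("completion")
        .unwrap()
        .render(context! {
            prompt => "What color is the sky?",
            suffix => "before night falls"
        })
        .unwrap();

    // Prints: <PRE> What color is the sky? <SUF> before night falls <MID>
    println!("{rendered}");
}
```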
router/src/infer.rs (outdated)

@@ -88,6 +90,10 @@ impl Infer {
            .chat_template
            .map(|t| ChatTemplate::new(t, tokenizer_config.bos_token, tokenizer_config.eos_token));

        let completion_template = tokenizer_config
            .completion_template
I don't think that's a standard amongst models:
https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/blob/main/tokenizer_config.json
AFAIK there's no need for a template...
There is if we want the API to support Fill in the Middle: #1527 (comment)
Except nothing in the HF ecosystem supports this. Therefore it doesn't exist.
And if we added such a thing we'd be committing to it, and this is not the appropriate place or project for that (we usually start implementing things like this in transformers first, since it will have to be checked for much wider usability).
The primary use case across HF would probably be this API.
Another, potentially larger, use case: a utility method that enables "easy" token encoding, letting developers send in a prefix and suffix and get back a list of tokens properly constructed with FIM tokens (as a bonus it could also add the BOS and EOS tokens); a sketch of such a helper follows this comment.
Alternatively, if we want to support FIM models with this API (we sort of do, if we want it to be most useful), we could store the per-model completion template only in this project, not in tokenizer_config.json, and worry about moving it out of the project later. This would mean having a configuration file in this project that maps model -> template. I think there is some precedent for this, since this project already has a lot of per-model loading code.
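A hedged sketch of that utility method idea (hypothetical; no such helper exists today, as noted above), built on the `tokenizers` crate and assuming the model's tokenizer has the <PRE>/<SUF>/<MID> sentinels registered as special tokens:

```rust
use tokenizers::Tokenizer;

/// Hypothetical helper: encode a prefix/suffix pair into token ids with
/// CodeLlama-style FIM sentinel tokens. This only produces correct ids if
/// the tokenizer knows <PRE>/<SUF>/<MID> as added special tokens.
fn encode_fim(tok: &Tokenizer, prefix: &str, suffix: &str) -> tokenizers::Result<Vec<u32>> {
    let text = format!("<PRE> {prefix} <SUF> {suffix} <MID>");
    // `true` lets the tokenizer add its own special tokens (e.g. BOS).
    let encoding = tok.encode(text, true)?;
    Ok(encoding.get_ids().to_vec())
}
```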
Hi @spew, thanks for the insights and feedback. We've given it some thought as a team and decided we're going to avoid supporting the completion template.
The tokenizer_config.json is defined/maintained outside of TGI, and adding TGI-specific/custom values there does not seem like the best path forward. Additionally, the template would only have defined how two strings, the prompt and the suffix, are concatenated, and this can be done by the caller with simple preprocessing (see the sketch after this comment). While we aim to align with existing tools/APIs, it does not make sense to add this complexity/overhead for a legacy API.
Really appreciate you bringing this up, though! Always good to see different angles and possibilities, thank you!
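To make "simple preprocessing by the caller" concrete, here is one possible sketch (my own, assuming a CodeLlama-style model and the sentinel tokens discussed earlier): the caller folds the suffix into the prompt itself and sends an ordinary completions request with no `suffix` field:

```rust
use serde_json::{json, Value};

/// Hypothetical caller-side helper: fold the suffix into the prompt with
/// FIM sentinel tokens so the plain /v1/completions endpoint needs no
/// template support. Token placement is an assumption for illustration.
fn completion_request_body(prefix: &str, suffix: &str) -> Value {
    json!({
        "model": "tgi",
        "prompt": format!("<PRE> {prefix} <SUF> {suffix} <MID>"),
        "max_tokens": 10,
        "stream": false
    })
}
```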
Understood! It does mean this API will be less useful for FIM models, and therefore not a drop-in replacement for some (possibly many?) tools built on the legacy OpenAI completions API. That is cumbersome when you don't control the source of those tools but want to move off of OpenAI to your own hosting.
IMO this defeats most of the value proposition of the API, but I don't have insight into all the internal reasoning.
Just do the prefix/suffix assembly before sending the prompts. It's much more generic anyway.

> but I don't have insight into all the internal reasoning.

You're more than welcome to fork this repo and maintain this yourself; we just don't have the bandwidth for that. If you want to contribute such a thing to the HF ecosystem, try implementing it in transformers first. As said, that is the place where the normalization is created and maintained across our ecosystem for the configs.
Understood. It does seem like transformers-level support for FIM models would make things a lot easier and less error-prone, so maybe there's an opportunity to do something there.
LGTM
This WIP PR (will) add support for the legacy OpenAI `v1/completions` API. This should allow TGI to be a drop-in replacement for OpenAI when using tools that rely on the completions API.

Should fix: #1468