Support for Constrained decoding #288

Closed
ojus1 opened this issue Jun 28, 2023 · 32 comments

@ojus1

ojus1 commented Jun 28, 2023

Extensive use of constrained decoding is standard practice for getting structured outputs from custom-finetuned LLMs.

Is there a plan to add support for DisjunctiveConstraint (and others) to vLLM in the near future?
How would one go about implementing this in vLLM (if I were to add a PR)?

@zhuohan123
Collaborator

Hi! We would very much welcome your contribution of this feature! I believe you can add this functionality by modifying the following places (a rough sketch of steps 1 and 2 follows the list):

  1. Add the related parameters to SamplingParams
    class SamplingParams:
  2. Implement the logic in Sampler
    class Sampler(nn.Module):
  3. To make our OpenAI frontend support this feature, add the related parameters to CompletionRequest
    class CompletionRequest(BaseModel):
    and add the parameter there when initializing SamplingParams:
    sampling_params = SamplingParams(
    n=request.n,
    best_of=request.best_of,
    presence_penalty=request.presence_penalty,
    frequency_penalty=request.frequency_penalty,
    temperature=request.temperature,
    top_p=request.top_p,
    top_k=request.top_k,
    stop=request.stop,
    ignore_eos=request.ignore_eos,
    max_tokens=request.max_tokens,
    logprobs=request.logprobs,
    use_beam_search=request.use_beam_search,
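
A rough, hypothetical sketch of what steps 1 and 2 could look like. The `allowed_token_ids` field and the `apply_constraint` helper are illustrative only; they are not part of vLLM's actual SamplingParams or Sampler:

```python
# Hypothetical sketch: add a per-request constraint field to SamplingParams
# and mask disallowed tokens in the sampler before the next token is chosen.
from dataclasses import dataclass
from typing import List, Optional

import torch


@dataclass
class SamplingParams:
    temperature: float = 1.0
    top_p: float = 1.0
    # New parameter (illustrative): if set, only these token ids may be sampled.
    allowed_token_ids: Optional[List[int]] = None


def apply_constraint(logits: torch.Tensor, params: SamplingParams) -> torch.Tensor:
    """The Sampler-side change: mask disallowed tokens before sampling."""
    if params.allowed_token_ids is None:
        return logits
    mask = torch.full_like(logits, float("-inf"))
    mask[params.allowed_token_ids] = 0.0
    return logits + mask
```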

@zacharyblank

Curious if there's been any progress on this. I've hooked up Microsoft/Guidance to vLLM, but the most powerful features aren't yet usable because of missing features in vLLM.

Thank you!

@viktor-ferenczi
Contributor

Related to #535

@viktor-ferenczi
Contributor

Related topics:

@viktor-ferenczi
Contributor

I'm going to implement this.

@MaxZabarka

I'd like to help implement this as well.

@viktor-ferenczi
Contributor

viktor-ferenczi commented Sep 30, 2023

The constraint may change during generation. For example, in the case of #1191 it depends on what the JSON schema allows for the next token, which in turn depends on where the generation currently is in the schema. We cannot use the same constraint over the whole sequence in the general case. It must also work for beam search. How can we handle that efficiently via a REST API?

@viktor-ferenczi
Contributor

viktor-ferenczi commented Sep 30, 2023

I think in the case of the REST API we could allow passing a formal description of the constraint in some generic, de facto standard format (if we can talk about such a thing this soon) like guidance. That would allow "compiling" the constraint inside the server and applying it to all generation for that sequence, including beam search.

In the case of direct vLLM calls (from Python) we could let the user pass a callback to process the logits before the token is chosen, so the probability of any unwanted tokens can be squashed to zero. It would be efficient and allow any algorithm to be used. We could then provide adapters for the above-mentioned libraries.
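
A minimal sketch of what such a callback could look like, assuming it receives the token ids generated so far plus the raw logits and returns modified logits; the allowed token ids below are placeholders:

```python
from typing import List

import torch


def constrain_logits(token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
    # Placeholder constraint: only a fixed set of token ids is allowed;
    # everything else is squashed to -inf so its probability becomes zero.
    allowed = [11, 198, 220]  # hypothetical token ids from the tokenizer
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed] = 0.0
    return logits + mask
```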

@viktor-ferenczi
Contributor

Supporting the outlines library seems to be the best approach, because:

Outlines is compatible with all models. It only interfaces with models via the next-token logits. It can be used with API-based models as well.

By contrast, jsonformer is limited to JSON only, and guidance does not have a clear way to integrate (it has spaghetti code).

@MaxZabarka

In the case of direct vLLM calls (from Python) we could let the user pass a callback to process the logits before the token is chosen, so the probability of any unwanted tokens can be squashed to zero. It would be efficient and allow any algorithm to be used. We could then provide adapters for the above-mentioned libraries.

This might be inefficient when generating structured data, for example a format like JSON, where a significant portion of the output consists of predetermined fields and symbols. Manipulating logits for every token would be wasteful because we would already know what the majority of the tokens are before generation.

A feature of guidance is that it avoids running generation for tokens that are already known. Given that speed and efficiency are important to vLLM, how would we go about implementing something like this when integrating outlines or another framework?

@viktor-ferenczi
Contributor

viktor-ferenczi commented Oct 1, 2023

Let's separate the two features:

  1. Ability to constrain the token generated (manipulate logits before the token is chosen)
  2. Ability to skip ahead if there is no choice between tokens (next token is dictated by a schema)

Since these features are largely independent, I suggest implementing them in the above order.

@viktor-ferenczi
Contributor

viktor-ferenczi commented Oct 1, 2023

Minimal prototype: #1243

@viktor-ferenczi
Contributor

viktor-ferenczi commented Oct 17, 2023

This could be implemented by finishing the LMQL integration.

@Vokturz

Vokturz commented Oct 19, 2023

As I understand it, guidance relies on the logit_bias parameter to work. Would this PR be enough? #535

I haven't tested it yet, but I think this is the way.
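
For reference, a small sketch of what a logit_bias request against an OpenAI-style completions endpoint could look like once #535 (or similar) is merged. The URL, model name, and token ids are placeholders; logit_bias maps a token id to an additive bias (clamped to -100..100 in the OpenAI API):

```python
import requests

payload = {
    "model": "my-model",                      # placeholder model name
    "prompt": "Answer strictly yes or no: ",
    "max_tokens": 1,
    # Strongly bias sampling toward two specific token ids (placeholders).
    "logit_bias": {"3363": 100, "1400": 100},
}
response = requests.post("http://localhost:8000/v1/completions", json=payload)
print(response.json())
```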

@nullpointer0xffff

+1 for supporting logit_bias and allowing libraries like guidance to use it.
There is a workaround: run the vLLM API server as a mock ChatGPT API and call it with guidance's OpenAI client, but performance degrades a lot compared with logit_bias-based output control.

@flexorRegev

flexorRegev commented Nov 7, 2023

2. Ability to skip ahead if there is no choice between tokens (next token is dictated by a schema)

How would you go about building this, given that the sampler only runs after the forward pass?
The logits processor support is already implemented and merged by @noamgat.

@noamgat
Contributor

noamgat commented Nov 8, 2023

LM Format Enforcer is a library that achieves JSON Schema decoding and supports vLLM.
There is already a sample notebook showing vLLM integration. It currently uses monkeypatching, which will be removed when the next vLLM version, with the logits processing API, is released.

(Disclosure: I am the author of the library)
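
Once that logits processing API lands, an integration could plausibly look like the sketch below. `enforce_format` is a stand-in for whatever callable a format-enforcing library would provide, not that library's real API:

```python
from typing import List

import torch
from vllm import LLM, SamplingParams


def enforce_format(token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
    # Stand-in: a real enforcer would consult its parser state (built from the
    # JSON schema and the tokens generated so far) and mask illegal tokens.
    return logits


llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=64, logits_processors=[enforce_format])
outputs = llm.generate(["Produce a JSON object describing a user:"], params)
print(outputs[0].outputs[0].text)
```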

@viktor-ferenczi
Contributor

viktor-ferenczi commented Nov 9, 2023

@noamgat Thank you very much, it is very useful.

Support via the vLLM REST API would still be great, because it would save the model loading time by using a continuously running server.

See also #1279

@rlouf

rlouf commented Nov 24, 2023

Outlines author here. The PR dottxt-ai/outlines#366 will allow easy integration into vLLM. Estimated time of completion is next week.

See dottxt-ai/outlines#163 (comment) for a diagram that summarizes the new architecture. We can work together on the integration and finding the boundary that makes the most sense for both libraries.

@pj-ml

pj-ml commented Jan 9, 2024

@rlouf did you manage to make much progress yet?

@rlouf

rlouf commented Jan 9, 2024

Yes: https://outlines-dev.github.io/outlines/reference/vllm/

More is coming (soon)!

@viktor-ferenczi
Contributor

Yes: https://outlines-dev.github.io/outlines/reference/vllm/

More is coming (soon)!

We need a similar solution integrated into vLLM by default.

I would suggest just porting over GBNF, since regular expressions cannot be fully supported (and are too complex) and JSON Schema is too restrictive for simple use cases.

@br3no
Contributor

br3no commented Feb 5, 2024

Outlines' reference implementation of the vLLM server (https://github.com/outlines-dev/outlines/blob/main/outlines/serve/serve.py) is a copy of vLLM's https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py with a few patches and add-ons.

I believe this code should live in vLLM rather than in Outlines, and there should be an analogous implementation of the OpenAI endpoint.

@viktor-ferenczi, do you think this is a promising path? I'd be willing to invest time into this.

@viktor-ferenczi
Contributor

I think that is something to be decided by the maintainers of the Outlines and vLLM projects.

Currently both projects are changing rapidly and have quite a few bugs, so maybe this is something to decide later as they stabilize.

I'm just a small contributor / user, not a decision maker here.

@br3no
Contributor

br3no commented Feb 7, 2024

@viktor-ferenczi, fair enough.

@zhuohan123 and @rlouf, what is your assessment?

@rlouf

rlouf commented Feb 7, 2024

I think it would make sense: vLLM benefits from structured generation and Outlines can refocus on its main goals.

br3no added a commit to br3no/vllm that referenced this issue Feb 8, 2024
Added support for guided decoding in `api_server` by integrating _outlines_ (https://github.com/outlines-dev/outlines).
@viktor-ferenczi
Contributor

viktor-ferenczi commented Feb 8, 2024

It would be nice to have constrained decoding out of the box, because as things stand I have to fix bugs to get it working with outlines after every single vLLM update, only to see those fixes deleted because of yet another round of changes.

@scriptator

I just read about SGLang's approach to constrained decoding. Did you consider adding that to vLLM instead of Outlines? See for example this blog article: https://lmsys.org/blog/2024-02-05-compressed-fsm/

@rlouf

rlouf commented Feb 9, 2024

SGLang's code was copied from Outlines'; they then decided to import Outlines instead and implemented that change. See also this blog post, which was published prior to theirs and explains the limits of a character-based approach.

@simon-mo
Collaborator

We now support the full range of constrained/guided decoding, powered by Outlines. Closing this as completed.

@br3no
Contributor

br3no commented Mar 20, 2024

@simon-mo is there a different process for contributing documentation, or should one just open a PR? I may have some time in three weeks…

@simon-mo
Collaborator

PRs welcome! I added sparse documentation on this at https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters, but more examples would be appreciated!
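
For anyone landing here, one example along the lines of those docs: the OpenAI-compatible server accepts guided-decoding fields (e.g. guided_choice, guided_json) as extra request parameters. The server URL and model name below are placeholders for a locally running instance:

```python
from openai import OpenAI

# Point the OpenAI client at a locally running vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # whichever model the server serves
    messages=[{"role": "user", "content": "Classify the sentiment of: 'vLLM is great!'"}],
    # Constrain the output to one of three choices via guided decoding.
    extra_body={"guided_choice": ["positive", "negative", "neutral"]},
)
print(completion.choices[0].message.content)
```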

yukavio pushed a commit to yukavio/vllm that referenced this issue Jul 3, 2024
Upstream sync 2024 06 08 (vllm-project#288) - ties to v0.4.3 of vllm-upstream

SUMMARY:
* Merge commits from
vllm-project@f68470e
to
vllm-project@1197e02
* Our GCP test instances do not have `gcc` or `clang` installed. All of
the triton kernels rely on `gcc` and `clang` to generate JITs. I
disabled these for now, but we need to get them installed (cc
@andy-neuma). All are marked with:

```python
@pytest.mark.skip("C compiler not installed in NM automation. "
                  "This codepath follows a triton pathway, which "
                  "JITs using clang or gcc. Since neither are installed "
                  "in our test instances, we need to skip this for now.")
```
* Cherry-picked in the changes associated with Fp8 weight format from
@mgoin

Note that
vllm-project@f68470e
is NOT included in this merge.

COMPARE vs UPSTREAM:
*
https://github.com/neuralmagic/nm-vllm/compare/upstream-sync-2024-06-08..vllm-project:vllm:v0.4.3
