Support for Constrained decoding #288

Closed
ojus1 opened this issue Jun 28, 2023 · 32 comments

@ojus1

ojus1 commented Jun 28, 2023

Extensive use of constrained decoding is standard practice for getting structured outputs from custom-finetuned LLMs.

Is there a plan to add support for DisjunctiveConstraint (and others) to vLLM in the near future?
How would one go about implementing this in vLLM (if I were to add a PR)?

@zhuohan123
Collaborator

Hi! We would very much welcome your contribution of this feature! I believe you can add this functionality by modifying the following places (a rough sketch of steps 1 and 2 follows the list):

  1. Add the related parameters to SamplingParams
    class SamplingParams:
  2. Implement the logic in Sampler
    class Sampler(nn.Module):
  3. To make our OpenAI frontend support this feature, add the related parameters to CompletionRequest
    class CompletionRequest(BaseModel):
    and add the parameter there when initializing SamplingParams:
    sampling_params = SamplingParams(
    n=request.n,
    best_of=request.best_of,
    presence_penalty=request.presence_penalty,
    frequency_penalty=request.frequency_penalty,
    temperature=request.temperature,
    top_p=request.top_p,
    top_k=request.top_k,
    stop=request.stop,
    ignore_eos=request.ignore_eos,
    max_tokens=request.max_tokens,
    logprobs=request.logprobs,
    use_beam_search=request.use_beam_search,
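
A rough, hypothetical sketch of what steps 1 and 2 could look like. The `allowed_token_ids` field and the `apply_constraint` helper are illustrative only; they are not part of vLLM's actual SamplingParams or Sampler:

```python
# Hypothetical sketch: add a per-request constraint field to SamplingParams
# and mask disallowed tokens in the sampler before the next token is chosen.
from dataclasses import dataclass
from typing import List, Optional

import torch


@dataclass
class SamplingParams:
    temperature: float = 1.0
    top_p: float = 1.0
    # New parameter (illustrative): if set, only these token ids may be sampled.
    allowed_token_ids: Optional[List[int]] = None


def apply_constraint(logits: torch.Tensor, params: SamplingParams) -> torch.Tensor:
    """The Sampler-side change: mask disallowed tokens before sampling."""
    if params.allowed_token_ids is None:
        return logits
    mask = torch.full_like(logits, float("-inf"))
    mask[params.allowed_token_ids] = 0.0
    return logits + mask
```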

@zacharyblank

Curious if there's been any progress on this. I've hooked up Microsoft/Guidance to vLLM, but the most powerful features aren't yet usable because of missing features in vLLM.

Thank you!

@viktor-ferenczi
Contributor

Related to #535

@viktor-ferenczi
Contributor

Related topics:

@viktor-ferenczi
Contributor

I'm going to implement this.

@MaxZabarka

I'd like to help implement this as well.

@viktor-ferenczi
Contributor

viktor-ferenczi commented Sep 30, 2023

The constraint may change during generation. For example, in the case of #1191 it depends on what the JSON schema allows for the next token, which in turn depends on where the generation currently is in the schema. We cannot use the same constraint over the whole sequence in the general case. It must also work for beam search. How can we handle that efficiently via a REST API?

@viktor-ferenczi
Contributor

viktor-ferenczi commented Sep 30, 2023

I think in the case of the REST API we could allow passing a formal description of the constraint in some generic, de facto standard format (if we can talk about such a thing this soon) like guidance. That would allow "compiling" the constraint inside the server and applying it to all generation for that sequence, including beam search.

In the case of direct vLLM calls (from Python) we could let the user pass a callback to process the logits before the token is chosen, so the probability of any unwanted tokens can be squashed to zero. It would be efficient and allow any algorithm to be used. We could then provide adapters for the above-mentioned libraries.
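
A minimal sketch of what such a callback could look like, assuming it receives the token ids generated so far plus the raw logits and returns modified logits; the allowed token ids below are placeholders:

```python
from typing import List

import torch


def constrain_logits(token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
    # Placeholder constraint: only a fixed set of token ids is allowed;
    # everything else is squashed to -inf so its probability becomes zero.
    allowed = [11, 198, 220]  # hypothetical token ids from the tokenizer
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed] = 0.0
    return logits + mask
```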

@viktor-ferenczi
Contributor

Supporting the outlines library seems to be the best approach, because:

Outlines is compatible with all models. It only interfaces with models via the next-token logits. It can be used with API-based models as well.

By contrast, jsonformer is limited to JSON only, and guidance does not have a clear way to integrate (it has spaghetti code).

@MaxZabarka

In the case of direct vLLM calls (from Python) we could let the user pass a callback to process the logits before the token is chosen, so the probability of any unwanted tokens can be squashed to zero. It would be efficient and allow any algorithm to be used. We could then provide adapters for the above-mentioned libraries.

This might be inefficient when generating structured data, for example a format like JSON, where a significant portion of the output consists of predetermined fields and symbols. Manipulating logits for every token would be wasteful because we would already know what the majority of the tokens are before generation.

A feature of guidance is that it avoids running generation for tokens that are already known. Given that speed and efficiency are important to vLLM, how would we go about implementing something like this when integrating outlines or another framework?

@viktor-ferenczi
Contributor

viktor-ferenczi commented Oct 1, 2023

Let's separate the two features:

  1. Ability to constrain the token generated (manipulate logits before the token is chosen)
  2. Ability to skip ahead if there is no choice between tokens (next token is dictated by a schema)

Since these features are largely independent, I suggest implementing them in the above order.

@viktor-ferenczi
Contributor

viktor-ferenczi commented Oct 1, 2023

Minimal prototype: #1243

@viktor-ferenczi
Contributor

viktor-ferenczi commented Oct 17, 2023

This could be implemented by finishing the LMQL integration.

@Vokturz

Vokturz commented Oct 19, 2023

As I understand it, guidance relies on the logit_bias parameter to work. Would this PR be enough? #535

I haven't tested it yet, but I think this is the way.
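
For reference, a small sketch of what a logit_bias request against an OpenAI-style completions endpoint could look like once #535 (or similar) is merged. The URL, model name, and token ids are placeholders; logit_bias maps a token id to an additive bias (clamped to -100..100 in the OpenAI API):

```python
import requests

payload = {
    "model": "my-model",                      # placeholder model name
    "prompt": "Answer strictly yes or no: ",
    "max_tokens": 1,
    # Strongly bias sampling toward two specific token ids (placeholders).
    "logit_bias": {"3363": 100, "1400": 100},
}
response = requests.post("http://localhost:8000/v1/completions", json=payload)
print(response.json())
```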

@nullpointer0xffff

+1 for supporting logit_bias and allowing libraries like guidance to use it.
There is a workaround: run the vLLM API server as a mock ChatGPT API and call it with guidance's OpenAI client, but performance degrades a lot compared with logit_bias-based output control.

@flexorRegev

flexorRegev commented Nov 7, 2023

2. Ability to skip ahead if there is no choice between tokens (next token is dictated by a schema)

How would you go about building this, given that the sampler only runs after the forward pass?
The logits processor support is already implemented and merged by @noamgat.

@noamgat
Contributor

noamgat commented Nov 8, 2023

LM Format Enforcer is a library that achieves JSON Schema decoding and supports vLLM.
There is already a sample notebook showing vLLM integration. It currently uses monkeypatching, which will be removed when the next vLLM version, with the logits processing API, is released.

(Disclosure: I am the author of the library)
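
Once that logits processing API lands, an integration could plausibly look like the sketch below. `enforce_format` is a stand-in for whatever callable a format-enforcing library would provide, not that library's real API:

```python
from typing import List

import torch
from vllm import LLM, SamplingParams


def enforce_format(token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
    # Stand-in: a real enforcer would consult its parser state (built from the
    # JSON schema and the tokens generated so far) and mask illegal tokens.
    return logits


llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=64, logits_processors=[enforce_format])
outputs = llm.generate(["Produce a JSON object describing a user:"], params)
print(outputs[0].outputs[0].text)
```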

@viktor-ferenczi
Contributor

viktor-ferenczi commented Nov 9, 2023

@noamgat Thank you very much, it is very useful.

Support via the vLLM REST API would still be great, because it would save the model loading time by using a continuously running server.

See also #1279

@rlouf

rlouf commented Nov 24, 2023

Outlines author here. The PR dottxt-ai/outlines#366 will allow easy integration into vLLM. Estimated time of completion is next week.

See dottxt-ai/outlines#163 (comment) for a diagram that summarizes the new architecture. We can work together on the integration and finding the boundary that makes the most sense for both libraries.

@pj-ml

pj-ml commented Jan 9, 2024

@rlouf did you manage to make much progress yet?

@rlouf

rlouf commented Jan 9, 2024

Yes: https://outlines-dev.github.io/outlines/reference/vllm/

More is coming (soon)!

@viktor-ferenczi
Contributor

Yes: https://outlines-dev.github.io/outlines/reference/vllm/

More is coming (soon)!

We need a similar solution integrated into vLLM by default.

I would suggest just porting over GBNF, since regular expressions cannot be fully supported (and are too complex) and JSON Schema is too restrictive for simple use cases.

@br3no
Contributor

br3no commented Feb 5, 2024

Outlines' reference implementation of the vLLM server (https://github.com/outlines-dev/outlines/blob/main/outlines/serve/serve.py) is a copy of vLLM's https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py with a few patches and add-ons.

I believe this code should live in vLLM rather than in Outlines, and there should be an analogous implementation of the OpenAI endpoint.

@viktor-ferenczi, do you think this is a promising path? I'd be willing to invest time into this.

@viktor-ferenczi
Contributor

I think that is something to be decided by the maintainers of the Outlines and vLLM projects.

Currently both projects are changing rapidly and have quite a few bugs, so maybe this is something to decide later as they stabilize.

I'm just a small contributor / user, not a decision maker here.

@br3no
Contributor

br3no commented Feb 7, 2024

@viktor-ferenczi, fair enough.

@zhuohan123 and @rlouf, what is your assessment?

@rlouf

rlouf commented Feb 7, 2024

I think it would make sense: vLLM benefits from structured generation and Outlines can refocus on its main goals.

br3no added a commit to br3no/vllm that referenced this issue Feb 8, 2024
Added support for guided decoding in `api_server` by integrating _outlines_ (https://github.com/outlines-dev/outlines).
@viktor-ferenczi
Contributor

viktor-ferenczi commented Feb 8, 2024

It would be nice to have constrained decoding out of the box, because as things stand I have to fix bugs to get it working with outlines after every single vLLM update, only to see those fixes deleted because of yet another round of changes.

@scriptator

I just read about SGLang's approach to constrained decoding. Did you consider adding that to vLLM instead of Outlines? See for example this blog article: https://lmsys.org/blog/2024-02-05-compressed-fsm/

@rlouf

rlouf commented Feb 9, 2024

SGLang's code was copied from Outlines'; they then decided to import Outlines instead and implemented that change. See also this blog post, which was published prior to theirs and explains the limits of a character-based approach.

@simon-mo
Collaborator

We now support the full range of constrained/guided decoding, powered by Outlines. Closing this as completed.

@br3no
Contributor

br3no commented Mar 20, 2024

@simon-mo is there a different process for contributing documentation, or should one just open a PR? I may have some time in three weeks…

@simon-mo
Collaborator

PRs welcome! I added sparse documentation on this at https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters, but more examples would be appreciated!
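
For anyone landing here, one example along the lines of those docs: the OpenAI-compatible server accepts guided-decoding fields (e.g. guided_choice, guided_json) as extra request parameters. The server URL and model name below are placeholders for a locally running instance:

```python
from openai import OpenAI

# Point the OpenAI client at a locally running vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # whichever model the server serves
    messages=[{"role": "user", "content": "Classify the sentiment of: 'vLLM is great!'"}],
    # Constrain the output to one of three choices via guided decoding.
    extra_body={"guided_choice": ["positive", "negative", "neutral"]},
)
print(completion.choices[0].message.content)
```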

yukavio pushed a commit to yukavio/vllm that referenced this issue Jul 3, 2024
Upstream sync 2024 06 08 (vllm-project#288) - ties to v0.4.3 of vllm-upstream

SUMMARY:
* Merge commits from
vllm-project@f68470e
to
vllm-project@1197e02
* Our GCP test instances do not have `gcc` or `clang` installed. All of
the triton kernels rely on `gcc` and `clang` to generate JITs. I
disabled these for now, but we need to get them installed (cc
@andy-neuma). All are marked with:

```python
@pytest.mark.skip("C compiler not installed in NM automation. "
                  "This codepath follows a triton pathway, which "
                  "JITs using clang or gcc. Since neither are installed "
                  "in our test instances, we need to skip this for now.")
```
* Cherry-picked in the changes associated with Fp8 weight format from
@mgoin

Note that
vllm-project@f68470e
is NOT included in this merge.

COMPARE vs UPSTREAM:
*
https://github.com/neuralmagic/nm-vllm/compare/upstream-sync-2024-06-08..vllm-project:vllm:v0.4.3
