Get language models to generate responses in a specific format reliably. Open source implementation of Synchromesh: Reliable code generation from pre-trained language models.

Synchromesh - Constrained Decoding from Language Models

This is an unofficial reimplementation of the Constrained Semantic Decoding (CSD) algorithm from the following paper:


@inproceedings{poesia2022synchromesh,
  title={Synchromesh: Reliable Code Generation from Pre-trained Language Models},
  author={Poesia*, Gabriel and Polozov*, Alex and Le, Vu and Tiwari, Ashish and Soares, Gustavo and Meek, Christopher and Gulwani, Sumit},
  booktitle={International Conference on Learning Representations},
  year={2022}
}

CSD allows you to sample from a language model (e.g., LLaMA 2, or OpenAI models that support the Completions API) while respecting constraints coming from a Completion Engine. A Completion Engine is a flexible abstraction over a left-to-right constraint generator. One completion engine provided here is a grammar engine derived from Lark: given a Lark grammar, you can sample from a language model while guaranteeing that the output will be parseable under that grammar.
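To make the abstraction concrete, here is a minimal sketch of what a completion engine looks like conceptually. The class and method names below are illustrative assumptions, not the package's actual interface: at each step, the decoder asks the engine what may legally come next given the text generated so far, and the engine answers with a regular expression.

```python
import re

class CompletionEngineSketch:
    """Illustrative only: a left-to-right constraint generator.
    complete(prefix) returns a regex that the next character of any
    valid continuation must satisfy."""

    def complete(self, prefix: str) -> re.Pattern:
        # Toy constraint: after "Answer: " only digits are allowed.
        if prefix.endswith("Answer: ") or prefix[-1:].isdigit():
            return re.compile(r"[0-9]")
        return re.compile(r".")

engine = CompletionEngineSketch()
print(engine.complete("Answer: ").fullmatch("7") is not None)  # True
print(engine.complete("Answer: ").fullmatch("x") is not None)  # False
```

A decoder that consults such an engine before committing each piece of output can never produce a string the constraints reject, no matter what the underlying model prefers.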

More instructions will be here soon. As the API stabilizes, we'll also upload the package to PyPI. For now, you can install it locally with:

(base) $ python setup.py install

It's recommended to do this from within a conda or virtualenv environment.

Example

A simple example to get you started is in example.py. Here, we first create a Lark grammar encoding simple queries to a college information system:

    college_grammar = r"""
        ?request: function " of " dept code
        function: "instructor" | "students" | "capacity" | "deptcode" | "school" | "college"
        dept:  /[A-Z]{3}/
        code: /[0-9]{3}/
    """
    # ...
    comp_engine = LarkCompletionEngine(college_grammar, 'request', False)

This initializes a completion engine for this grammar, starting from the request non-terminal. The language model will be constrained to generate strings derivable from this non-terminal (thus, "deptcode of CSE101" would be a valid string, whereas "deptcode of CSE9191" would not, since code must be exactly three digits).
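You can check what this grammar accepts without loading a model at all. The regular expression below is hand-derived from the grammar above for illustration (it is not produced by the package, and the Lark dependency is not needed for this sanity check):

```python
import re

# Hand-derived regex equivalent of the Lark `request` rule:
#   function " of " dept code
request = re.compile(
    r"(instructor|students|capacity|deptcode|school|college)"
    r" of [A-Z]{3}[0-9]{3}"
)

print(bool(request.fullmatch("deptcode of CSE101")))    # True
print(bool(request.fullmatch("deptcode of CSE9191")))   # False: code must be 3 digits
print(bool(request.fullmatch("department of CSE101")))  # False: not a listed function
```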

Now, let's ask the model to complete the following prompt:

Paraphrase the following sentences
Human: who teaches CSE101?
Assistant:instructor of CSE101
Human: how many students can enroll in PSY456?
Assistant:capacity of PSY456
Human: what's the department of BIO433?
Assistant:

Note that, following our grammar, this query should be translated to "deptcode of BIO433". But the language model wouldn't know, of course, since we haven't described this in the prompt. Even if we did, there would be no guarantee that it would follow the instructions. This is especially true for smaller models.

The example uses CSD to sample 10 strings from an arbitrary language model while respecting the grammar. By default it uses GPT-2, but feel free to change it to your favorite model. One set of outputs we might get with GPT-2 is:

gpt2 prediction: school of SET202
gpt2 prediction: deptcode of BEO433
gpt2 prediction: students of PSY456
gpt2 prediction: deptcode of BIO433
gpt2 prediction: capacity of BIO433
gpt2 prediction: instructor of BIO433
gpt2 prediction: capacity of PSY433
gpt2 prediction: capacity of BIO433
gpt2 prediction: students of PSY432
gpt2 prediction: capacity of CPU433

Not all of these are good translations (in fact, even the classes are wrong in most of them!), but all of them are accepted by our grammar. Using LLaMA 2 7B, a significantly better model, and sampling with temperature 0.2, we instead get:

LLaMA2-7B prediction: school of BIO433
LLaMA2-7B prediction: deptcode of BIO433
LLaMA2-7B prediction: instructor of BIO433
LLaMA2-7B prediction: instructor of BIO433
LLaMA2-7B prediction: school of BIO433
LLaMA2-7B prediction: instructor of BIO433
LLaMA2-7B prediction: instructor of BIO433
LLaMA2-7B prediction: instructor of BIO433
LLaMA2-7B prediction: deptcode of BIO433
LLaMA2-7B prediction: school of BIO433

This is better: at least the class is correct in all of the predictions.

This example is rather adversarial for the language models, since both have a single token for "department", to which they assign high probability, but which is ruled out by CSD. For GPT-2, the correct prediction is tokenized as ["de", "pt", "code", " of", " B", "IO", "433"], and GPT-2 does not assign very high probability to the token "de" in this context. LLaMA uses yet another tokenization. While this is not how you would do this if the goal were the most accurate results, it does show that CSD works around these tokenization intricacies and still gives you outputs that are valid regardless of which language model you choose.
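The tokenization point can be made concrete with a toy CSD-style filter: a vocabulary token survives only if appending it keeps the generated text a prefix of some string the grammar accepts. The prefix checker below is hand-written for the college grammar just for illustration; the actual implementation derives these constraints from the completion engine automatically.

```python
import re

FUNCTIONS = ["instructor", "students", "capacity", "deptcode", "school", "college"]

def is_valid_prefix(s: str) -> bool:
    """True if `s` can be extended to a string accepted by the
    college grammar: function " of " + 3 letters + 3 digits.
    Hand-written for this grammar; CSD derives this automatically."""
    for f in FUNCTIONS:
        full = f + " of "
        if len(s) <= len(full):
            if full.startswith(s):
                return True
        elif s.startswith(full):
            tail = s[len(full):]
            if re.fullmatch(r"[A-Z]{1,3}|[A-Z]{3}[0-9]{1,3}", tail):
                return True
    return False

def allowed_tokens(prefix: str, vocab: list[str]) -> list[str]:
    """CSD-style filtering: keep tokens whose concatenation with the
    prefix is still extendable to a valid string."""
    return [t for t in vocab if is_valid_prefix(prefix + t)]

vocab = ["de", "department", "pt", "code", " of", " B", "IO", "433"]
print(allowed_tokens("", vocab))               # ['de']
print(allowed_tokens("deptcode of B", vocab))  # ['IO']
```

Even though "department" is the single most likely token for both models, the filter rejects it at the first step, while the unlikely-looking "de" survives because it can still grow into "deptcode of BIO433".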
