Add llama logits processor #556

Merged: 2 commits into outlines-dev:main on Feb 16, 2024

Conversation

@dtiarks (Contributor) commented Jan 18, 2024

Note that this is super early.
We should probably provide a core implementation that all the other processors could inherit. Also the implementation of the tokenizer seems to be redundant.

Looking forward to some input.

@rlouf (Member) commented Jan 20, 2024

I think we should use a hybrid approach where we subclass Llama to be able to get the logits (useful for debugging), choose the sampling method, and give this the same interface as the generation loop used for transformers models. What do you think?
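For illustration, a minimal sketch of that hybrid, assuming llama-cpp-python's `Llama` API; the class name and method below are illustrative, not this PR's final code:

```python
from llama_cpp import Llama


class LlamaCpp(Llama):
    """Sketch: expose llama.cpp generation behind the same interface
    as the transformers generation loop."""

    def generate(self, prompt: str, logits_processor=None, **kwargs):
        # llama-cpp-python threads a logits processor through its
        # sampling loop, which also makes the logits inspectable for
        # debugging.
        return super().__call__(
            prompt, logits_processor=logits_processor, **kwargs
        )
```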

@dtiarks (Contributor, Author) commented Jan 22, 2024

Sounds good. I'm going to move the logit processor into the integration file.

@dtiarks dtiarks changed the title from "[WIP] Add llama logits processor" to "Add llama logits processor" on Jan 23, 2024
@dtiarks dtiarks marked this pull request as ready for review January 23, 2024 11:03
@lapp0 (Collaborator) left a comment:

IMHO, we should have an abstract implementation of RegexLogitsProcessor, CFGLogitsProcessor, and JSONLogitsProcessor that handles the state management logic, then subclass it for the specific interfacing with each inference library.
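A rough sketch of that shape; every name below is illustrative, not code from this PR:

```python
from abc import ABC, abstractmethod


class BaseLogitsProcessor(ABC):
    """Hypothetical base class owning the FSM state management that is
    common to regex-, CFG- and JSON-guided generation."""

    def __init__(self, tokenizer, fsm):
        self.tokenizer = tokenizer
        self.fsm = fsm
        self.fsm_state = 0

    @abstractmethod
    def __call__(self, input_ids, scores):
        """Mask `scores` down to the tokens allowed in the current FSM
        state. Subclasses adapt the array types and calling convention
        of a specific inference library (llama.cpp, vLLM, ...)."""
        ...
```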

(Resolved review thread: outlines/models/llamacpp.py)
@dtiarks (Contributor, Author) commented Jan 24, 2024

Yep, that's a good idea...

(Two resolved review threads: outlines/models/llamacpp.py)
@rlouf rlouf force-pushed the llama-logits-processor branch 2 times, most recently from ff5abb8 to adeced3, on January 26, 2024 at 11:42
@rlouf (Member) commented Jan 26, 2024

I updated the code in the PR to give you a better idea of the interface I'm thinking about (a rough sketch follows the list below). There are still several things to think about / do:

  • We need to add a stream method to the LlamaCpp class. Signature should be the same as SequenceGenerator.stream
  • We can use the Llama.__call__ method, and make sure that the signature is identical to SequenceGenerator.__call__
  • There is potential for simplification: JSONLogitsProcessor is redundant with RegexLogitsProcessor
  • For JSON-structured generation we need to format the output so we have the same interface as the one we have with transformers models. We will need to add a format_sequence method to LlamaCpp and apply it to the output of __call__. This way we could also remove the need for the json_llamacpp function.
  • If llama.cpp provides tiny models just like transformers does we should add integration tests.
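Putting those points together, the wrapper could look roughly like this; a sketch assuming the `SequenceGenerator` signatures of the time, with all names illustrative:

```python
class LlamaCpp:
    """Sketch of a wrapper mirroring SequenceGenerator's interface."""

    def __init__(self, model, logits_processor=None, format_sequence=None):
        self.model = model  # the underlying llama_cpp.Llama instance
        self.logits_processor = logits_processor
        # e.g. json.loads for JSON-structured generation, identity otherwise
        self.format_sequence = format_sequence or (lambda sequence: sequence)

    def __call__(self, prompts, max_tokens=None, stop_at=None, rng=None):
        # Same signature as SequenceGenerator.__call__, so callers can
        # swap a transformers model for a llama.cpp one transparently.
        raise NotImplementedError

    def stream(self, prompts, max_tokens=None, stop_at=None, rng=None):
        # Same signature as SequenceGenerator.stream; yields tokens as
        # they are generated.
        raise NotImplementedError
```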

@dtiarks (Contributor, Author) commented Feb 3, 2024

  • Finalize integration tests
  • Clarify whether we need multi-prompt streaming in the first iteration (@rlouf)
  • Fix failing integration tests (llama-cpp-python 0.2.30 works fine, the latest doesn't)

@rlouf (Member) commented Feb 6, 2024

We can forget about multi-prompt streaming in the first iteration, raise an informative NotImplementedError when a user asks for it, and open an issue for this.
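Something along these lines would do; a sketch, not the exact message that was merged:

```python
def stream(self, prompts, max_tokens=None, stop_at=None, rng=None):
    if isinstance(prompts, list) and len(prompts) > 1:
        # Multi-prompt streaming is deferred to a follow-up issue.
        raise NotImplementedError(
            "Streaming several prompts at once is not yet supported "
            "for llama.cpp models; please pass a single prompt."
        )
    ...
```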

(Resolved review thread: outlines/generate/text.py)
@lapp0 (Collaborator) commented Feb 9, 2024

Hi @dtiarks, thanks for implementing this! Could you let me know what work remains and if there's any way I could help?

@dtiarks (Contributor, Author) commented Feb 9, 2024

This is essentially ready for review. I talked with @rlouf about it in private, assuming he would review it.

@rlouf (Member) commented Feb 9, 2024

I will, as soon as I have time!

@rlouf (Member) left a comment:

I am currently simplifying a few things in the code. Please wait until I'm done to make changes.

(Resolved review thread: outlines/generate/processors.py)
```diff
@@ -30,8 +30,7 @@ class Character(BaseModel):


 if __name__ == "__main__":
-    # Download model from https://huggingface.co/TheBloke/phi-2-GGUF
-    model = outlines.models.llamacpp("./phi-2.Q3_K_M.gguf", device="cpu")
+    model = outlines.models.llamacpp("./my_model.gguf")
```
Member: Even better in the documentation.

@rlouf (Member) commented Feb 12, 2024

I simplified the code in this PR, among other things removing all the tokenizer code that is no longer needed. Since the code in outlines.generate.processors does not generalize, I chose to leave it out of the model.

@rlouf rlouf linked an issue Feb 12, 2024 that may be closed by this pull request
@dtiarks (Contributor, Author) commented Feb 13, 2024

> I simplified the code in this PR, among other things removing all the tokenizer code that is no longer needed. Since the code in outlines.generate.processors does not generalize, I chose to leave it out of the model.

All the processors I've looked at so far (transformers, vLLM, TensorRT-LLM, etc.) have similar interfaces, but not exactly the same one. So it makes sense to have a model-specific implementation for each.

```diff
@@ -3,6 +3,7 @@
 from outlines.fsm.fsm import RegexFSM
 from outlines.generate.api import SequenceGenerator
 from outlines.models import OpenAI
+from outlines.models.llamacpp import LlamaCpp, RegexLogitsProcessor
```
Collaborator: We should import within def regex_llamacpp; otherwise, using outlines.generate.regex and outlines.generate.json will require installing llama_cpp_python.
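The pattern would look roughly like this; a sketch, with the body reduced to the processor binding discussed later in this thread:

```python
def regex_llamacpp(model, regex_str, sampler):
    # Deferred import: llama_cpp is an optional dependency, so only
    # import it when a llama.cpp model is actually used.
    from outlines.models.llamacpp import RegexLogitsProcessor

    model.logits_processor = RegexLogitsProcessor(regex_str, model.tokenizer)
    return model
```

Note that dispatching on the LlamaCpp type still requires the type at import time, which is the design issue raised in the next comment.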

Member: We register on the LlamaCpp type, which is a subclass of Llama, so there's clearly a design issue here.

@rlouf (Member) commented Feb 13, 2024

I rebased the branch on main and fixed the merge conflicts commit by commit, so some errors may pop up. In the future, please rebase on main instead of adding a merge commit, to preserve the history.

We still need to move a few things around: as noted by @lapp0, the current code makes it necessary to install llama-cpp-python regardless of whether we're using llama.cpp or not.

@rlouf rlouf force-pushed the llama-logits-processor branch 2 times, most recently from 11ab047 to 726ec24, on February 13, 2024 at 17:59
@rlouf (Member) commented Feb 13, 2024

I found a workaround; can someone review before we merge?

(Resolved review thread: outlines/fsm/json_schema.py)
```python
)

logits_processor = RegexLogitsProcessor(regex_str, model.tokenizer)
model.logits_processor = logits_processor
```
Collaborator: Does it make sense to bind the logits processor to the model? Won't this have side effects for other generations?

Member: But a new generation would bind a different logits processor?

Collaborator: Except for outlines.generate.text. Perhaps we can just delete the attribute, if it exists, in outlines.generate.text to get this out the door, and I can address the design in a separate PR, refactoring with an abstract logits processor that has cleaner logic? A sketch of that stopgap follows.
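That stopgap could be as small as the following sketch (function name illustrative):

```python
def text_llamacpp(model, sampler=None):
    # Unset any processor left over from a previous structured
    # generation so plain text generation is unconstrained.
    if getattr(model, "logits_processor", None) is not None:
        model.logits_processor = None
    return model
```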

Member: I think it's fine for now given how __call__ works; we may revisit this in the future, though.

```python
for prompt in prompts:
    processors = []
    if self.logits_processor is not None:
        processors = [copy.copy(self.logits_processor)]
```
Collaborator: Won't the logits processors share a CFGFSM if we shallow-copy?

Member: True, logits processors should have a copy method.

Contributor (Author): We should probably open an issue for this.


```python
formatted = [self.format_sequence(sequence) for sequence in results]

return formatted if len(formatted) > 1 else formatted[0]
```
Collaborator (nit): IMO, List[prompt] should always correspond to List[generation]. I think we should condition on the type of the prompts argument rather than on the length of formatted.
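A sketch of that suggestion, conditioning on the input type:

```python
formatted = [self.format_sequence(sequence) for sequence in results]
# Mirror the shape of the input: a list of prompts always maps to a
# list of generations, even a one-element list.
if isinstance(prompts, str):
    return formatted[0]
return formatted
```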

(Resolved review thread: outlines/models/llamacpp.py)
(Resolved review thread: pyproject.toml)
@rlouf (Member) commented Feb 15, 2024

@dtiarks, you can take a last look.

@rlouf rlouf force-pushed the llama-logits-processor branch 4 times, most recently from 1e220e1 to 743b054, on February 16, 2024 at 10:15
@rlouf (Member) commented Feb 16, 2024

I added a copy method to LogitsProcessor that uses the FSMs' copy method. Fixes #672.
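For reference, a minimal sketch of such a copy method, assuming the FSM exposes its own copy():

```python
class LogitsProcessor:
    def __init__(self, tokenizer, fsm):
        self.tokenizer = tokenizer
        self.fsm = fsm

    def copy(self) -> "LogitsProcessor":
        # Share the tokenizer, but give each generation its own FSM so
        # per-sequence state is never shared across prompts.
        return LogitsProcessor(self.tokenizer, self.fsm.copy())
```

The generation loop above can then call `self.logits_processor.copy()` instead of `copy.copy(...)`.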

@rlouf rlouf linked an issue Feb 16, 2024 that may be closed by this pull request
@rlouf (Member) commented Feb 16, 2024

Merging, great job everyone!

@rlouf rlouf merged commit e99d92d into outlines-dev:main Feb 16, 2024
5 checks passed
joennlae added a commit to joennlae/text-generation-inference that referenced this pull request Mar 19, 2024
This was introduced in version 0.32 by pull request
outlines-dev/outlines#556,
which changed the function name from
`build_regex_from_object` to `build_regex_from_schema`.
This leads to an error when starting TGI in newer Docker containers.
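Downstream code that must work across this rename can shim the import; a sketch, assuming both names live in `outlines.fsm.json_schema` (the module touched earlier in this PR):

```python
try:
    # Newer outlines, after outlines-dev/outlines#556.
    from outlines.fsm.json_schema import build_regex_from_schema
except ImportError:
    # Older releases only expose the previous name.
    from outlines.fsm.json_schema import (
        build_regex_from_object as build_regex_from_schema,
    )
```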
Labels: enhancement, llama.cpp (related to the `llama.cpp` integration)
3 participants