Unify Tokenizer Behavior and Ensure Sane Interfaces #936

Open
lapp0 opened this issue Jun 1, 2024 · 0 comments

What behavior of the library made you think about the improvement?

@brandonwillard and I were looking into the LlamaCppTokenizer and noticed a number of issues:

  • It's not made obvious that __getstate__ is used to serialize for hashing.
  • LlamaCppTokenizer and TransformerTokenizer are subclasses of the outlines Tokenizer, but the vLLM tokenizer is not.
  • LlamaCppTokenizer.__init__ doesn't load special_tokens
  • vLLM and transformers tokenizers use adapt_tokenizer, but llamacpp doesn't.
  • Tokenizers are intended to be immutable, but that isn't programmatically guaranteed.
  • __hash__ and _stablehash(serialized) are recomputed on every call rather than having their hash value cached (see the sketch after this list).
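To make the first and last points concrete, here is a minimal sketch of the current pattern. The class name and attributes (`vocabulary`, `special_tokens`) are placeholders, not the actual outlines fields: `__getstate__` quietly doubles as the serialization payload that `outlines.caching` hashes, and the digest is rebuilt on every `__hash__` call.

```python
import hashlib
import pickle


class IllustrativeTokenizer:
    """Placeholder tokenizer showing the current hashing pattern."""

    def __init__(self, vocabulary, special_tokens=()):
        self.vocabulary = dict(vocabulary)
        self.special_tokens = set(special_tokens)

    def __getstate__(self):
        # Nothing in the name says so, but this serialized state is what
        # outlines.caching ends up hashing.
        return {
            "vocabulary": self.vocabulary,
            "special_tokens": sorted(self.special_tokens),
        }

    def __hash__(self):
        # The digest is rebuilt from scratch on every call instead of being
        # cached on the instance.
        serialized = pickle.dumps(self.__getstate__())
        return int(hashlib.sha256(serialized).hexdigest(), 16) % (1 << 61)
```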

How would you like it to behave?

A lot of minor changes here. Please let me know if I'm missing something or if I've accidentally excluded something.

  • __getstate__ is a fallback for outlines.caching, and by default we implement _stablehash
  • vLLM becomes an outlines Tokenizer and uses the standard interfaces.
  • Good parameterized tests for all three tokenizers
  • outlines Tokenizer mutation is disabled
  • adapt_tokenizer is removed; each model passes itself to its respective Tokenizer constructor.
  • _stablehash and __hash__ are only calculated once per instance (see the sketch below).
  • The llamacpp tokenizer should have identical "batch decoding" behavior to the other tokenizers. link
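A rough sketch of what the unified base class could look like, using the same placeholder attributes as above. The `_frozen` flag and the `functools.cached_property` are just one possible way to disable mutation and compute the hash once (the issue describes `_stablehash(serialized)` as a method; it is modeled as a cached property here purely for illustration):

```python
import functools
import hashlib
import pickle


class Tokenizer:
    """Sketch of a unified, immutable outlines tokenizer base class."""

    def __init__(self, vocabulary, special_tokens=()):
        # Assign via object.__setattr__ so construction works before freezing.
        object.__setattr__(self, "vocabulary", dict(vocabulary))
        object.__setattr__(self, "special_tokens", frozenset(special_tokens))
        object.__setattr__(self, "_frozen", True)

    def __setattr__(self, name, value):
        # Mutation is disabled once construction is complete.
        if getattr(self, "_frozen", False):
            raise AttributeError(f"{type(self).__name__} instances are immutable")
        object.__setattr__(self, name, value)

    def __getstate__(self):
        # Fallback serialization consumed by outlines.caching.
        return {
            "vocabulary": self.vocabulary,
            "special_tokens": sorted(self.special_tokens),
        }

    @functools.cached_property
    def _stablehash(self):
        # Computed exactly once per instance, then reused by __hash__.
        serialized = pickle.dumps(self.__getstate__())
        return int(hashlib.sha256(serialized).hexdigest(), 16)

    def __hash__(self):
        return self._stablehash % (1 << 61)
```

With something along these lines, LlamaCppTokenizer, TransformerTokenizer, and the vLLM tokenizer would all inherit the same __getstate__ / _stablehash behavior, and adapt_tokenizer would no longer be needed because each model constructs its own Tokenizer subclass.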

Some of the work to fix this can be resurrected from #676

Status

On hold until ExLlamaV2 integration is complete (#807)
