
Work on the BPE tokenizer #3252

Merged: 35 commits, Oct 3, 2023
Conversation

@goerch (Collaborator) commented Sep 18, 2023

  • vocabulary conversion changed
  • necessary Unicode support added
  • necessary code from examples/gptneox-wip migrated
  • new tokenizer character test added for Falcon-7B

Tokenizer tests seem to work.

Fixes #3233.

@cebtenzzre (Collaborator)

FYI, the MSVC build failed.

@staviq (Collaborator) commented Sep 18, 2023

That CUDA CI is still slow because it is still fetching the full CUDA installer from the GitHub cache; Green-Sky said he's going to try clearing the cache sometime in the future :)

You can bypass that if you have access to a Windows machine: from your GitHub account settings/actions/runners, get a self-hosted runner script, and after you run it, give it a "windows-latest" tag. Then, if you run the CI manually from your own repo (fork), it will run on your machine without the cache, and that pipeline takes ~4 min.

@Green-Sky (Collaborator)

@staviq I went ahead and manually purged the caches anyway. If they are still created, I will have to take a look at the CI script again.

@staviq (Collaborator) commented Sep 18, 2023

@staviq I went ahead and manually purged the caches anyway. If they are still created, I will have to take a look at the CI script again.

Looks like the correct ones just got cached.

@KerfuffleV2 (Collaborator)

I tried testing with https://huggingface.co/beomi/llama-2-ko-7b

Unfortunately, we immediately die at

llama.cpp/llama.cpp, lines 1875 to 1880 in 8781013:

// determine the newline token: LLaMA "<0x0A>" == 10 == '\n', Falcon 193 == '\n'
if (vocab.type == LLAMA_VOCAB_TYPE_SPM) {
    vocab.linefeed_id = llama_byte_to_token(vocab, '\n');
} else {
    vocab.linefeed_id = llama_tokenize_internal(vocab, "\n", false)[0];
}

If I hard-code line 1879 to just set it to 13 (which seems to be correct for this model), it runs into another "byte not found" error.

There's a very good chance the model vocab isn't getting converted correctly. That model also requires manually generating vocab.json from tokenizer.json (a trivial Python script can do this). Aside from the extra Korean tokens, the vocab seems to be the same as a normal LLaMA model's, i.e. the start of vocab.json:

{"<unk>": 0, "<s>": 1, "</s>": 2, "<0x00>": 3, "<0x01>": 4, "<0x02>": 5, "<0x03>": 6, "<0x04>": 7

I also tried https://huggingface.co/BAAI/Aquila-7B

Basically the same issues. Also, as far as I can see, this one actually doesn't even have a newline in its vocabulary, and it has a less LLaMA-like vocab. The byte tokens at the beginning don't cover the whole ASCII range like Falcon's do, though. I remember screwing around with this and a hacked version of the Falcon UTF generator stuff in its conversion script.

I compared this with master (just with the Aquila model): on master I can get some kind of output with the newline hard-code fix, which wasn't possible in this pull. It can't find the space character ( ) in the vocab, so there are no spaces between words. It also can't seem to tokenize Chinese characters.

I hate to say it, but at least in terms of how close it visibly seems to be to working, this pull seems further away than master, since I can't get either model to run at all. One difference seems to be that "ERROR: byte not found in vocab" is just a warning on master, while it's a hard error in this pull.

@goerch (Collaborator, Author) commented Sep 19, 2023

@KerfuffleV2: thanks for taking the time!

I tried testing with https://huggingface.co/beomi/llama-2-ko-7b

Did you use an old conversion or a conversion built by convert.py? We need the original vocabulary for this PR to work. For my Falcon tests I still used and adapted convert-falcon-hf-to-gguf.py (I seem to remember problems when using convert.py).

I also tried https://huggingface.co/BAAI/Aquila-7B

OK, I have this laying around from earlier tests. I will check this out next.

@KerfuffleV2 (Collaborator) commented Sep 19, 2023

Did you use an old conversion or a conversion built by convert.py?

I converted using convert.py (since these are LLaMA models) but did a fresh conversion today.

edit:

We need the original vocabulary for this PR to work.

I'm not completely sure what you mean here. I converted from the HuggingFace model, so that would have the original vocabulary, right? Or did you mean something different?

@goerch (Collaborator, Author) commented Sep 19, 2023

I converted using convert.py (since these are LLaMA models) but did a fresh conversion today.

OK, but aren't they using 'sentencepiece' tokenization then? BPE is for GPT-2-compatible tokenization, AFAIU?

Or are we testing the special tokens problem here?

@KerfuffleV2 (Collaborator) commented Sep 19, 2023

OK, but aren't they using 'sentencepiece' tokenization then?

From tokenizer.cfg in both of those models:

"type": "BPE",

So they're at least claiming to be BPE? Aquila also has "tokenizer_class": "GPT2Tokenizer", in tokenizer_config.json

By the way, when I converted using convert.py I specified --vocabtype bpe. This sets the tokenizer model in the GGUF file to gpt2 in addition to converting the vocab in the BPE way.

Or are we testing the special tokens problem here?

I don't think so. (I'm far from an expert on this stuff though so what I think might not be useful.)
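
(A sketch of the GGUF metadata difference behind the --vocabtype bpe note above: recording "gpt2" as the tokenizer model is what makes llama.cpp pick the BPE tokenizer instead of SentencePiece. The gguf-py calls are assumed to match that era's API, and the output path is illustrative.)

import gguf

vocabtype = "bpe"  # as passed on the convert.py command line

# Write the tokenizer model name into the GGUF metadata.
gguf_writer = gguf.GGUFWriter("out.gguf", "llama")
gguf_writer.add_tokenizer_model("gpt2" if vocabtype == "bpe" else "llama")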

@goerch (Collaborator, Author) commented Sep 19, 2023

OK, but aren't they using 'sentencepiece' tokenization then?

From tokenizer.cfg in both of those models:

"type": "BPE",

So they're at least claiming to be BPE? Aquila also has "tokenizer_class": "GPT2Tokenizer", in tokenizer_config.json

Wow, LLaMA-based models with a different tokenization algorithm? Very interesting; I have no idea how this works.

By the way, when I converted using convert.py I specified --vocabtype bpe. This sets the tokenizer model in the GGUF file to gpt2 in addition to converting the vocab in the BPE way.

Yes, but convert.py doesn't contain the vocabulary fix of convert-falcon-hf-to-gguf.py (yet).

Or are we testing the special tokens problem here?

I don't think so. (I'm far from an expert on this stuff though so what I think might not be useful.)

It is very helpful, thanks!

Edit: Ah, I think I understand: they are only using the LLaMA transformer architecture.

@KerfuffleV2 (Collaborator)

I have no idea how this works.

Join the club. :) The Aquila vocab looks closer to something like Falcon's than the KR model's, which seemed to be mostly the same as normal LLaMA except for extra tokens. Maybe it's not actually supposed to be BPE, but I think I tried converting with SPM and that didn't really work either.

Yes, but convert.py doesn't contain the vocabulary fix

I wasn't really happy with #2938, but I figured it was at least better than nothing. At that point I wasn't sure how to improve it, and from what GG said it seemed like the C++ side wasn't there yet, so it wouldn't matter even if the conversion was perfect.

If you looked at the vocab for Aquila, does it seem like the Falcon vocab fix would solve the issue? One thing I noticed is that the byte vocab at the beginning didn't cover the range that bytes_to_unicode in the Falcon converter uses, so it can't be used without modifications. (Not sure if that function was part of what you're talking about.)
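
(For context, a sketch of the GPT-2-style bytes_to_unicode construction that the Falcon converter relies on, shown only to make the covered byte range explicit; this is the standard GPT-2 mapping, not code from this PR.)

def bytes_to_unicode() -> dict[int, str]:
    # Printable bytes map to themselves ...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # ... every remaining byte (0x00-0x20, 0x7F-0xA0, 0xAD) gets a printable
    # stand-in starting at U+0100, so all 256 byte values are representable.
    for b in range(2 ** 8):
        if b not in bs:
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

mapping = bytes_to_unicode()
print(len(mapping))    # 256: the whole byte range is covered
print(mapping[0x0A])   # 'Ċ' (U+010A), the stand-in for '\n'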

@goerch (Collaborator, Author) commented Sep 19, 2023

Regarding #2938: the scores are used by find_bpe_rank; the original tokenizer uses the merges to construct them (which we currently don't).

  • Adapted convert.py
  • Added a test for Aquila

Edit: I was mistaken: computing the ranks seems to be OK in SpecialVocab.
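
(A rough sketch of the relationship described here, assuming the merges are available in tokenizer.json order: the rank consulted by a find_bpe_rank-style lookup is simply the merge's position in that list. The merges shown are an illustrative subset, not a real vocabulary.)

# A real merges list comes from tokenizer.json under model.merges, in priority order.
merges = ["Ġ t", "Ġ a", "h e"]

# Rank of a pair = its index in the merges list (lower rank merges first).
bpe_ranks = {tuple(m.split(" ")): rank for rank, m in enumerate(merges)}

def find_bpe_rank(left: str, right: str) -> int:
    return bpe_ranks.get((left, right), -1)  # -1: this pair is never merged

print(find_bpe_rank("h", "e"))  # 2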

@goerch (Collaborator, Author) commented Sep 19, 2023

@klosax : do these fixes to GPT2 tokenization help with other open issues?

@KerfuffleV2 (Collaborator)

Is there anything you want me to test again at this point?

@goerch (Collaborator, Author) commented Sep 19, 2023

Is there anything you want me to test again at this point?

I just ran inference with a convert.py-converted Aquila and moved byte decoding to token_to_piece, because only the tests use the detokenizers. The output looks halfway reasonable to me. It would be nice to have feedback from your side regarding Aquila and llama-2-ko-7b. I'll check inference with Falcon next.

Edit: Falcon looks fine to me.

@cebtenzzre (Collaborator) commented Oct 3, 2023

But the token type we need in token_to_piece to suppress tokens like `BOS` and `EOS`, I believe.

This PR explicitly writes every token type as 'normal'; my PR introduced a default token type, which is 'normal'. There is no difference in behavior.

@goerch (Collaborator, Author) commented Oct 3, 2023

This PR explicitly writes every token type as 'normal'; my PR introduced a default token type, which is 'normal'. There is no difference in behavior.

Added tokens get type CONTROL.

@cebtenzzre (Collaborator)

Added tokens get type CONTROL.

That code is part of convert.py, which is only for use with LLaMA, and is not referenced by the other scripts. I did not touch convert.py in my PR.

ARCH = gguf.MODEL_ARCH.LLAMA

@goerch (Collaborator, Author) commented Oct 3, 2023

That code is part of convert.py, which is only for use with LLaMA, and is not referenced by the other scripts. I did not touch convert.py in my PR.

Point taken. I can't check it currently, but we also tested Aquila. Your comment indicates another possible problem.

@cebtenzzre (Collaborator)

Point taken. I can't check it currently, but we also tested Aquila. Your comment indicates another possible problem.

Currently this is handled by gguf.SpecialVocab in the simple scripts, and it's left to llama.cpp to handle tokenizer.ggml.bos_token_id/tokenizer.ggml.eos_token_id/tokenizer.ggml.unknown_token_id correctly.
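
(A rough sketch of the flow described here, with the gguf-py call signatures of that era treated as an assumption; the paths and architecture name are illustrative.)

from pathlib import Path

import gguf

# The "simple" conversion scripts delegate special-token handling to
# gguf.SpecialVocab, which reads the model's tokenizer/config files and emits
# tokenizer.ggml.bos_token_id / eos_token_id / unknown_token_id
# (and the BPE merges when load_merges=True).
gguf_writer = gguf.GGUFWriter("model.gguf", "falcon")

special_vocab = gguf.SpecialVocab(Path("path/to/hf_model_dir"), load_merges=True)
special_vocab.add_to_gguf(gguf_writer)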

@goerch (Collaborator, Author) commented Oct 3, 2023

Currently this is handled by gguf.SpecialVocab in the simple scripts, and it's left to llama.cpp to handle tokenizer.ggml.bos_token_id/tokenizer.ggml.eos_token_id/tokenizer.ggml.unknown_token_id correctly.

Here are the Aquila added tokens

"added_tokens": [
    {
      "id": 0,
      "content": "<|endoftext|>",
      "special": true
      [...]
    },
    {
      "id": 100000,
      "content": "<|startofpiece|>",
      "special": false
      [...]
    },
    {
      "id": 100001,
      "content": "<|endofpiece|>",
      "special": false
      [...]
    },
    {
      "id": 100002,
      "content": "<|LDWANG|>",
      "special": false
      [...]
    },
    {
      "id": 100003,
      "content": "[MASK]",
      "special": false
      [...]
    },
    {
      "id": 100004,
      "content": "[gMASK]",
      "special": false
      [...]
    },
    {
      "id": 100005,
      "content": "[sMASK]",
      "special": false
      [...]
    },
    {
      "id": 100006,
      "content": "[CLS]",
      "special": false
      [...]
    },
    {
      "id": 100007,
      "content": "</s>",
      "special": false
      [...]
    }
  ],

which all look like CONTROL tokens that should be filtered out.

Edit: added the special attribute.
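
(A sketch of the kind of classification being discussed, reading the added_tokens list above; the numeric token-type values follow the GGUF convention (1 = NORMAL, 3 = CONTROL, 4 = USER_DEFINED), and the policy itself, i.e. whether every added token or only the "special" ones becomes CONTROL, is exactly the open question here.)

import json

TOKEN_TYPE_CONTROL = 3       # GGUF token-type convention
TOKEN_TYPE_USER_DEFINED = 4

with open("tokenizer.json", encoding="utf-8") as f:
    added_tokens = json.load(f)["added_tokens"]

token_types = {}
for tok in added_tokens:
    # One possible policy: only tokens flagged "special" become CONTROL;
    # the discussion above leans toward treating all added tokens as CONTROL.
    is_control = tok.get("special", False)
    token_types[tok["id"]] = TOKEN_TYPE_CONTROL if is_control else TOKEN_TYPE_USER_DEFINED

print(token_types)  # e.g. {0: 3, 100000: 4, ...} for the Aquila list above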

@cebtenzzre (Collaborator)

Here are the Aquila added tokens

You're still talking about convert.py, I'm still not... I guess I don't understand the point you're trying to make. What should I be paying attention to here?


// determine the newline token: LLaMA "<0x0A>" == 10 == '\n', Falcon 193 == '\n'
if (vocab.type == LLAMA_VOCAB_TYPE_SPM) {
    vocab.linefeed_id = llama_byte_to_token(vocab, '\n');
} else {
-   vocab.linefeed_id = llama_tokenize_internal(vocab, "\n", false)[0];
+   vocab.linefeed_id = llama_tokenize_internal(vocab, "\u010A", false)[0];

Review comment (Collaborator):
I just noticed it: the Unicode linefeed is \u000A.

Is that intentional?


Author reply:
Yes.
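
(For context on why this is intentional: under the GPT-2 byte-to-unicode scheme used for BPE vocabularies, bytes 0x00 through 0x20 are not printable, so such a byte b is stored as the stand-in character chr(0x100 + b); the linefeed byte 0x0A therefore appears in the vocabulary as U+010A, not U+000A. A minimal check:)

newline_byte = 0x0A
# Bytes 0x00-0x20 are the first ones remapped, in order, so each maps to
# chr(0x100 + byte) under the GPT-2 byte-to-unicode table.
stand_in = chr(0x100 + newline_byte)
print(stand_in)                 # 'Ċ'
assert stand_in == "\u010A"     # matches the literal used in the diff above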

joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 5, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp: (24 commits)
  convert : fix Baichuan2 models by using vocab size in config.json (ggerganov#3299)
  readme : add project status link
  ggml : fix build after ggerganov#3329
  llm : add Refact model (ggerganov#3329)
  sync : ggml (conv 1d + 2d updates, UB fixes) (ggerganov#3468)
  finetune : readme fix typo (ggerganov#3465)
  ggml : add RISC-V Vector Support for K-Quants and improved the existing intrinsics (ggerganov#3453)
  main : consistent prefix/suffix coloring (ggerganov#3425)
  llama : fix session saving/loading (ggerganov#3400)
  llama : expose model's rope_freq_scale in the API (ggerganov#3418)
  metal : alibi for arbitrary number of heads (ggerganov#3426)
  cmake : make LLAMA_NATIVE flag actually use the instructions supported by the processor (ggerganov#3273)
  Work on the BPE tokenizer (ggerganov#3252)
  convert : fix vocab size when not defined in hparams (ggerganov#3421)
  cmake : increase minimum version for add_link_options (ggerganov#3444)
  CLBlast: Add broadcast support for matrix multiplication (ggerganov#3402)
  gguf : add BERT, MPT, and GPT-J arch info (ggerganov#3408)
  gguf : general usability improvements (ggerganov#3409)
  cmake : make CUDA flags more similar to the Makefile (ggerganov#3420)
  finetune : fix ggerganov#3404 (ggerganov#3437)
  ...
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Oct 5, 2023
@goerch (Collaborator, Author) commented Oct 6, 2023

You're still talking about convert.py, I'm still not... I guess I don't understand the point you're trying to make. What should I be paying attention to here?

Aquila is a LLaMA-like model with a BPE tokenizer, therefore we use convert.py. Do you propose to handle the BPE tokenizer differently depending on context?

@cebtenzzre (Collaborator) commented Oct 6, 2023

Do you propose to handle the BPE tokenizer differently depending on context?

No, but I have never used Aquila or looked into its conversion code so I am not familiar with how its vocabulary is processed. If you think there is something the simple scripts are missing that would ensure correct functionality of the tokenizer, you'll have to clearly explain it.

yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023
* Work on the BPE tokenizer

Tokenizer tests work for Falcon-7B

* Try to fix build problem

* Fix debug assertion failure

* Fix MSVC Unicode BOM problem

* Cleanup and an improvement

* Fix compiler warning

* Cleanup

* Test doesn't work over the full range of Unicodes

* Update .gitignore and Makefile

* Another Makefile rule

* Testing Aquila

* Moving byte decoding back to `token_to_piece` ...

... because everyone is using it.

* Guarding some unusable code pathes

* Streamlining code and adding some more assertions

Important change: I'm classifying added tokens as control tokens now for BPE.

* Adding a comment

* Adding another assertion

* Fixed vocabulary guarding assertions

* Fix PR for recent change

* Fix PR for recent change

* Fix for compiler warning

* Fix PR for recent change

* Fix PR for recent change

* Fix PR for recent change

* Fix for compiler warning

* Fixes for more compiler warnings

* Remove unused code

* Fix initialization of static maps

* Add scores and token types back, adapt gptneox

* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update unicode.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update unicode.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Ported Starcoder and added some assertions

* Fix coding style

* Apply @jploski 's fix for missing tokens

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@maddes8cht (Contributor)

There is a whole collection of .gguf falcon models derived from the "original" falcon 7b and 40b models at https://huggingface.co/maddes8cht
These models no longer run in the current llama.cpp versions, but they also cannot be converted with the new convert-falcon-hf-to-gguf.py.
I don't see the model manufacturers converting their models to the new Falcon 7b / 40b variant.
Are all these models therefore lost for the new llama.cpp versions?

@cebtenzzre (Collaborator)

Are all these models therefore lost for the new llama.cpp versions?

I am adding it back in #3680, although it really wouldn't be hard for the model authors to update their config.json files, as the model weights haven't changed.

@maddes8cht (Contributor) commented Oct 20, 2023

Ah, thanks, that is good news!
So the other derived models are now available as well.
There is now a separate Falcon-based collection for them, while the quantized versions of the original tiiuae Falcon models reside under their own collection.

@maddes8cht (Contributor)

A large batch of Falcon models

  • with the new tokenizer
  • with the restored compatibility for the older .json files
  • with the new (pseudo?) K-quants, also for 7B models

is available at https://huggingface.co/maddes8cht

cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Dec 1, 2023
It seems like upstream isn't interested in this change for the time
being [1], and we are going to break compatiblity with Nomic's previous
conversion of MPT because of changes to the BPE tokenizer [2], so let's
remove this change to minimize the diff.

This reverts commit 69c505e.

[1] ggerganov#3626
[2] ggerganov#3252
cebtenzzre added a commit that referenced this pull request Mar 27, 2024
cebtenzzre added a commit that referenced this pull request Mar 27, 2024
Labels: high priority (Very important issue)
Projects: none yet

Successfully merging this pull request may close these issues: (Re)activate test-tokenizer-0-falcon

9 participants