Mismatch to OpenAI's tokenizer? #29
Unanswered. hitsthings asked this question in Q&A. Replies: 0 comments.
I was trying to compare for correctness, and it seems OpenAI counts an extra token in their tokenizer: the first `<h` is two tokens for them.
Is that related to #19, or is there something else going on? Interestingly, the other models that gpt-tokenizer supports seem to match what's on the Tokenizer page (even though cl100k_base is listed as the gpt-3.5-turbo tokenizer).
As someone new to the repo, I'm sure I'm just missing something and this is expected. It would be great to get help understanding the gotchas around when the counts might differ.