This module is not ready for CJK characters #16

Closed
mashihua opened this issue Jun 8, 2023 · 3 comments
Labels
invalid This doesn't seem right

Comments

mashihua commented Jun 8, 2023

We found that this module is not ready for CJK characters. When we type the Japanese text ここに内容を入力すると、消費されるメダルの数が計算されます。 the results differ.

OpenAI shows:

[Screenshot 2023-06-08 15 11 21]

This module shows:

[Screenshot 2023-06-08 15 12 04]

The token count differs from OpenAI's.


xnohat commented Jun 8, 2023

In the first screenshot you use the GPT-3 encoder, and in the second you use the cl100k_base encoder for GPT-3.5 and GPT-4.
They are two different token encoders, so they output two different sets of tokens.
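
To illustrate why two encoders can report different counts for the same string, here is a toy sketch. It is not this module's API and not how the real encoders work internally: `countTokens`, `vocabA`, and `vocabB` are hypothetical names, the vocabularies are made up, and greedy longest-match is a simplification of the byte-pair merging that real encoders use.

```typescript
// Toy illustration only: vocabA and vocabB are invented and are NOT the real
// GPT-3 or cl100k_base vocabularies. Greedy longest-match stands in for BPE.
function countTokens(text: string, vocab: Set<string>): number {
  // Split by code point so each CJK character stays whole.
  const chars = Array.from(text);
  let count = 0;
  let i = 0;
  while (i < chars.length) {
    // Fall back to a single character as its own token.
    let matched = 1;
    // Try the longest vocabulary entry that starts at position i.
    for (let len = chars.length - i; len > 1; len--) {
      if (vocab.has(chars.slice(i, i + len).join(""))) {
        matched = len;
        break;
      }
    }
    i += matched;
    count++;
  }
  return count;
}

const text = "ここに内容を入力";
const vocabA = new Set<string>(["ここ"]);                 // few multi-character entries
const vocabB = new Set<string>(["ここに", "内容", "入力"]); // more multi-character entries

console.log(countTokens(text, vocabA)); // 7
console.log(countTokens(text, vocabB)); // 4
```

The same input yields 7 tokens under one vocabulary and 4 under the other, so a count is only comparable when both tools use the same encoder.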

@foloinfo

I checked the output for the same string with p50k_base, and it gives the same result as the OpenAI Tokenizer.
I also tested a longer string (800 characters), and the number of tokens was the same.
I think it works fine with CJK.

niieani added the invalid label Jul 18, 2024
niieani (Owner) commented Jul 18, 2024

As folks in the replies explained, you have selected the incorrect encoder. The tokenizer works correctly.

niieani closed this as not planned Jul 18, 2024