This repository has been archived by the owner on Apr 23, 2024. It is now read-only.

Doesn't consider combining characters. #59

Open
IDDT opened this issue Jan 21, 2020 · 0 comments

IDDT commented Jan 21, 2020

In several languages, some graphemes consist of several characters, typically a base character followed by one or more combining characters. For example: a + ◌̈ = ä.

YouTokenToMe assumes that every character is a grapheme on its own, so it can generate tokens that start with a combining character.

It would be beneficial to have an option, for both training and encoding, that pre-merges all combining characters into their base characters before running the actual BPE.
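
To illustrate the kind of pre-merge step meant here, a minimal sketch using only Python's standard `unicodedata` module. The function names `split_clusters` and `precompose` are hypothetical and not part of YouTokenToMe. Treating every code point in a Unicode mark category (`Mn`, `Mc`, `Me`) as "combining" is a simplifying assumption; full grapheme segmentation has more rules than this.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the mark characters that follow it.

    Hypothetical helper: a code point counts as "combining" here if its
    Unicode general category starts with "M" (Mn, Mc, Me).
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and is_mark:
            # Attach the mark to the cluster opened by the preceding base.
            clusters[-1] += ch
        else:
            # A base character (or a stray leading mark) opens a new cluster.
            clusters.append(ch)
    return clusters


def precompose(text):
    """Replace base + mark sequences with a single precomposed code point
    wherever Unicode defines one (NFC normalization)."""
    return unicodedata.normalize("NFC", text)


decomposed = "a\u0308b"               # "a" + combining diaeresis, then "b"
clusters = split_clusters(decomposed)  # two clusters: "a" with its mark, and "b"
composed = precompose(decomposed)      # two code points: U+00E4 and "b"
```

NFC alone only helps where a precomposed code point exists; for other sequences, the merge units would have to be clusters like the ones `split_clusters` returns, so that no token boundary can fall between a base and its marks.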
