llama : improve BPE pre-processing + LLaMA 3 and Deepseek support #6920
Merged
61 commits
6fbab2d merged the changes from deepseeker models to main branch (jaggzh)
d2cfc22 Moved regex patterns to unicode.cpp and updated unicode.h (dragnil1)
54f93eb Moved header files (dragnil1)
1c924e4 Resolved issues (dragnil1)
4056dc5 added and refactored unicode_regex_split and related functions (dragnil1)
c8e7d95 Updated/merged the deepseek coder pr (jaggzh)
4c3e882 Refactored code (dragnil1)
a5710a4 Adding unicode regex mappings (dragnil1)
7e308ed Adding unicode regex function (dragnil1)
feeaf4f Added needed functionality, testing remains (dragnil1)
7535803 Fixed issues (dragnil1)
36d9832 Fixed issue with gpt2 regex custom preprocessor (dragnil1)
06d3e69 unicode : fix? unicode_wstring_to_utf8 (ggerganov)
c56e19d lint : fix whitespaces (ggerganov)
7a44e44 tests : add tokenizer tests for numbers (ggerganov)
d999cf6 unicode : remove redundant headers (ggerganov)
aeafb43 tests : remove and rename tokenizer test scripts (ggerganov)
e1b2bf7 tests : add sample usage (ggerganov)
ed42711 gguf-py : reader prints warnings on duplicate keys (ggerganov)
4907e41 llama : towards llama3 tokenization support (wip) (ggerganov)
e8c206b unicode : shot in the dark to fix tests on Windows (ggerganov)
e989176 unicode : first try custom implementations (ggerganov)
e3f6dc7 Merge branch 'master' into gg/bpe-preprocess (ggerganov)
9b4d63a convert : add "tokenizer.ggml.pre" GGUF KV (wip) (ggerganov)
43e12ce llama : use new pre-tokenizer type (ggerganov)
1b9b79d convert : fix pre-tokenizer type writing (ggerganov)
8791e94 lint : fix (ggerganov)
a774d70 make : add test-tokenizer-0-llama-v3 (ggerganov)
c160818 wip (ggerganov)
96965f6 models : add llama v3 vocab file (ggerganov)
ad92983 llama : adapt punctuation regex + add llama 3 regex (ggerganov)
4434c9d minor (ggerganov)
a22645c unicode : set bomb (ggerganov)
2affd0b unicode : set bomb (ggerganov)
ce5485a unicode : always use std::wregex (ggerganov)
91eaa41 unicode : support \p{N}, \p{L} and \p{P} natively (ggerganov)
581c4a0 unicode : try fix windows (ggerganov)
b97add5 unicode : category support via std::regex (ggerganov)
d63cc90 Merge branch 'master' into gg/bpe-preprocess (ggerganov)
e972e6c unicode : clean-up (ggerganov)
ee6d1b3 unicode : simplify (ggerganov)
7642973 convert : add convert-hf-to-gguf-update.py (ggerganov)
4e3e6d8 lint : update (ggerganov)
1c888eb convert : add falcon (ggerganov)
1545550 unicode : normalize signatures (ggerganov)
491f233 lint : fix (ggerganov)
e8dd4a1 lint : fix (ggerganov)
02fd977 convert : remove unused functions (ggerganov)
0f9058c convert : add comments (ggerganov)
7808150 convert : exercise contractions (ggerganov)
7b1210f lint : fix (ggerganov)
ef4cca9 cmake : refactor test targets (ggerganov)
43708d2 tests : refactor vocab tests (ggerganov)
c68d259 tests : add more vocabs and tests (ggerganov)
af05268 unicode : cleanup (ggerganov)
c21ab18 scripts : ignore new update script in check-requirements.sh (ggerganov)
120cf37 models : add phi-3, mpt, gpt-2, starcoder (ggerganov)
9a7d430 tests : disable obsolete (ggerganov)
6d6ce93 tests : use faster bpe test (ggerganov)
3202676 llama : more prominent warning for old BPE models (ggerganov)
80cb312 tests : disable test-tokenizer-1-bpe due to slowness (ggerganov)
@dragnil1 Not sure if this is the intent, but the following change of this function makes the tokenizer tests pass on my Mac. Do you think this is OK to change?
This change converts a UCS-2 or UCS-4/UTF-32 encoded `std::wstring` to a UTF-8 encoded `std::string`, whereas the previous version converted a UTF-16 encoded `std::wstring` to a UTF-8 encoded `std::string`, according to the reference. Both work on Ubuntu (tested), but I am not sure about Windows, since it uses a UTF-16 encoded `std::wstring`.
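To illustrate why the encoding of `std::wstring` matters across platforms, here is a minimal hand-rolled sketch. It is not the code from this PR: the names `append_utf8` and `wstring_to_utf8_portable` are hypothetical, and the approach (branching on `sizeof(wchar_t)`) is just one possible way to handle both cases. Where `wchar_t` is 32 bits wide (typical on Linux and macOS), each element is treated as a full code point. Where it is 16 bits wide (typical on Windows), surrogate pairs are combined first.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: append one Unicode code point to `out` as UTF-8.
static void append_utf8(std::string & out, uint32_t cp) {
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical portable conversion (an assumption for illustration only).
// No validation is performed: unpaired surrogates are encoded as-is.
static std::string wstring_to_utf8_portable(const std::wstring & ws) {
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        uint32_t cp = static_cast<uint32_t>(ws[i]);
        // With a 16-bit wchar_t the string is UTF-16, so a high surrogate
        // followed by a low surrogate encodes a single code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < ws.size()) {
            const uint32_t lo = static_cast<uint32_t>(ws[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i; // consume the low surrogate as well
            }
        }
        append_utf8(out, cp);
    }
    return out;
}
```

With this sketch, a character outside the Basic Multilingual Plane such as `L"\U0001F600"` should produce the same four UTF-8 bytes whether the compiler stores it as one 32-bit element or as a surrogate pair.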