Releases: huggingface/tokenizers
v0.14.0
- #1335: `AddedToken` is reworked; `is_special_token` is renamed to `special` for consistency.
- The `http` feature is now `OFF` by default, and depends on hf-hub instead of cached_path (updated cache directory, better sync implementation).
- Removed the SSL link on the Python package; it now calls huggingface_hub directly instead.
- New dependency: huggingface_hub (while we deprecate `Tokenizer.from_pretrained(...)` in favor of `Tokenizer.from_file(huggingface_hub.hf_hub_download(MODEL_ID, "tokenizer.json"))`).
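The migration described in the last note can be sketched roughly as follows. This is a minimal illustration, not the project's own code: the helper names `resolve_tokenizer_file` and `load_tokenizer` and the injectable `download` parameter are assumptions made for the example. Only `huggingface_hub.hf_hub_download` and `Tokenizer.from_file` come from the note above.

```python
from typing import Callable, Optional


def resolve_tokenizer_file(
    model_id: str,
    download: Optional[Callable[..., str]] = None,
    filename: str = "tokenizer.json",
) -> str:
    """Return a local path to `filename` for `model_id` (hypothetical helper)."""
    if download is None:
        # Imported lazily so the helper can be exercised without the package.
        from huggingface_hub import hf_hub_download

        download = hf_hub_download
    # hf_hub_download takes the repo id and the file name as its first two arguments.
    return download(model_id, filename)


def load_tokenizer(model_id: str, download: Optional[Callable[..., str]] = None):
    """Load a tokenizer from a downloaded `tokenizer.json` (hypothetical helper)."""
    from tokenizers import Tokenizer

    return Tokenizer.from_file(resolve_tokenizer_file(model_id, download))
```

Passing a custom `download` callable is only a convenience for offline use or testing; in normal use the default resolves the file through huggingface_hub and its cache.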
What's Changed
- Fix conda release by @ArthurZucker in #1211
- Fix node release by @ArthurZucker in #1212
- Printing warning to stderr. by @Narsil in #1222
- Fixing padding_left sequence_ids. by @Narsil in #1233
- Use LTO for release and benchmark builds by @csko in #1157
- fix unigram.rs test_sample() by @chris-ha458 in #1244
- implement a simple max_sentencepiece_length into BPE by @chris-ha458 in #1228
- Makes `decode` and `decode_batch` work on borrowed content. by @mfuntowicz in #1251
- Update all GH Actions with dependency on actions/checkout by @mfuntowicz in #1256
- Parallelize unigram trainer by @mishig25 in #976
- Update unigram/trainer.rs by @chris-ha458 in #1257
- Fixing broken link. by @Narsil in #1268
- fix documentation regarding regex by @chris-ha458 in #1264
- Update Cargo.toml by @chris-ha458 in #1266
- Update README.md - Broken link by @sbhavani in #1272
- [doc build] Use secrets by @mishig25 in #1273
- Improve error for truncation with too high stride by @boyleconnor in #1275
- Add unigram bytefallback by @ArthurZucker in #1217
- revise type specification by @hiroshi-matsuda-rit in #1289
- Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in #1291
- Update path name: master -> main by @bact in #1292
- import Tuple from typing by @kellymarchisio in #1295
- Fixing clippy warnings on 1.71. by @Narsil in #1296
- Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in #1299
- feat: Added CITATION.cff. by @SamuelLarkin in #1302
- Single warning for holes. by @Narsil in #1303
- Give error when initializing tokenizer with too high stride by @boyleconnor in #1306
- Handle when precompiled charsmap is empty by @kellymarchisio in #1308
- Derive clone for TrainerWrapper by @jonatanklosko in #1317
- CD backports by @chris-ha458 in #1318
- 0.13.4.rc1 by @Narsil in #1319
- Release all at once for simplicity. by @Narsil in #1320
- Fix stride condition. by @Narsil in #1321
- pyo3: update to 0.19 by @mikelui in #1322
- Add `expect()` for disabling truncation by @boyleconnor in #1316
- Re-using scritpts from safetensors. by @Narsil in #1328
- Reduce number of different revisions by 1 by @Narsil in #1329
- Python 38 arm by @Narsil in #1330
- Move to maturing mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in #1331
- Updating the docs with the new command. by @Narsil in #1333
- Update added tokens by @ArthurZucker in #1335
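Several entries in this list (#1275, #1306, #1321) concern rejecting a truncation stride that is too high. The sketch below is a simplified, pure-Python illustration of why such a check is needed; it is not the library's implementation. The function name `split_with_stride` is hypothetical, and the exact condition used by the library (which, as far as we know, also accounts for added special tokens) is an assumption not reproduced here.

```python
from typing import List


def split_with_stride(ids: List[int], max_length: int, stride: int) -> List[List[int]]:
    """Split `ids` into windows of at most `max_length` items, where
    consecutive windows overlap by `stride` items (illustrative only)."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if stride < 0 or stride >= max_length:
        # With stride >= max_length the window would never advance.
        raise ValueError(
            f"stride ({stride}) must be non-negative and strictly less "
            f"than max_length ({max_length})"
        )
    step = max_length - stride
    windows: List[List[int]] = []
    for start in range(0, len(ids), step):
        windows.append(ids[start:start + max_length])
        if start + max_length >= len(ids):
            break  # the last window already reaches the end of the input
    return windows
```

For example, ten ids with `max_length=4` and `stride=1` produce three windows that each share one id with their neighbor, while `stride=4` raises an error up front instead of looping forever.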
New Contributors
- @csko made their first contribution in #1157
- @chris-ha458 made their first contribution in #1244
- @sbhavani made their first contribution in #1272
- @boyleconnor made their first contribution in #1275
- @hiroshi-matsuda-rit made their first contribution in #1289
- @bact made their first contribution in #1292
- @kellymarchisio made their first contribution in #1295
- @SamuelLarkin made their first contribution in #1302
- @jonatanklosko made their first contribution in #1317
- @mikelui made their first contribution in #1322
Full Changelog: v0.13.3...v0.14.0
v0.14.0.rc1
Reworks the release pipeline. Other breaking changes are mostly related to #1335, where `AddedToken` is reworked.
What's Changed
- pyo3: update to 0.19 by @mikelui in #1322
- Add `expect()` for disabling truncation by @boyleconnor in #1316
- Re-using scritpts from safetensors. by @Narsil in #1328
- Reduce number of different revisions by 1 by @Narsil in #1329
- Python 38 arm by @Narsil in #1330
- Move to maturing mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in #1331
- Updating the docs with the new command. by @Narsil in #1333
- Update added tokens by @ArthurZucker in #1335
New Contributors
Full Changelog: v0.13.4.rc2...v0.14.0.rc1
v0.13.4.rc3
Mostly checking the new release scripts actually work.
What's Changed
- pyo3: update to 0.19 by @mikelui in #1322
- Add `expect()` for disabling truncation by @boyleconnor in #1316
- Re-using scritpts from safetensors. by @Narsil in #1328
New Contributors
Full Changelog: v0.13.4.rc2...v0.13.4.rc3
v0.13.4.rc2
Python v0.13.4.rc1
What's Changed
- Update all GH Actions with dependency on actions/checkout by @mfuntowicz in #1256
- Parallelize unigram trainer by @mishig25 in #976
- Update unigram/trainer.rs by @chris-ha458 in #1257
- Fixing broken link. by @Narsil in #1268
- fix documentation regarding regex by @chris-ha458 in #1264
- Update Cargo.toml by @chris-ha458 in #1266
- Update README.md - Broken link by @sbhavani in #1272
- [doc build] Use secrets by @mishig25 in #1273
- Improve error for truncation with too high stride by @boyleconnor in #1275
- Add unigram bytefallback by @ArthurZucker in #1217
- revise type specification by @hiroshi-matsuda-rit in #1289
- Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in #1291
- Update path name: master -> main by @bact in #1292
- import Tuple from typing by @kellymarchisio in #1295
- Fixing clippy warnings on 1.71. by @Narsil in #1296
- Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in #1299
- feat: Added CITATION.cff. by @SamuelLarkin in #1302
- Single warning for holes. by @Narsil in #1303
- Give error when initializing tokenizer with too high stride by @boyleconnor in #1306
- Handle when precompiled charsmap is empty by @kellymarchisio in #1308
- Derive clone for TrainerWrapper by @jonatanklosko in #1317
- CD backports by @chris-ha458 in #1318
- 0.13.4.rc1 by @Narsil in #1319
New Contributors
- @sbhavani made their first contribution in #1272
- @boyleconnor made their first contribution in #1275
- @hiroshi-matsuda-rit made their first contribution in #1289
- @bact made their first contribution in #1292
- @kellymarchisio made their first contribution in #1295
- @SamuelLarkin made their first contribution in #1302
- @jonatanklosko made their first contribution in #1317
Full Changelog: v0.13.4-rc2...v0.13.4.rc1
v0.13.4-rc2: Makes `decode` and `decode_batch` work on borrowed content. (#1251)
Pre-release
- Makes `decode` and `decode_batch` work on borrowed content.
- Make `decode_batch` work with borrowed content.
- Fix lint.
- Attempt to map it into Node.
- Second attempt.
- Step by step.
- One more step.
- Fix lint.
- Please ...
- Removing collect.
- Revert "Removing collect." This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
v0.13.4-rc1
Never gonna make you cry
Rust v0.13.3
What's Changed
- Update pr docs actions by @mishig25 in #1101
- Adding rust audit. by @Narsil in #1099
- Revert "Update pr docs actions" by @mishig25 in #1107
- Bump loader-utils from 1.4.0 to 1.4.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1108
- Include license file in Rust crate by @ankane in #1115
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /bindings/node by @dependabot in #1116
- [FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in #1120
- Fixing conda ssl location by @Narsil in #1124
- Adding stale bot ? by @Narsil in #1123
- Bump minimatch from 3.0.4 to 3.1.2 in /bindings/node by @dependabot in #1126
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1125
- Bump cached-path from 0.5 to 0.6 by @hvaara in #1127
- Wrap rustdoc html entity in code block by @hvaara in #1130
- Fix broken links in docs by @hvaara in #1133
- Bump derive_builder from 0.9 to 0.12 by @hvaara in #1129
- Ignore Cargo.lock for subfolders by @hvaara in #1131
- Fix one char super tiny typo by @fzyzcjy in #1137
- [FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in #1136
- Bump json5, copy-webpack-plugin, webpack and webpack-cli in /tokenizers/examples/unstable_wasm/www by @dependabot in #1139
- Bump json5 from 2.2.0 to 2.2.3 in /bindings/node by @dependabot in #1140
- Add missing build targets by @Narsil in #1145
- Adding python 3.8 for M1 by @Narsil in #1147
- Made dirs optional by @ankane in #1148
- Update info on environment variable for threading by @mert-kurttutan in #1150
- Making `Tokenizer` clone. by @Narsil in #1152
- Prevent using `from_pretrained` on invalid ids (better error message). by @Narsil in #1153
- Improved version. by @Narsil in #1154
- Update model.rs by @thomasw21 in #1166
- Using clippy 1.67 by @Narsil in #1167
- pyo3 v0.18 migration by @mert-kurttutan in #1173
- Bump webpack from 5.75.0 to 5.76.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1181
- Fixing infinite loop in UnigramTrainer. by @Narsil in #1182
- Bump dirs from 3.0 to 4.0 by @hvaara in #1142
- Adding ByteFallback support for `tokenizers`. by @Narsil in #1183
- Faster `datasets` train example by @lhoestq in #1192
- Adding `Replace` to decoder (to undo the Replace Normalizer for Metaspace split). by @Narsil in #1195
- Creating `normalizers.Prepend` (To be used instead of `Metaspace`). by @Narsil in #1194
- Adding 2 new decoders: by @Narsil in #1196
- Fixing decoder strip because of char boundaries. by @Narsil in #1197
- Add `content` to Strip decoder to allow decoding mid tokens. by @Narsil in #1199
- New version 0.13.3 by @Narsil in #1205
- New release by @ArthurZucker in #1207
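As background for the ByteFallback entry (#1183), the idea behind byte-fallback decoding can be illustrated with a small pure-Python sketch. It assumes byte tokens are spelled `<0xNN>` (the SentencePiece-style convention) and that runs of such tokens are decoded together as UTF-8. The function name `byte_fallback_decode` is hypothetical, and the library's own decoder may handle invalid byte sequences differently.

```python
import re
from typing import List

# Assumed spelling of a byte token: "<0x" + two hex digits + ">".
BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def byte_fallback_decode(tokens: List[str]) -> str:
    """Join tokens into a string, turning runs of byte tokens into text."""
    out: List[str] = []
    pending = bytearray()  # bytes collected from consecutive byte tokens

    def flush() -> None:
        # Decode the collected bytes as UTF-8; invalid bytes become U+FFFD.
        if pending:
            out.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    for tok in tokens:
        match = BYTE_TOKEN.match(tok)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            out.append(tok)
    flush()
    return "".join(out)
```

Here the three byte tokens `<0xE2>`, `<0x82>`, `<0xAC>` decode together to the single character `€`, which is how a vocabulary without that character can still represent it.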
New Contributors
- @ankane made their first contribution in #1115
- @SeongBeomLEE made their first contribution in #1120
- @hvaara made their first contribution in #1127
- @fzyzcjy made their first contribution in #1137
- @mert-kurttutan made their first contribution in #1150
- @lhoestq made their first contribution in #1192
Full Changelog: v0.13.2...v0.13.3
Python v0.13.3
What's Changed
- Update pr docs actions by @mishig25 in #1101
- Adding rust audit. by @Narsil in #1099
- Revert "Update pr docs actions" by @mishig25 in #1107
- Bump loader-utils from 1.4.0 to 1.4.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1108
- Include license file in Rust crate by @ankane in #1115
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /bindings/node by @dependabot in #1116
- [FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in #1120
- Fixing conda ssl location by @Narsil in #1124
- Adding stale bot ? by @Narsil in #1123
- Bump minimatch from 3.0.4 to 3.1.2 in /bindings/node by @dependabot in #1126
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1125
- Bump cached-path from 0.5 to 0.6 by @hvaara in #1127
- Wrap rustdoc html entity in code block by @hvaara in #1130
- Fix broken links in docs by @hvaara in #1133
- Bump derive_builder from 0.9 to 0.12 by @hvaara in #1129
- Ignore Cargo.lock for subfolders by @hvaara in #1131
- Fix one char super tiny typo by @fzyzcjy in #1137
- [FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in #1136
- Bump json5, copy-webpack-plugin, webpack and webpack-cli in /tokenizers/examples/unstable_wasm/www by @dependabot in #1139
- Bump json5 from 2.2.0 to 2.2.3 in /bindings/node by @dependabot in #1140
- Add missing build targets by @Narsil in #1145
- Adding python 3.8 for M1 by @Narsil in #1147
- Made dirs optional by @ankane in #1148
- Update info on environment variable for threading by @mert-kurttutan in #1150
- Making `Tokenizer` clone. by @Narsil in #1152
- Prevent using `from_pretrained` on invalid ids (better error message). by @Narsil in #1153
- Improved version. by @Narsil in #1154
- Update model.rs by @thomasw21 in #1166
- Using clippy 1.67 by @Narsil in #1167
- pyo3 v0.18 migration by @mert-kurttutan in #1173
- Bump webpack from 5.75.0 to 5.76.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1181
- Fixing infinite loop in UnigramTrainer. by @Narsil in #1182
- Bump dirs from 3.0 to 4.0 by @hvaara in #1142
- Adding ByteFallback support for `tokenizers`. by @Narsil in #1183
- Faster `datasets` train example by @lhoestq in #1192
- Adding `Replace` to decoder (to undo the Replace Normalizer for Metaspace split). by @Narsil in #1195
- Creating `normalizers.Prepend` (To be used instead of `Metaspace`). by @Narsil in #1194
- Adding 2 new decoders: by @Narsil in #1196
- Fixing decoder strip because of char boundaries. by @Narsil in #1197
- Add `content` to Strip decoder to allow decoding mid tokens. by @Narsil in #1199
New Contributors
- @ankane made their first contribution in #1115
- @SeongBeomLEE made their first contribution in #1120
- @hvaara made their first contribution in #1127
- @fzyzcjy made their first contribution in #1137
- @mert-kurttutan made their first contribution in #1150
- @lhoestq made their first contribution in #1192
Full Changelog: node-v0.13.2...python-v0.13.3rc1
Node v0.13.3
What's Changed
- Update pr docs actions by @mishig25 in #1101
- Adding rust audit. by @Narsil in #1099
- Revert "Update pr docs actions" by @mishig25 in #1107
- Bump loader-utils from 1.4.0 to 1.4.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1108
- Include license file in Rust crate by @ankane in #1115
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /bindings/node by @dependabot in #1116
- [FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in #1120
- Fixing conda ssl location by @Narsil in #1124
- Adding stale bot ? by @Narsil in #1123
- Bump minimatch from 3.0.4 to 3.1.2 in /bindings/node by @dependabot in #1126
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1125
- Bump cached-path from 0.5 to 0.6 by @hvaara in #1127
- Wrap rustdoc html entity in code block by @hvaara in #1130
- Fix broken links in docs by @hvaara in #1133
- Bump derive_builder from 0.9 to 0.12 by @hvaara in #1129
- Ignore Cargo.lock for subfolders by @hvaara in #1131
- Fix one char super tiny typo by @fzyzcjy in #1137
- [FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in #1136
- Bump json5, copy-webpack-plugin, webpack and webpack-cli in /tokenizers/examples/unstable_wasm/www by @dependabot in #1139
- Bump json5 from 2.2.0 to 2.2.3 in /bindings/node by @dependabot in #1140
- Add missing build targets by @Narsil in #1145
- Adding python 3.8 for M1 by @Narsil in #1147
- Made dirs optional by @ankane in #1148
- Update info on environment variable for threading by @mert-kurttutan in #1150
- Making `Tokenizer` clone. by @Narsil in #1152
- Prevent using `from_pretrained` on invalid ids (better error message). by @Narsil in #1153
- Improved version. by @Narsil in #1154
- Update model.rs by @thomasw21 in #1166
- Using clippy 1.67 by @Narsil in #1167
- pyo3 v0.18 migration by @mert-kurttutan in #1173
- Bump webpack from 5.75.0 to 5.76.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1181
- Fixing infinite loop in UnigramTrainer. by @Narsil in #1182
- Bump dirs from 3.0 to 4.0 by @hvaara in #1142
- Adding ByteFallback support for `tokenizers`. by @Narsil in #1183
- Faster `datasets` train example by @lhoestq in #1192
- Adding `Replace` to decoder (to undo the Replace Normalizer for Metaspace split). by @Narsil in #1195
- Creating `normalizers.Prepend` (To be used instead of `Metaspace`). by @Narsil in #1194
- Adding 2 new decoders: by @Narsil in #1196
- Fixing decoder strip because of char boundaries. by @Narsil in #1197
- Add `content` to Strip decoder to allow decoding mid tokens. by @Narsil in #1199
New Contributors
- @ankane made their first contribution in #1115
- @SeongBeomLEE made their first contribution in #1120
- @hvaara made their first contribution in #1127
- @fzyzcjy made their first contribution in #1137
- @mert-kurttutan made their first contribution in #1150
- @lhoestq made their first contribution in #1192
Full Changelog: node-v0.13.2...python-v0.13.3rc1