Minor Terminology and Documentation Updates for Local Tokenizer Loading #134
I've been closely monitoring the DataTrove project and utilizing it in my workflow due to its efficient pipelining capabilities. Thanks for all the hard work on this project. Really appreciate it.
Given our environment's restriction on external internet access, the ability to load tokenizers from local files rather than exclusively from Hugging Face (HF) is crucial for us. I was delighted to discover that recent updates have added the capability to load tokenizers locally.
Although I had prepared to contribute this specific feature, upon noticing its implementation, I opted to make some supplementary updates instead. These include renaming `tokenizer_name` to `tokenizer_name_or_path` and refining the related documentation to better align with the new functionality.
I welcome any feedback or suggestions for further refinements. Thank you for your ongoing efforts to enhance DataTrove.
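For reviewers, here is a minimal sketch of the behaviour the `tokenizer_name_or_path` name is meant to convey: one argument that may be either a local tokenizer file or a Hub model name. This is not DataTrove's actual implementation. `resolve_tokenizer_source` and `load_tokenizer` are hypothetical helpers written for illustration, and the sketch assumes the Hugging Face `tokenizers` library is installed for the final loading step.

```python
import os
from typing import Tuple


def resolve_tokenizer_source(tokenizer_name_or_path: str) -> Tuple[str, str]:
    """Classify the argument as a local file or a Hub model name.

    Hypothetical helper: returns ("file", path) when the argument points
    to an existing file on disk, otherwise ("hub", name).
    """
    if os.path.isfile(tokenizer_name_or_path):
        return ("file", tokenizer_name_or_path)
    return ("hub", tokenizer_name_or_path)


def load_tokenizer(tokenizer_name_or_path: str):
    """Load a tokenizer from a local file if one exists, else from the Hub."""
    # Imported lazily so the dispatch helper above works without the library.
    from tokenizers import Tokenizer

    kind, value = resolve_tokenizer_source(tokenizer_name_or_path)
    if kind == "file":
        # No network access needed: reads a serialized tokenizer.json file.
        return Tokenizer.from_file(value)
    # Falls back to downloading from the Hugging Face Hub.
    return Tokenizer.from_pretrained(value)
```

In an environment without internet access, passing the path to a local `tokenizer.json` would take the first branch and never touch the network.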