All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- bicleaner-ai-download quiet mode.
- Update hardrules to 2.9.0; hardrules now accepts HF identifiers to load metadata.
- Full model downloads from HF now accept a local path to store the model instead of using the HF cache.
- Update HuggingFace Transformers and Hub.
- Update documentation about downloading models and managing HF cache.
- Use a single worker to download from HF, fixes timeout errors during download.
- Fix '--rules_config' parameter.
- Fix HF downloads in some slow docker instances with increased etag timeout.
- Fix builds in PIP>=23 with new Hardrules and FastSpell.
- Support tokenizing by characters (useful for Chinese).
- CLI option to configure minimum words/tokens to be omitted/replaced.
- Refactored Tokenizer class.
- Update HF hub.
- Removed external tokenizer option.
- Improved beginner's guide.
- Create only one Tokenizer object per process in noise function.
- Update Hardrules to 2.8.0.
- Better coverage of Icelandic langid.
- Updated KenLM installation instructions.
- KenLM installation.
- Upload full models to Hugging Face Hub.
- Automatic download of full models.
- Hide Tensorflow and Transformers logging messages in executable scripts.
- Redirect Keras prediction progress bar to stderr.
- Huge memory improvements during training.
- Speed improvements using padding 'longest' instead of 'max_length'.
- Models are less sensitive to the presence of a capital letter at the start of the sentence.
- Improved performance on HBS Cyrillic transliterating in models that had poor training on Cyrillic text.
- Basic test suite.
- Allow changing the base model for XLMR. Any XLMRoberta model can be used.
- Migrate to pyproject.toml and src/ tree structure, comply with PEP517, PEP518 and PEP621.
- Update to Hardrules 2.6.
- Rules can be parametrized with '--rules_config config.yaml'.
- Some rules have been refactored with better names.
- '--run_all_rules' mode to run each rule instead of stopping at first discard.
- Language identification with FastSpell.
- Better Serbo-Croatian and Slovene language detection.
- Easier installation! Now KenLM comes pre-compiled.
- The BICLEANER_AI_THREADS environment variable now controls the number of threads.
- Update HF Transformers.
- Update TensorFlow minimum version.
- Removed glove-python dependency and use own custom compilation.
- Improved download scripts, easier to install and use.
- Set inter/intra_op parallelism to 0 by default.
- Set default block size to 10k, a bit faster.
- Faster noise generation for small datasets with lower block size.
- Model argument can be provided with or without 'metadata.yaml'.
- Add citation info to README.
- Avoid generating empty sentences in omit noise.
- Restore capital letters at the beginning of the sentence in frequency noise.
- Retrocompatibility with older models.
- Compatibility of glove with Python>=3.7.
- Fix loading lite models in Python versions other than 3.8.
- Fix unbound variable lm_stats.
- Other minor fixes.
- Update hardrules to 1.2: adds score only mode.
- Bicleaner train changes:
  - Separate most of the training logic in the BaseModel class.
  - Re-factor synthetic noise build function.
  - Parallelize synthetic noise generation.
  - Add fuzzy matching noise and neighbour noise.
  - Add Decomposable Attention model.
  - Add Transformer-like model.
  - Add XLMRoberta model.
- Bicleaner classify changes:
  - Replace the old classifier with new neural models.
- Move hardrules into a separate package.