v0.7.8

@guenthermi released this 08 Jun 13:51

Release Note Finetuner 0.7.8

This release covers Finetuner version 0.7.8, including dependencies finetuner-api 0.5.10 and finetuner-core 0.13.7.

This release contains 4 new features, 1 performance improvement, 1 refactoring, 2 bug fixes, and 1 documentation improvement.

🆕 Features

Add multilingual text encoder models

We have added support for the multilingual embedding model distiluse-base-multi (a copy of distiluse-base-multilingual-cased-v1). It supports semantic search in Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, and Turkish.
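
The new model can be used for zero-shot encoding like any other backbone. A minimal sketch, assuming finetuner.build_model accepts the new model name the same way it accepts other backbones (the sample sentences are only illustrative):

import finetuner

# Assumption: build_model accepts the new backbone name like other models
model = finetuner.build_model('distiluse-base-multi')
e_en, e_de = finetuner.encode(model, ['best affordable headphones', 'beste günstige Kopfhörer'])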

Add multilingual model for training data synthesis jobs (#750)

We now support data synthesis for datasets in languages other than English, specifically those supported by distiluse-base-multi (see above). To use it, pass the synthesis model synthesis_model_multi as the models parameter of the finetuner.synthesize function:

import finetuner
from finetuner.model import synthesis_model_multi

synthesis_run = finetuner.synthesize(
    ...,  # your query and corpus data
    models=synthesis_model_multi,
)

Support loading models directly from Jina's Hugging Face page (#751)

We will soon publish select fine-tuned models to the Hugging Face Hub. With the new Finetuner version, you can load those models directly:

import finetuner

model = finetuner.get_model('jinaai/ecommerce-sbert-model')
e1, e2 = finetuner.encode(model, ['XBox', 'Xbox One Console 500GB - Black (2015)'])

Add an option to the tracking callbacks to include zero-shot metrics in logging

Previously, tracking callbacks like WandBLogger did not record the model's evaluation results from before fine-tuning, because they only start tracking when the actual tuning begins. This release adds a log_zero_shot option to those callbacks (True by default). When enabled, Finetuner also sends the evaluation metrics calculated before training to the tracking service, as in the sketch below.
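
A minimal sketch of enabling (or disabling) the option; the model choice and dataset names are placeholders, not part of this release:

import finetuner
from finetuner.callback import EvaluationCallback, WandBLogger

run = finetuner.fit(
    model='distiluse-base-multi',  # placeholder model choice
    train_data='my-train-data',    # placeholder dataset name
    callbacks=[
        # the evaluation results, including the zero-shot run, feed the logger
        EvaluationCallback(
            query_data='my-eval-queries',  # placeholder dataset name
            index_data='my-eval-index',    # placeholder dataset name
        ),
        WandBLogger(log_zero_shot=True),  # default; set False for the old behavior
    ],
)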

🚀 Performance

Reduce memory consumption during data synthesis and make the resulting dataset more compact

We optimized data synthesis to reduce its memory consumption, which enables synthesis jobs to run on larger datasets and reduces the run-time of fine-tuning jobs using synthesized training data.

⚙ Refactoring

Increase the default num_relations from 3 to 10 for data synthesis jobs (#750)

Data synthesis is more effective when it generates a large amount of training data from small and medium-sized query datasets. Therefore, we have increased the default number of triplets generated for each query from 3 to 10. If you run data synthesis jobs with a large number of queries (>1M), consider setting the num_relations parameter to a lower value, as in the sketch below.
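
A minimal sketch of overriding the new default; the dataset names are placeholders:

import finetuner
from finetuner.model import synthesis_model_multi

synthesis_run = finetuner.synthesize(
    query_data='my-large-query-dataset',  # placeholder name
    corpus_data='my-corpus-dataset',      # placeholder name
    models=synthesis_model_multi,
    num_relations=3,  # previous default; lighter for very large query sets
)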

🐞 Bug Fixes

Change the English cross-encoder model from multilingual to an actual English model

The cross-encoder model we previously used for English was actually a multilingual one. Using a genuinely English model produces higher-quality synthetic training data, and the resulting embedding models achieve better evaluation results.

Fix create synthesis run not accepting DocumentArray as input type (#748)

Data synthesis jobs accept either a named DocumentArray object stored on Jina AI Cloud or a list of text values. However, passing file paths to locally stored DocumentArray datasets failed. This release fixes that bug, as shown below.
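
A minimal sketch of one way to use locally stored data after this fix; the file paths are placeholders, and we assume the datasets were saved with docarray's binary format:

import finetuner
from docarray import DocumentArray
from finetuner.model import synthesis_model_multi

queries = DocumentArray.load_binary('queries.da')  # placeholder local path
corpus = DocumentArray.load_binary('corpus.da')    # placeholder local path

synthesis_run = finetuner.synthesize(
    query_data=queries,
    corpus_data=corpus,
    models=synthesis_model_multi,
)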

📗 Documentation Improvements

Update data synthesis tutorial including English and multilingual models. (#750)

We have added documentation on how to apply data synthesis to datasets that include materials in languages other than English.

🤟 Contributors

We would like to thank all contributors to this release: