Releases: IBM/unitxt
Unitxt 1.13.0 - Multi Modality and Types
New type handling capabilities
The most significant change in this release is the introduction of type serializers to unitxt.
Type serializers are in charge of taking a specific data structure, such as Table or Dialog, and serializing it to a textual representation.
You can now define tasks in unitxt that have complex types such as Table or Dialog, and define serializers that handle their transformation to text.
This allows controlling the representation of different types from the recipe API:
```python
from unitxt import load_dataset
from unitxt.struct_data_operators import SerializeTableAsMarkdown

serializer = SerializeTableAsMarkdown(shuffle_rows=True, seed=0)
dataset = load_dataset(card="cards.wikitq", template_card_index=0, serializer=serializer)
```
If you want to serialize the table differently, you can swap in any of the many available table serializers.
Defining a New Type
If you wish to define a new type with custom serializers, you can do so using the python typing library:
```python
from typing import Any, List, TypedDict

class Table(TypedDict):
    header: List[str]
    rows: List[List[Any]]
```
Once your type is ready, register it with unitxt's type handling in the code you are running:
```python
from unitxt.type_utils import register_type

register_type(Table)
```
Now your type can be used anywhere across unitxt (e.g. in task definitions or serializers).
Defining a Serializer for a Type
If you want to define a serializer for your custom type, or for any typing type combination, you can do so like this:
```python
from typing import Any, Dict

class MySerializer(SingleTypeSerializer):
    serialized_type = Table

    def serialize(self, value: Table, instance: Dict[str, Any]) -> str:
        ...  # your code to turn a value of type Table into a string
```
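Filled out as a self-contained sketch, such a serializer might render the table as a markdown pipe table. Note the `SingleTypeSerializer` base below is a minimal stand-in stub so the example runs on its own; it is not unitxt's actual class.

```python
from typing import Any, Dict, List, TypedDict

class Table(TypedDict):
    header: List[str]
    rows: List[List[Any]]

class SingleTypeSerializer:
    """Stand-in stub for unitxt's serializer base class."""
    serialized_type: type

class MyMarkdownSerializer(SingleTypeSerializer):
    serialized_type = Table

    def serialize(self, value: Table, instance: Dict[str, Any]) -> str:
        # Render the table as a markdown pipe table.
        header = "| " + " | ".join(value["header"]) + " |"
        divider = "|" + "---|" * len(value["header"])
        body = ["| " + " | ".join(str(c) for c in row) + " |" for row in value["rows"]]
        return "\n".join([header, divider, *body])

table: Table = {"header": ["name", "age"], "rows": [["Alice", 30], ["Bob", 25]]}
print(MyMarkdownSerializer().serialize(table, instance={}))
```

This prints a markdown table with a header row, a divider, and one line per data row.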
Multi-Modality
You can now process image-text to text or image-audio to text datasets in unitxt.
For example, to load the doc-vqa dataset:
```python
from unitxt import load_dataset

dataset = load_dataset(
    card="cards.doc_vqa.en",
    template="templates.qa.with_context.title",
    format="formats.models.llava_interleave",
    loader_limit=20,
)
```
Since unitxt already has data augmentation mechanisms, it is only natural to use them for images too. For example, if you want your images in grayscale:
```python
dataset = load_dataset(
    card="cards.doc_vqa.en",
    template="templates.qa.with_context.title",
    format="formats.models.llava_interleave",
    loader_limit=20,
    augmentor="augmentors.image.grey_scale",  # <= Just like the text augmenters!
)
```
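For intuition, grayscale conversion boils down to mapping each RGB pixel to a single luminance value. A minimal sketch of that transformation on a nested-list image, using the ITU-R BT.601 luma weights (the function name is illustrative, not unitxt's API):

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0-255

def to_grey_scale(image: List[List[Pixel]]) -> List[List[int]]:
    """Collapse each RGB pixel to one luma value using BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]
```

White pixels map to 255, black to 0, and pure red to 76 under these weights.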
Then, to get a model's scores on this dataset:
```python
from unitxt import evaluate
from unitxt.inference import HFLlavaInferenceEngine
from unitxt.text_utils import print_dict

inference_model = HFLlavaInferenceEngine(
    model_name="llava-hf/llava-interleave-qwen-0.5b-hf", max_new_tokens=32
)

test_dataset = dataset["test"].select(range(5))
predictions = inference_model.infer(test_dataset)
evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)

print_dict(
    evaluated_dataset[0],
    keys_to_print=["source", "media", "references", "processed_prediction", "score"],
)
```
Multi-modality support in unitxt builds upon the type handling introduced in the previous section, with two new types: Image and Audio.
What's Changed
- add revision option to hf loader by @OfirArviv in #1189
- Support dataset field in nested JSON files by @antonpibm in #1188
- Add TURL Table column type annotation task card by @csrajmohan in #1186
- Update operators.py - copy edits (grammar, consistency, clarity) by @welisheva22 in #1187
- Numeric nlg postproc by @ShirApp in #1185
- Add support for Literal, TypedDict and NewType for unitxt type checking by @elronbandel in #1191
- Scarebleu metric: remove mecab_ko and mecab_ko_dic from metric requir… by @eladven in #1197
- Add rag dataset + openai format dialog operator by @OfirArviv in #1192
- Update README.md by @elronbandel in #1198
- add decorator with init warning by @MikolajCharchut in #1200
- Add mock inference mode setting and allow testing without gen ai key by @elronbandel in #1204
- Fix using OpenAiInferenceEngine for LLMAsJudge by @yifanmai in #1194
- Add TogetherAiInferenceEngine by @yifanmai in #1203
- Fix OpenAiInferenceEngine by @yifanmai in #1193
- Add serializers to templates and reorganize and unite all templates by @elronbandel in #1195
- Add demos to task_data by @elronbandel in #1206
- Move test_context_correctness by @matanor in #1207
- Add image-text to text datasets by @elronbandel in #1211
- Refactor augmentors to be more scaleable + add image aumgentors by @elronbandel in #1212
- Fix grey scale augmentor and add to image example by @elronbandel in #1213
- Add images to UI by @elronbandel in #1216
- add unified decorator for warnings and unit tests by @MikolajCharchut in #1209
- Add templates list option to standard recipe by @elronbandel in #1219
- Use read token for huggingface datasets reading by @elronbandel in #1223
- add Llava-next system prompt by @OfirArviv in #1221
- Improve performance for huggingface tokenizer based format by @elronbandel in #1224
- Fix compute expression to use the instance variables as globals by @elronbandel in #1217
- Add generic inference engine to allow dynamic selection by the user by @eladven in #1226
- A suggested PR for issue 1106: More meaningful error message when catalog consistency fails by @dafnapension in #1201
- Add random templates for bluebench by @perlitz in #1222
- A suggested PR for issue #1214: fixed a bug in score_prefix for grouped instance scores by @dafnapension in #1228
- Add control over serizliers from recipe + improve serializers construction + allow seed for table shuffling serizliers by @elronbandel in #1229
- Fix table tasks to use default table serializers by @elronbandel in #1230
- Add concurency_limit parameter to WMLInferenceEngine by @elronbandel in #1231
- Add wml and generic based llmaj metric by @perlitz in #1227
- Update version to 1.13.0 by @elronbandel in #1232
New Contributors
- @MikolajCharchut made their first contribution in #1200
Full Changelog: 1.12.4...1.13.0
1.12.4
Main changes
- Enable to define benchmark in Unitxt by adding the ability to produce scores of groups based on task attributes and recipe metadata. For more information see https://www.unitxt.ai/en/latest/docs/benchmark.html by @elronbandel in #1130
- Enable inference/production APIs to support invocation by task without specifying a card. It enables using any task in the Unitxt catalog as an inference function. Check https://www.unitxt.ai/en/latest/docs/production.html for details (#957)
- Add support for multi-modality. For details see https://www.unitxt.ai/en/latest/docs/multimodality.html by @elronbandel in #1175
Additions to catalog
- Add ProvoQ dataset artifacts by @bnayahu in #1168
- Add Wikitq metric by @ShirApp in #1167
- Add more LLMs as judges ensembles by @pvn25 in #1171
- Add Scigen table2text task with llm_as_judge metric by @csrajmohan in #1134
New Features
- Add LLM as judge ensemble metrics, and add LLMaaJ ensemble example by @pvn25 in #1081
- Refactor RenameFields operator to Rename. The old operator is still supported but raises a deprecation warning by @elronbandel in #1123
Bug Fixes
- Make cache compatible with python 3.8 by @elronbandel in #1172
- Deprecated field used to print warning message with wrong reason @dafnapension in #1174
Documentation changes
- Update llm_as_judge.py --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1164
- Update formats.py --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1163
- Update loaders.py --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1162
- Update card.py - minor documentation changes by @welisheva22 in #1161
- Update adding_dataset.rst - a few more minor documentation changes by @welisheva22 in #1160
- Update artifact.py --- documentation edits (grammar, consistency, cla… by @welisheva22 in #1159
- Update glossary.rst --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1155
- Update helm.rst --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1154
- Update operators.py --- copy edits (grammar, consistency, clarity) - take 2 by @welisheva22 in #1158
- Docfix: Fix typo in Installation doc by @yifanmai in #1181
New Contributors
1.12.3
Main changes
- New option to use multiple templates and/or num_demos in a single dataset recipe. Unitxt will randomly sample from the provided templates and possible numbers of demos for each instance. See example: https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_templates_num_demos.py
- A warning is now generated when a metric generates a score with the same name as that of another metric and overwrites it. See more details on how to deal with conflicting metric names in https://www.unitxt.ai/en/latest/docs/adding_metric.html#metric-outputs-with-multiple-metrics
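Conceptually, the per-instance sampling works like a seeded random choice from the provided lists; a rough sketch of the idea (names here are illustrative, not unitxt's internals):

```python
import random
from typing import Dict, List, Tuple

def assign_recipes(
    instance_ids: List[str],
    templates: List[str],
    num_demos_options: List[int],
    seed: int = 42,
) -> Dict[str, Tuple[str, int]]:
    """Pick one template and one demo count per instance, reproducibly."""
    rng = random.Random(seed)
    return {
        iid: (rng.choice(templates), rng.choice(num_demos_options))
        for iid in instance_ids
    }
```

The fixed seed makes the per-instance assignment deterministic across runs.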
Non backward compatible changes in catalog
- change rag metrics name convention (e.g. "metrics.rag.mrr" -> "metrics.rag.context_correctness.mrr",) - catalog non backward compatible change by @assaftibm in #1104
- Update summarization task and templates to support multiple reference summaries - by @yoavkatz in #1126
- Fix belebele due to new convention by @elronbandel in #1145
Additions to catalog
- Add DeepSeek-Coder format and system prompt by @oktie in #1105
- Add a metric to calculate the ratio of references included in the prediction by @marukaz in #1091
- adding RAG bge metrics by @assaftibm
New Features
- Add option to run multiple templates and or num_demos in single dataset recipe. Now it is possible to give a list of templates or num_demos. Unitxt will randomly sample from the templates and for each instance assign a random template from the list. by @elronbandel in #1110
- A warning is now generated when a metric generates a score with the same name as that of another metric and overwrites it @dafnapension in #1124
- The MetricPipeline field postpreprocess_steps has been renamed to postprocess_steps. The old field (postpreprocess_steps) still exists for backward compatibility but is deprecated. by @dafnapension in #1117
- Decrease runtime of demo examples
- Add tests for RAG metrics by @matanor
- Adding dedicated Unitxt warning and error classes to link online documentation by @yoavkatz in
- The code now uses a central controllable deepcopy function by @elronbandel in #1120
Bug Fixes
- Create a dedicated nltk a mixin, for downloading all versions of punkt which needed by metrics code. by @elronbandel in #1151
- For bulk instance metrics, Replace mean function with nanmean to support aggregation in case of nan scores. by @elronbandel in #1150
- Fix helm test by @elronbandel in #1109
- Fix bug with RAG metrics: Fix use of minilm model by @assaftibm in #1115
- Fix data classification of WML model to include 'public' classification by @yoavkatz in #1118
- Fix WMLInferenceEngine by @pawelknes in #1122
- Fix belebele HF path due to new convention by @elronbandel in #1145
Documentation changes
- Improve debugging.rst wording
- Improve examples.rst wording by @welisheva22 in #1138
- Improve data_classification_policy.rst wording by @welisheva22 in #1139
- Improve rag_support.rst wording by @welisheva22 in #1139
- Improve production.rst wording by @welisheva22 in #1148
- Improve the clarity of the code examples.
- Improve load_datasets.rst wording by @welisheva22
- Improve introduction.rst wording by @welisheva22
- Improve installation.rst wording by @welisheva22
- Improve adding_format.rst wording by @welisheva22
- Improve adding_task.rst wording by @welisheva22
- Improve adding_template.rst wording by @welisheva22
- Improve adding_dataset.rst wording by @hanansinger
- improve index.rst page by @yoavkatz
- Fix link to llama blog in adding_format.rst by @andersonm-ibm in #1113
- Added example of RAG response by @yoavkatz in #1121
New Contributors
- @andersonm-ibm made their first contribution in #1113 by @welisheva22 in #1152
Unitxt 1.12.2
Main changes
- Task "input"/"output" fields renamed to "input_fields" and "reference_fields" to better reflect their meaning, and the type of each field is now defined by python class names rather than strings (str vs "str"). See example of the new syntax here: https://www.unitxt.ai/en/latest/docs/adding_task.html (old syntax still allowed)
- Ability to create an ensemble of judges. See example in https://www.unitxt.ai/en/latest/docs/examples.html#evaluate-using-ensemble-of-llm-as-a-judge-metrics
- Optimized Rouge and Meteor metrics to run faster; they now report confidence intervals by default. This causes very small variances in scores (well within the confidence interval).
- Added ability to select demonstrations that depend on the specific instance (and not only at random). See example in https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_demo_selections.py. This change alters the random selection of demos due to seed changes, but should not have any aggregated effect beyond random fluctuations.
- For LLM as Judges, the input sent to the judge is now displayed in the score field called 'judge_raw_input'.
- Support for the arena hard benchmark. See example: https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_model_using_arena_hard.py
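To illustrate what instance-dependent demo selection enables (a hand-rolled sketch, not unitxt's sampler API): for example, picking the demonstrations whose inputs are closest in length to the instance under evaluation:

```python
from typing import Dict, List

def closest_length_demos(
    instance: Dict[str, str],
    demo_pool: List[Dict[str, str]],
    num_demos: int,
) -> List[Dict[str, str]]:
    """Select the demos whose 'text' length is nearest to the instance's."""
    target = len(instance["text"])
    return sorted(demo_pool, key=lambda d: abs(len(d["text"]) - target))[:num_demos]

pool = [{"text": "ab"}, {"text": "abcdefgh"}, {"text": "abc"}]
demos = closest_length_demos({"text": "abcd"}, pool, num_demos=1)
```

Any per-instance criterion (topic, label, length, similarity) can be plugged into the sort key the same way.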
Non backward compatible changes
- changed template method names to "input_fields" and "reference_fields" (affects only people who wrote custom template code) by @yoavkatz in #1030
- Refactor Rouge and Meteor to InstanceMetric for faster score computation - this causes very small variances in scores (well within the confidence interval) by @yoavkatz in #1011
- Ability to create demo samplers based on instance (this causes changes in random selection of demos in normal mode) by @yoavkatz in #1034
Changes in Catalog
- safety and regard metrics became instance metrics and named SafetyMetric and RegardMetric by @dafnapension in #1004
- Remove financebench card since it was removed from HF by @elronbandel in #1016
- add validation to tldr, remove shuffle from billsum by @alonh in #1038
- Fix typo in japanese_llama system prompt (issue #964) by @bnayahu in #1056
- numeric nlg dataset template changes by @ShirApp in #1041
Additions to catalog
- Arena hard elad2 by @eladven and @OfirArviv in #1026
- Add flores101 by @perlitz in #1053
- Add metric "metrics.rag.retrieval_at_k" to catalog by @matanor in #1074
- Add Finqa dataset by @ShirApp in #962
- Allow rag context_id fields to be List[str] and not only List[int] by @perlitz in #1036
- Rag end to end task support (in progress) - by @benjaminsznajder in #1044, #1080
New Features
- Rename task fields "input"/"output" to "input_fields" and "reference_fields" by @luisaadanttas in #994
- Support for ensemble by metrics @eladven in #1047
- Additional inference parameters for openai and genai and simplified InferenceEngine API param passing by @pawelknes in #1019 @pawelknes in #1024
- Real types in tasks and metrics by @elronbandel in #1045
- Ability to create demo samplers based on instance by @yoavkatz in #1034
- add judge input to the LLM as Judge metric scores by @OfirArviv in #1064
Bug Fixes
- Solve problem with stripping format at LLM as a judge code. by @eladven in #1005
- Added seed to LLM as judges for consistent results by @yoavkatz in #1029
- Fixed issues with fresh install by @yoavkatz in #1037
- WML Inference Engine fix by @pawelknes in #1013
- replace type and type in type error message by @perlitz in #1035
- FinQA - filter problematic examples by @ShirApp in #1039
- demo's target prefix is now taken from demo instance by @dafnapension in #1031
- Make sure preparation times printed fully and nicely by @elronbandel in #1046
- Added prediction type to llm as judge to avoid warning by @yoavkatz in #1072
- Fixed confidence interval inconsistency when some metrics compute ci and some do not by @dafnapension in #1065
- Fix bug in data classes and add support for field overriding in fields containing types or functions by @elronbandel in #1027
- Set LoadFromIBMCloud verify to be lazy, in order to allow preparing the cards without defining FMEVAL_COS_URL by @eladven in #1021
- Added check of type of format and system prompt to LLM as judge by @yoavkatz in #1068
- Allow assigning None in overwrites when fetching artifacts with modifications by @dafnapension in #1062
- fix - building test is not working. Updated Kaggle version. by @benjaminsznajder in #1055
Documentation changes
- Update error message and documentation on unitxt local and HF version conflict by @yoavkatz in #995
- Update llm_as_judge.rst by @yoavkatz in #1085
- Update introduction.rst add the word "a" before "variety" by @welisheva22 in #1015
- Example improvements by @yoavkatz in #1022
- Add a guide for using unitxt with lm-evaluation-harness by @elronbandel in #1020
- Fix some docs titles and links by @elronbandel in #1023
- Add example of meta evaluation of llm as judge by @yoavkatz in #1025
- Update introduction.rst - - copy edits (grammar, consistency, clarity) by @welisheva22 in #1063
- Added example for selection of demos by @yoavkatz in #1052
New Contributors
We want to thank the new contributors for their first contributions!
- @welisheva22 made their first contribution in #1015
- @luisaadanttas made their first contribution in #994
- @benjaminsznajder made their first contribution in #1055
- @hanansinger made their first contribution in #1057
Unitxt 1.12.0
1.11.1
Non backward compatible changes
- The class InputOutputTemplate has the field input_format. This field becomes a required field. It means that templates should explicitly set their value to None if not using it. by @elronbandel in #982
- fix MRR RAG metric - fix MRR wiring, allow the context_ids to be a list of strings, instead of a list[list[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7. added corresponding tests. This change may change the scores of MRR metric. by @matanor in
New Features
- Add the option to specify the number of processes to use for parallel dataset loading by @csrajmohan in #974
- Add option for lazy load hf inference engine by @elronbandel in #980
- Added a format based on Huggingface format by @yoavkatz in #988
New Assets
- Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956
Bug Fixes
- Fix llama_3_ibm_genai_generic_template by @lga-zurich in #978
Documentation
- Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981
- Improve the examples table documentation by @eladven in #976
Refactoring
- Delete empty metrics folder by @elronbandel in #984
Testing and CI/CD
New Contributors
- @lga-zurich made their first contribution in #978
Full Changelog: 1.10.1...1.10.2
1.11.0 (#996)
1.10.3
1.10.2
Non backward compatible changes
- None - this release is fully compatible with the previous release.
New Features
- added num_proc parameter - Optional integer to specify the number of processes to use for parallel dataset loading by @csrajmohan in #974
- Add option to lazy load hf inference engine and fix requirements mechanism by @elronbandel in #980
- Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956
- Add metrics: domesticated safety and regard by @dafnapension in #983
- Make input_format required field in InputOutputTemplate by @elronbandel in #982
- Added a format based on Huggingface format by @yoavkatz in #988
Bug Fixes
- Fix the error at the examples table by @eladven in #976
- fix MRR RAG metric - fix MRR wiring, allow the context_ids to be a list of strings, instead of a list[list[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7. added corresponding tests. by @matanor in #969
- Fix llama_3_ibm_genai_generic_template by @lga-zurich in #978
Documentation
- Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981
Refactoring
- Delete empty metrics folder by @elronbandel in #984
Testing and CI/CD
New Contributors
- @lga-zurich made their first contribution in #978
Full Changelog: 1.10.1...1.10.2
1.10.1
Main Changes
- Continued with major improvements to the documentation, including a new code examples section with standalone python code that shows how to perform evaluation, add new datasets, compare formats, use LLMs as judges, and more. Cards for datasets from huggingface have detailed descriptions. New documentation of RAG tasks and metrics.
- `load_dataset` can now load cards defined in a python file (and not only in the catalog). See example.
- The evaluation results returned from `evaluate` now include two fields: `predictions` and `processed_predictions`. See example.
- Task fields can have defaults, so if they are not specified in the card, they get a default value. For example, multi-class classification has `text` as the default `text_type`. See example.
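The defaults mechanism can be pictured as a simple dict merge in which card-specified fields win over the task's defaults (a sketch with illustrative names, not unitxt's implementation):

```python
from typing import Any, Dict

def apply_task_defaults(
    card_fields: Dict[str, Any], task_defaults: Dict[str, Any]
) -> Dict[str, Any]:
    """Fields set on the card override the task's default values."""
    return {**task_defaults, **card_fields}
```

A card that sets only `text` still ends up with `text_type` filled in from the task defaults.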
Non backward compatible changes
You need to recreate any cards/metrics you added by running their prepare//.py file. You can create all cards simply by running python utils/prepare_all_artifacts.py. This will avoid the type error.
The AddFields operator was renamed Set, and the CopyFields operator was renamed Copy. Note that previous code should continue to work, but we renamed all existing code in the unitxt and fm-eval repos.
- Change Artifact.type to Artifact.type by @elronbandel in #933
- change CopyFields operators name to Copy by @duckling69 in #876
- Rename AddFields to Set, a name that represent its role better and concisely by @elronbandel in #903
New Features
- Allow eager execution by @elronbandel in #888
- Add view option for Task definitions in UI explorer. by @yoavkatz in #891
- Add input type checking in LoadFromDictionary by @yoavkatz in #900
- Add TokensSlice operator by @elronbandel in #902
- Make some logs critical by @elronbandel in #973
- Add LogProbInferenceEngines API and implement for OpenAI by @lilacheden in #909
- Added support for ibm-watsonx-ai inference by @pawelknes in #961
- load_dataset supports loading cards not present in local catalog by @pawelknes in #929
- Added defaults to tasks by @pawelknes in #921
- Add raw predictions and references to results by @yoavkatz in #934
- Allow ad-hoc metrics and templates (and add first version of standalone example of dataset with LLM as a judge) by @eladven in #922
- Add infer() function for end to end inference pipeline by @elronbandel in #952
Bug Fixes
- LLMaaJ implementation of MLCommons' simple-safety-tests by @bnayahu in #873
- Update gradio version on website by @elronbandel in #896
- Improve demo by @elronbandel in #898
- Fix demo and organize files by @elronbandel in #897
- Make sacrebleu robust by @yoavkatz in #892
- Fix huggingface assets to have versions and up to date readme by @elronbandel in #895
- fix(cos loader): account for slashes in cos file name by @jezekra1 in #904
- llama3 instruct and chat system prompts by @oktie in #950
- Added trust_remote_code to HF dataset query operations by @yoavkatz in #911
Documentation
- Update llm_as_judge.rst by @yoavkatz in #970
- Michal Jacovi's completed manual review of the card descriptions by @dafnapension in #883
- In card preparers, generate the tags with "singletons" rather than values paired with True by @dafnapension in #874
- Improved documentation by @yoavkatz in #886
- Update glossary.rst by @yoavkatz in #899
- Add example section to documentation by @yoavkatz in #917
- Added example of open qa using catalog by @yoavkatz in #919
- Update example intro and simplified WNLI cards by @yoavkatz in #923
- Update adding_metric.rst by @yoavkatz in #955
- RAG documentation by @yoavkatz in #928
- docs: update adding_dataset.rst by @eltociear in #927
- prepare for description= that is different from those embedded automatically by @dafnapension in #937
- Add simple LLM as a judge example, of using it without installation by @eladven in #968
- Add example of using LLM as a judge for summarization dataset. by @eladven in #965
- Improve operators documentation by @elronbandel in #942
New Assets
- Add numeric nlg dataset by @ShirApp in #882
- Add to_list_by_hyphen_space processor by @marukaz in #872
- Added tags and descriptions to safety cards by @bnayahu in #887
- Add Mt-Bench datasets + add operators by @OfirArviv in #870
- Touch up numeric nlg by @elronbandel in #889
- split train to train and validation sets in billsum by @alonh in #901
- modified wikitq, tab_fact taskcards by @ShirApp in #963
- Implementation of TruthfulQA by @bnayahu in #931
- Add bluebench cards by @perlitz in #918
- Add LlamaIndex faithfulness metric by @arielge in #971
- Expanded template support for safety cards by @bnayahu in #943
Testing and CI/CD
- Add end to end realistic test to fusion by @elronbandel in #940
- Moved test_examples to run the actual examples by @yoavkatz in #913
- Use uv for installing requirements in actions by @elronbandel in #960
- Add ability to print_dict to print selected fields by @yoavkatz in #947
- Get rid of pkg_resources dependency by @elronbandel in #932
- adapt filtering lambda to datasets 2.20 by @dafnapension in #930
- Increase preparation log to error. by @elronbandel in #959
New Contributors
Full Changelog: 1.10.0...1.10.1