Release Unitxt 1.12.0 · IBM/unitxt

Main changes

Task "input"/"output" fields were renamed to "input_fields" and "reference_fields" to better reflect their meaning, and the type of each field is now defined by Python class names rather than strings (str vs "str"). See an example of the new syntax here:
https://www.unitxt.ai/en/latest/docs/adding_task.html (old syntax still allowed)
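As a rough illustration of the rename and the move from type strings to Python classes, the sketch below converts an old-style field description into the new form. It is plain Python and does not call unitxt. The helper name `migrate_task_fields`, the dictionary layout, and the small type table are assumptions made for this example, and the old key names follow the wording of the note above.

```python
from typing import Any, Dict, List

# Hypothetical lookup table: type names written as strings -> Python types.
# Only a handful of entries are listed, for illustration.
TYPE_BY_NAME: Dict[str, Any] = {
    "str": str,
    "int": int,
    "float": float,
    "bool": bool,
    "List[str]": List[str],
}

# Old key -> new key, following the rename described in the release note.
RENAMED_KEYS = {"input": "input_fields", "output": "reference_fields"}


def migrate_task_fields(old_spec: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Return a copy of old_spec with renamed keys and real types (hypothetical helper)."""
    new_spec: Dict[str, Dict[str, Any]] = {}
    for key, fields in old_spec.items():
        new_key = RENAMED_KEYS.get(key, key)
        # Strings are looked up in the table; values that are already types pass through.
        new_spec[new_key] = {
            name: (TYPE_BY_NAME[t] if isinstance(t, str) else t)
            for name, t in fields.items()
        }
    return new_spec


old_style = {
    "input": {"question": "str", "choices": "List[str]"},
    "output": {"answer": "str"},
}
new_style = migrate_task_fields(old_style)
```

The resulting dictionary uses `str` and `List[str]` objects instead of their string spellings, which is the distinction the note draws with `str` vs `"str"`.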
Ability to create an ensemble of judges. See the example at https://www.unitxt.ai/en/latest/docs/examples.html#evaluate-using-ensemble-of-llm-as-a-judge-metrics
Optimized Rouge and Meteor metrics to run faster; they now report confidence intervals by default. This causes very small variations in scores (well within the confidence interval).
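One common way to obtain a confidence interval from per-instance scores is a percentile bootstrap over those scores. The sketch below shows that generic technique only; it is not taken from the unitxt code base, and the function name `bootstrap_ci`, the number of resamples, and the example scores are all assumptions.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(scores: List[float], n_resamples: int = 1000,
                 confidence: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of per-instance scores."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(scores)
    # Resample the instance scores with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - confidence) / 2.0
    low = means[int(alpha * n_resamples)]
    high = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return low, high


# Made-up per-instance scores, e.g. one score per evaluated example.
instance_scores = [0.42, 0.55, 0.61, 0.38, 0.70, 0.49, 0.58, 0.66]
global_score = mean(instance_scores)
low, high = bootstrap_ci(instance_scores)
print(f"score={global_score:.3f} ci=({low:.3f}, {high:.3f})")
```

Because each instance has its own score, resampling only re-averages numbers that were already computed, so the interval costs little compared with rescoring the whole dataset for every resample.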
Added ability to select demonstrations that depend on the specific instance (and not only random). See example in https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_demo_selections.py . This change causes some changes in selection of random demos due to seed changes, but should not have any aggregated effect beyond random fluctuations.
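To make the idea of instance-dependent demonstration selection concrete, here is a minimal sketch that picks the demos sharing the most words with the instance being evaluated. It is plain Python and does not use the unitxt sampler classes; the function name `select_demos_by_overlap`, the `question` field, and the word-overlap rule are assumptions for illustration. The linked example shows the actual usage.

```python
from typing import Dict, List


def select_demos_by_overlap(instance: Dict[str, str], pool: List[Dict[str, str]],
                            k: int, field: str = "question") -> List[Dict[str, str]]:
    """Pick the k pool items sharing the most words with the instance (hypothetical rule)."""
    target_words = set(instance[field].lower().split())

    def overlap(demo: Dict[str, str]) -> int:
        return len(target_words & set(demo[field].lower().split()))

    # sorted() is stable, so demos with equal overlap keep their pool order.
    return sorted(pool, key=overlap, reverse=True)[:k]


demo_pool = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many legs does a spider have?", "answer": "8"},
    {"question": "What is the capital of Italy?", "answer": "Rome"},
]
instance = {"question": "What is the capital of Spain?"}
demos = select_demos_by_overlap(instance, demo_pool, k=2)
```

Here the two "capital of" questions are chosen over the unrelated one, whereas a purely random sampler could return any pair from the pool.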
For LLM as Judges, the input sent to the judge is now displayed in the score field called 'judge_raw_input'
Support for the Arena Hard benchmark. See example: https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_model_using_arena_hard.py

Changed template method names to use "input_fields" and "reference_fields" (affects only people who wrote custom template code) by @yoavkatz in #1030
Refactor Rouge and Meteor to InstanceMetric for faster score computation - this causes very small variations in scores (well within the confidence interval) by @yoavkatz in #1011
Ability to create demo samplers based on instance (this causes changes in random selection of demos in normal mode) by @yoavkatz in #1034

Safety and regard metrics became instance metrics and are now named SafetyMetric and RegardMetric by @dafnapension in #1004
Remove financebench card since it was removed from HF by @elronbandel in #1016
add validation to tldr, remove shuffle from billsum by @alonh in #1038
Fix typo in japanese_llama system prompt (issue #964) by @bnayahu in #1056
numeric nlg dataset template changes by @ShirApp in #1041

Arena hard elad2 by @eladven and @OfirArviv in #1026
Add flores101 by @perlitz in #1053
Add metric "metrics.rag.retrieval_at_k" to catalog by @matanor in #1074
Add Finqa dataset by @ShirApp in #962
Allow rag context_id fields to be List[str] and not only List[int] by @perlitz in #1036
Rag end to end task support (in progress) - by @benjaminsznajder in #1044, #1080

Rename task fields "input"/"output" to "input_fields" and "reference_fields" by @luisaadanttas in #994
Support for ensemble metrics by @eladven in #1047
Additional inference parameters for openai and genai and simplified InferenceEngine API param passing by @pawelknes in #1019 and #1024
Real types in tasks and metrics by @elronbandel in #1045
Ability to create demo samplers based on instance by @yoavkatz in #1034
add judge input to the LLM as Judge metric scores by @OfirArviv in #1064

Solve problem with stripping format in LLM as a judge code by @eladven in #1005
Added seed to LLM as judges for consistent results by @yoavkatz in #1029
Fixed issues with fresh install by @yoavkatz in #1037
WML Inference Engine fix by @pawelknes in #1013
replace type and type in type error message by @perlitz in #1035
FinQA - filter problematic examples by @ShirApp in #1039
demo's target prefix is now taken from demo instance by @dafnapension in #1031
Make sure preparation times printed fully and nicely by @elronbandel in #1046
Added prediction type to llm as judge to avoid warning by @yoavkatz in #1072
Fixed confidence interval inconsistency when some metrics compute ci and some do not by @dafnapension in #1065
Fix bug in data classes and add support for field overriding in fields containing types or functions by @elronbandel in #1027
Set LoadFromIBMCloud verify to be lazy, in order to allow preparing the cards without defining FMEVAL_COS_URL by @eladven in #1021
Added check of type of format and system prompt to LLM as judge by @yoavkatz in #1068
Allow assigning None in overwrites when fetching artifacts with modifications by @dafnapension in #1062
fix - building test is not working. Updated Kaggle version. by @benjaminsznajder in #1055

Update error message and documentation on unitxt local and HF version conflict by @yoavkatz in #995
Update llm_as_judge.rst by @yoavkatz in #1085
Update introduction.rst add the word "a" before "variety" by @welisheva22 in #1015
Example improvements by @yoavkatz in #1022
Add a guide for using unitxt with lm-evaluation-harness by @elronbandel in #1020
Fix some docs titles and links by @elronbandel in #1023
Add example of meta evaluation of llm as judge by @yoavkatz in #1025
Update introduction.rst - - copy edits (grammar, consistency, clarity) by @welisheva22 in #1063
Added example for selection of demos by @yoavkatz in #1052

We want to thank the new contributors for their first contributions!