Skip to content

Releases: IBM/unitxt

Unitxt 1.13.0

25 Sep 06:37
80b284f
Compare
Choose a tag to compare

Unitxt 1.13.0 - Multi Modality and Types

New type handling capabilities

The most significant change in this release is the introduction of type serializers to unitxt.
Type serializers in charge of taking a specific type of data structure such as Table, or Dialog and serialize it to textual representation.
Now you can define tasks in unitxt that have complex types such as Table or Dialog and define serializers that handle their transformation to text.

This allows to control the representation of different types from the recipe api:

from unitxt import load_dataset
from unitxt.struct_data_operators import SerializeTableAsMarkdown

serializer = SerializeTableAsMarkdown(shuffle_rows=True, seed=0)
dataset = load_dataset(card="cards.wikitq", template_card_index=0, serializer=serializer)

And if you want to serialize this table differently you can change any of the many available table serializers.

Defining New Type

If you wish to define a new type with custom serializers you can do so by using python typing library:

from typing import Any, List, TypedDict

class Table(TypedDict):
    header: List[str]
    rows: List[List[Any]]

Once your type is ready you should register it to unitxt type handling within the code you are running:

from unitxt.type_utils import register_type

register_type(Table)

Now your type can be used anywhere across unitxt (e.g in task definition or serializers).

Defining a Serializer For a Type

If you want to define a serializer for your custom type or any typing type combination you can do so by:

class MySerizlizer(SingleTypeSerializer):
    serialized_type = Table
    def serialize(self, value: Table, instance: Dict[str, Any]) -> str:
        # your code to turn value of type Table to string

Multi-Modality

You now can process Image-Text to Text or Image-Audio to Text datasets in unitxt.
For example if you want to load the doc-vqa dataset you can do so by:

from unitxt import load_dataset

dataset = load_dataset(
    card="cards.doc_vqa.en",
    template="templates.qa.with_context.title",
    format="formats.models.llava_interleave",
    loader_limit=20,
)

Since we have data augmentation mechanisms it is just natural to use it for images. For example if you want your images in grey scale:

dataset = load_dataset(
    card="cards.doc_vqa.en",
    template="templates.qa.with_context.title",
    format="formats.models.llava_interleave",
    loader_limit=20,
    augmentor="augmentors.image.grey_scale", # <= Just like the text augmenters!
)

Then if you want to get the scores of a model on this dataset you can use:

from unitxt.inference import HFLlavaInferenceEngine
from unitxt.text_utils import print_dict
from unitxt import evaluate

inference_model = HFLlavaInferenceEngine(
    model_name="llava-hf/llava-interleave-qwen-0.5b-hf", max_new_tokens=32
)

test_dataset = dataset["test"].select(range(5))

predictions = inference_model.infer(test_dataset)
evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)

print_dict(
    evaluated_dataset[0],
    keys_to_print=["source", "media", "references", "processed_prediction", "score"],
)

Multi modality support in unitxt is building upon the type handling introduced in the previous section with two new types: Image and Audio.

What's Changed

New Contributors

Full Changelog: 1.12.4...1.13.0

1.12.4

28 Aug 13:17
1a97cce
Compare
Choose a tag to compare

Main changes

Additions to catalog

New Features

  • Add LLM as judge ensemble metrics, and add LLMaaJ ensemble example by @pvn25 in #1081
  • Refactor RenameFields operator to Rename. The old operator is still supported but raises a deprecation warning by @elronbandel in #1123

Bug Fixes

Documentation changes

  • Update llm_as_judge.py --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1164
  • Update formats.py --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1163
  • Update loaders.py --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1162
  • Update card.py - minor documentation changes by @welisheva22 in #1161
  • Update adding_dataset.rst - a few more minor documentation changes by @welisheva22 in #1160
  • Update artifact.py --- documentation edits (grammar, consistency, cla… by @welisheva22 in #1159
  • Update glossary.rst --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1155
  • Update helm.rst --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1154
  • Update operators.py --- copy edits (grammar, consistency, clarity) - take 2 by @welisheva22 in #1158
  • Docfix: Fix typo in Installation doc by @yifanmai in #1181

New Contributors

1.12.3

15 Aug 23:01
8fd91be
Compare
Choose a tag to compare

Main changes

Non backward compatible changes in catalog

  • change rag metrics name convention (e.g. "metrics.rag.mrr" -> "metrics.rag.context_correctness.mrr",) - catalog non backward compatible change by @assaftibm in #1104
  • Update summarization task and templates to support multiple reference summaries - by @yoavkatz in #1126
  • Fix belebele due to new convention by @elronbandel in #1145

Additions to catalog

  • Add DeepSeek-Coder format and system prompt by @oktie in #1105
  • Add a metric to calculate the ratio of references included in the prediction by @marukaz in #1091
  • adding RAG bge metrics by @assaftibm

New Features

  • Add option to run multiple templates and or num_demos in single dataset recipe. Now it is possible to give a list of templates or num_demos. Unitxt will randomly sample from the templates and for each instance assign a random template from the list. by @elronbandel in #1110
  • A warning is now generated when a metric generate a score with the same name as that of another metric and overwrites it @dafnapension in #1124
  • MetricPipeline fields postpreprocess_steps has been renamed to postprocess_steps. The old field (postpreprocess_steps) still exists for backward compatible but depricated. by @dafnapension in #1117
  • Decrease runtime of demo examples
  • Add tests for RAG metrics by @matanor
  • Adding dedicated Unitxt warning and error classes to link online documentation by @yoavkatz in
  • The code now uses a central controllable deepcopy function by @elronbandel in #1120

Bug Fixes

  • Create a dedicated nltk a mixin, for downloading all versions of punkt which needed by metrics code. by @elronbandel in #1151
  • For bulk instance metrics, Replace mean function with nanmean to support aggregation in case of nan scores. by @elronbandel in #1150
  • Fix helm test by @elronbandel in #1109
  • Fix bug with RAG metrics: Fix use of minilm model by @assaftibm in #1115
  • Fix data classification of WML model to include 'public' classification by @yoavkatz in #1118
  • Fix WMLInferenceEngine by @pawelknes in #1122
  • Fix belebele HF path due to new convention by @elronbandel in #1145

Documentation changes

New Contributors

Unitxt 1.12.2

31 Jul 14:46
ce2992c
Compare
Choose a tag to compare

Main changes

Non backward compatible changes

  • changed method template names "input_fields" and "reference_ fields" (effects only people who wrote custom templates code) by @yoavkatz in #1030
  • Refactor Rouge and Meteor to InstanceMetric for faster score computation - this cause very small variances in scores (well within the confidence internal) by @yoavkatz in #1011
  • Ability to create demo samplers based on instance (this causes changes in random selection of demos in normal mode) by @yoavkatz in #1034

Changes in Catalog

  • safety and regard metrics became instance metrics and named SafetyMetric and RegardMetric by @dafnapension in #1004
  • Remove financebench card since it was removed from HF by @elronbandel in #1016
  • add validation to tldr, remove shuffle from billsum by @alonh in #1038
  • Fix typo in japanese_llama system prompt (issue #964) by @bnayahu in #1056
  • numeric nlg dataset template changes by @ShirApp in #1041

Additions to catalog

New Features

Bug Fixes

  • Solve problem with striping format at LLM as a judge code. by @eladven in #1005
  • Added seed to LLM as judges for consistent results by @yoavkatz in #1029
  • Fixed issues with fresh install by @yoavkatz in #1037
  • WML Inference Engine fix by @pawelknes in #1013
  • replace type and type in type error message by @perlitz in #1035
  • FinQA - filter problematic examples by @ShirApp in #1039
  • demo's target prefix is now taken from demo instance by @dafnapension in #1031
  • Make sure preparation times printed fully and nicely by @elronbandel in #1046
  • Added prediction type to llm as jusdge to avoid warning by @yoavkatz in #1072
  • Fixed confidence interval inconsistency when some metrics compute ci and some do not by @dafnapension in #1065
  • Fix bug in data classes and add support for field overriding in fields containing types or functions by @elronbandel in #1027
  • Set LoadFromIBMCloud verify to be lazy, in order to allow preparing the cards without define FMEVAL_COS_URL by @eladven in #1021
  • Added check of type of format and system prompt to LLM as judge by @yoavkatz in #1068
  • Allow assigning None in overwrites when fetching artifacts with modifications by @dafnapension in #1062
  • fix - building test is not working. Updated Kaggle version. by @benjaminsznajder in #1055

Documentation changes


New Contributors

We want to thank the new contributors for their first contributions!

Unitxt 1.12.0

31 Jul 12:25
Compare
Choose a tag to compare

Main changes

Non backward compatible changes

  • changed method template names "input_fields" and "reference_ fields" (effects only people who wrote custom templates code) by @yoavkatz in #1030
  • Refactor Rouge and Meteor to InstanceMetric for faster score computation - this cause very small variances in scores (well within the confidence internal) by @yoavkatz in #1011
  • Ability to create demo samplers based on instance (this causes changes in random selection of demos in normal mode) by @yoavkatz in #1034

Changes in Catalog

  • safety and regard metrics became instance metrics and named SafetyMetric and RegardMetric by @dafnapension in #1004
  • Remove financebench card since it was removed from HF by @elronbandel in #1016
  • add validation to tldr, remove shuffle from billsum by @alonh in #1038
  • Fix typo in japanese_llama system prompt (issue #964) by @bnayahu in #1056
  • numeric nlg dataset template changes by @ShirApp in #1041

Additions to catalog

New Features

Bug Fixes

  • Solve problem with striping format at LLM as a judge code. by @eladven in #1005
  • Added seed to LLM as judges for consistent results by @yoavkatz in #1029
  • Fixed issues with fresh install by @yoavkatz in #1037
  • WML Inference Engine fix by @pawelknes in #1013
  • replace type and type in type error message by @perlitz in #1035
  • FinQA - filter problematic examples by @ShirApp in #1039
  • demo's target prefix is now taken from demo instance by @dafnapension in #1031
  • Make sure preparation times printed fully and nicely by @elronbandel in #1046
  • Added prediction type to llm as jusdge to avoid warning by @yoavkatz in #1072
  • Fixed confidence interval inconsistency when some metrics compute ci and some do not by @dafnapension in #1065
  • Fix bug in data classes and add support for field overriding in fields containing types or functions by @elronbandel in #1027
  • Set LoadFromIBMCloud verify to be lazy, in order to allow preparing the cards without define FMEVAL_COS_URL by @eladven in #1021
  • Added check of type of format and system prompt to LLM as judge by @yoavkatz in #1068
  • Allow assigning None in overwrites when fetching artifacts with modifications by @dafnapension in #1062
  • fix - building test is not working. Updated Kaggle version. by @benjaminsznajder in #1055

Documentation changes


New Contributors

We want to thank the new contributors for their first contributions!

1.11.1

08 Jul 05:52
b23fb42
Compare
Choose a tag to compare

Non backward compatible changes

  • The class InputOutputTemplate has the field input_format. This field becomes a required field. It means that templates should explicitly set their value to None if not using it. by @elronbandel in #982
  • fix MRR RAG metric - fix MRR wiring, allow the context_ids to be a list of strings, instead of a list[list[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7. added corresponding tests. This change may change the scores of MRR metric. by @matanor in

New Features

  • Add the option to specify the number of processes to use for parallel dataset loading by @csrajmohan in #974
  • Add option for lazy load hf inference engine by @elronbandel in #980
  • Added a format based on Huggingface format by @yoavkatz in #988

New Assets

  • Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956

Bug Fixes

Documentation

  • Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981
  • Improve the examples table documentation by @eladven in #976

Refactoring

Testing and CI/CD

New Contributors

Full Changelog: 1.10.1...1.10.2

1.11.0 (#996)

07 Jul 11:32
306fc50
Compare
Choose a tag to compare

Non backward compatible changes

  • The class InputOutputTemplate has the field input_format. This field becomes a required field. It means that templates should explicitly set their value to None if not using it. by @elronbandel in #982
  • fix MRR RAG metric - fix MRR wiring, allow the context_ids to be a list of strings, instead of a list[list[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7. added corresponding tests. This change may change the scores of MRR metric. by @matanor in

New Features

  • Add the option to specify the number of processes to use for parallel dataset loading by @csrajmohan in #974
  • Add option for lazy load hf inference engine by @elronbandel in #980
  • Added a format based on Huggingface format by @yoavkatz in #988

New Assets

  • Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956

Bug Fixes

Documentation

  • Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981
  • Improve the examples table documentation by @eladven in #976

Refactoring

Testing and CI/CD

New Contributors

Full Changelog: 1.10.1...1.10.2

1.10.3

04 Jul 08:23
Compare
Choose a tag to compare

Non backward compatible changes

  • The class InputOutputTemplate has the field input_format. This field becomes a required field. It means that templates should explicitly set their value to None if not using it. by @elronbandel in #982
  • fix MRR RAG metric - fix MRR wiring, allow the context_ids to be a list of strings, instead of a list[list[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7. added corresponding tests. This change may change the scores of MRR metric. by @matanor in

New Features

  • Add the option to specify the number of processes to use for parallel dataset loading by @csrajmohan in #974
  • Add option for lazy load hf inference engine by @elronbandel in #980
  • Added a format based on Huggingface format by @yoavkatz in #988

New Assets

  • Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956

Bug Fixes

Documentation

  • Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981
  • Improve the examples table documentation by @eladven in #976

Refactoring

Testing and CI/CD

New Contributors

Full Changelog: 1.10.1...1.10.2

1.10.2

04 Jul 06:17
97243ad
Compare
Choose a tag to compare

Non backward compatible changes

  • None - this release if fully compatible with the previous release.

New Features

  • added num_proc parameter - Optional integer to specify the number of processes to use for parallel dataset loading by @csrajmohan in #974
  • Add option to lazy load hf inference engine and fix requirements mechanism by @elronbandel in #980
  • Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956
  • Add metrics: domesticated safety and regard by @dafnapension in #983
  • Make input_format required field in InputOutputTemplate by @elronbandel in #982
  • Added a format based on Huggingface format by @yoavkatz in #988

Bug Fixes

  • Fix the error at the examples table by @eladven in #976
  • fix MRR RAG metric - fix MRR wiring, allow the context_ids to be a list of strings, instead of a list[list[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7. added corresponding tests. by @matanor in #969
  • Fix llama_3_ibm_genai_generic_template by @lga-zurich in #978

Documentation

  • Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981

Refactoring

Testing and CI/CD

New Contributors

Full Changelog: 1.10.1...1.10.2

1.10.1

01 Jul 08:04
59b0a62
Compare
Choose a tag to compare

Main Changes

  • Continued with major improvements to the documentation including a new code examples section with standalone python code that shows how to perform evaluation, add new datasets, compare formats, use LLM as judges , and more. Cards for datasets from huggingface have detailed descriptions. New documentation of RAG tasks and metrics.
  • load_dataset can now load cards defined in a python file (and not only in the catalog). See example.
  • The evaluation results returned from evaluate now include two fields predictions and processed_predictions. See example.
  • The fields can have defaults, so if they are not specified in the card, they get a default value. For example, multi-class classification has text as the default text_type. See example.

Non backward compatible changes

You need to recreate the any cards/metrics you added by running prepare//.py file. You can create all cards simply by running python utils/prepare_all_artifacts.py . This will avoid the type error.

The AddFields operator was renamed Set and CopyFields operator was renamed Copy. Note previous code should continue to work, but we renamed all existing code in the unitxt and fm-eval repos.

New Features

Bug Fixes

Documentation

New Assets

Testing and CI/CD

New Contributors

Full Changelog: 1.10.0...1.10.1