Chore/rename task fields (#994)
* chore: Rename Task inputs and outputs fields

Signed-off-by: luisaadanttas <maria.luisa.dantas@ccc.ufcg.edu.br>

* docs: Rename Task inputs and outputs fields

Signed-off-by: luisaadanttas <maria.luisa.dantas@ccc.ufcg.edu.br>

* chore: update remaining input_fields and reference_fields in Tasks

Signed-off-by: luisaadanttas <maria.luisa.dantas@ccc.ufcg.edu.br>

* refactor: handle deprecated input/output fields and add prepare method for compatibility

Signed-off-by: luisaadanttas <maria.luisa.dantas@ccc.ufcg.edu.br>

* test: add tests for deprecated inputs/outputs and conflicting fields in Task

Signed-off-by: luisaadanttas <maria.luisa.dantas@ccc.ufcg.edu.br>

* test: update tests for task initialization with detailed field checks

Signed-off-by: luisaadanttas <maria.luisa.dantas@ccc.ufcg.edu.br>

* refactor: separate checks for input_fields and reference_fields

Signed-off-by: luisaadanttas <maria.luisa.dantas@ccc.ufcg.edu.br>

* fix: update field names in atta_q, attaq_500, and bold cards

Signed-off-by: luisaadanttas <maria.luisa.dantas@ccc.ufcg.edu.br>

---------

Signed-off-by: luisaadanttas <maria.luisa.dantas@ccc.ufcg.edu.br>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
luisaadanttas and yoavkatz committed Jul 17, 2024
1 parent 9ef0db7 commit 40d0a96
Showing 108 changed files with 478 additions and 334 deletions.
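One of the commits squashed above, "refactor: handle deprecated input/output fields and add prepare method for compatibility", implies that the old keyword arguments keep working for a deprecation period. The actual unitxt implementation is not part of this view; the following is only a minimal sketch of how such a compatibility shim could behave, with all names and details being assumptions rather than the library's code.

import warnings
from typing import Dict, List, Optional, Union

# Field specifications appear in this commit both as dicts ({"name": "type"}) and as plain lists.
FieldSpec = Union[Dict[str, str], List[str]]


class TaskSketch:
    """Hypothetical stand-in for unitxt's Task, showing one way to honor the deprecated names."""

    def __init__(
        self,
        input_fields: Optional[FieldSpec] = None,
        reference_fields: Optional[FieldSpec] = None,
        inputs: Optional[FieldSpec] = None,   # deprecated alias for input_fields
        outputs: Optional[FieldSpec] = None,  # deprecated alias for reference_fields
    ):
        # Reject conflicting definitions, as the "conflicting fields" tests above suggest.
        if inputs is not None and input_fields is not None:
            raise ValueError("Pass either 'input_fields' or the deprecated 'inputs', not both.")
        if outputs is not None and reference_fields is not None:
            raise ValueError("Pass either 'reference_fields' or the deprecated 'outputs', not both.")

        # Map the deprecated names onto the new ones with a warning.
        if inputs is not None:
            warnings.warn("'inputs' is deprecated; use 'input_fields' instead.", DeprecationWarning)
            input_fields = inputs
        if outputs is not None:
            warnings.warn("'outputs' is deprecated; use 'reference_fields' instead.", DeprecationWarning)
            reference_fields = outputs

        self.input_fields = input_fields
        self.reference_fields = reference_fields

The "test: add tests for deprecated inputs/outputs and conflicting fields in Task" commit points at exactly these two behaviors: the old names still work (with a warning), while mixing an old name with its new counterpart is an error.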
8 changes: 4 additions & 4 deletions docs/docs/adding_dataset.rst
@@ -29,8 +29,8 @@ an English to French translation task or for a French to English translation task

The Task schema is a formal definition of the NLP task, including its inputs, outputs, and default evaluation metrics.

-The `inputs` of the task are a set of fields that are used to format the textual input to the model.
-The `output` of the task are a set of fields that are used to format the textual expected output from the model (gold references).
+The `input_fields` of the task are a set of fields that are used to format the textual input to the model.
+The `reference_fields` of the task are a set of fields that are used to format the textual expected output from the model (gold references).
The `metrics` of the task are a set of default metrics to be used to evaluate the outputs of the model.

While language models generate textual predictions, the metrics often evaluate on different datatypes. For example,
@@ -46,8 +46,8 @@ We will use the `bleu` metric for a reference based evaluation.
.. code-block:: python
task=Task(
-inputs= { "text" : "str", "source_language" : "str", "target_language" : "str"},
-outputs= {"translation" : "str"},
+input_fields= { "text" : "str", "source_language" : "str", "target_language" : "str"},
+reference_fields= {"translation" : "str"},
prediction_type="str",
metrics=["metrics.bleu"],
),
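The `input_fields`/`reference_fields` split above only declares the task's data contract; in unitxt it is a template that renders the input fields into the model prompt and the reference fields into gold references. As a rough illustration, the translation task from this hunk might be paired with a template along these lines (the template wording is invented for this sketch and is not part of the commit; InputOutputTemplate is used here as I understand unitxt's template API):

from unitxt.blocks import Task
from unitxt.templates import InputOutputTemplate

# The renamed task from the documentation above: fields, prediction type, and default metric.
task = Task(
    input_fields={"text": "str", "source_language": "str", "target_language": "str"},
    reference_fields={"translation": "str"},
    prediction_type="str",
    metrics=["metrics.bleu"],
)

# Illustrative template: input_fields fill the prompt, reference_fields fill the gold output
# that metrics.bleu compares the model's prediction against.
template = InputOutputTemplate(
    input_format="Translate the following text from {source_language} to {target_language}: {text}",
    output_format="{translation}",
)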
4 changes: 2 additions & 2 deletions docs/docs/adding_metric.rst
@@ -19,8 +19,8 @@ For example:

.. code-block:: python
task = Task(
-inputs={ "question" : "str" },
-outputs={ "answer" : str },
+input_fields={ "question" : "str" },
+reference_fields={ "answer" : str },
prediction_type="str",
metrics=[
"metrics.rouge",
8 changes: 4 additions & 4 deletions docs/docs/adding_task.rst
@@ -13,8 +13,8 @@ Tasks are fundamental to Unitxt, acting as standardized interface for integrating

The Task schema is a formal definition of the NLP task, including its inputs, outputs, and default evaluation metrics.

-The `inputs` of the task are a set of fields that are used to format the textual input to the model.
-The `output` of the task are a set of fields that are used to format the expected textual output from the model (gold references).
+The `input_fields` of the task are a set of fields that are used to format the textual input to the model.
+The `reference_fields` of the task are a set of fields that are used to format the expected textual output from the model (gold references).
The `metrics` of the task are a set of default metrics to be used to evaluate the outputs of the model.

As an example, consider an evaluation task for LLMs to evaluate how well they are able to calculate the sum of two integer numbers.
@@ -25,8 +25,8 @@ The task is formally defined as:
from unitxt.blocks import Task
task = Task(
-inputs={"num1" : "int", "num2" : "int"},
-outputs={"sum" : "int"},
+input_fields={"num1" : "int", "num2" : "int"},
+reference_fields={"sum" : "int"},
prediction_type="int",
metrics=[
"metrics.sum_accuracy",
4 changes: 2 additions & 2 deletions examples/standalone_evaluation_llm_as_judge.py
@@ -56,8 +56,8 @@
card = TaskCard(
loader=LoadFromDictionary(data=data),
task=Task(
-inputs={"question": "str"},
-outputs={"answer": "str"},
+input_fields={"question": "str"},
+reference_fields={"answer": "str"},
prediction_type="str",
metrics=[llm_judge_metric],
),
4 changes: 2 additions & 2 deletions examples/standalone_qa_evaluation.py
@@ -24,8 +24,8 @@
loader=LoadFromDictionary(data=data),
# Define the QA task input and output and metrics.
task=Task(
-inputs={"question": "str"},
-outputs={"answer": "str"},
+input_fields={"question": "str"},
+reference_fields={"answer": "str"},
prediction_type="str",
metrics=["metrics.accuracy"],
),
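For orientation, a card like the one above is typically exercised end to end by building a dataset from it and scoring predictions against the reference fields. The sketch below assumes unitxt's load_dataset/evaluate API and a TemplatesDict entry named "simple"; the data values and template wording are placeholders, not taken from this diff:

from unitxt.api import evaluate, load_dataset
from unitxt.blocks import Task, TaskCard
from unitxt.loaders import LoadFromDictionary
from unitxt.templates import InputOutputTemplate, TemplatesDict

# Placeholder QA data in the shape the card's loader expects.
data = {
    "test": [
        {"question": "What is the capital of Texas?", "answer": "Austin"},
        {"question": "What is the color of the sky?", "answer": "Blue"},
    ]
}

card = TaskCard(
    loader=LoadFromDictionary(data=data),
    task=Task(
        input_fields={"question": "str"},
        reference_fields={"answer": "str"},
        prediction_type="str",
        metrics=["metrics.accuracy"],
    ),
    templates=TemplatesDict(
        {
            "simple": InputOutputTemplate(
                input_format="Answer the following question: {question}",
                output_format="{answer}",
            )
        }
    ),
)

# Build the dataset from the card, then score a set of model predictions against it.
dataset = load_dataset(card=card, template_card_index="simple")
predictions = ["Austin", "Gray"]
results = evaluate(predictions=predictions, data=dataset["test"])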
4 changes: 3 additions & 1 deletion prepare/cards/atta_q.py
@@ -23,7 +23,9 @@
DumpJson(field="input_label"),
],
task=Task(
-inputs=["input"], outputs=["input_label"], metrics=["metrics.safety_metric"]
+input_fields=["input"],
+reference_fields=["input_label"],
+metrics=["metrics.safety_metric"],
),
templates=TemplatesList(
[
4 changes: 3 additions & 1 deletion prepare/cards/attaq_500.py
@@ -527,7 +527,9 @@
DumpJson(field="input_label"),
],
task=Task(
-inputs=["input"], outputs=["input_label"], metrics=["metrics.safety_metric"]
+input_fields=["input"],
+reference_fields=["input_label"],
+metrics=["metrics.safety_metric"],
),
templates=TemplatesList(
[
4 changes: 2 additions & 2 deletions prepare/cards/bold.py
@@ -35,8 +35,8 @@
DumpJson(field="input_label"),
],
task=Task(
-inputs=["first_prompt"],
-outputs=["input_label"],
+input_fields=["first_prompt"],
+reference_fields=["input_label"],
metrics=["metrics.regard_metric"],
),
templates=TemplatesList(
4 changes: 2 additions & 2 deletions prepare/cards/human_eval.py
@@ -26,8 +26,8 @@
)
],
task=Task(
-inputs=["prompt"],
-outputs=["prompt", "canonical_solution", "test_list"],
+input_fields=["prompt"],
+reference_fields=["prompt", "canonical_solution", "test_list"],
metrics=["metrics.bleu"],
),
templates=TemplatesList(
4 changes: 2 additions & 2 deletions prepare/cards/mbpp.py
@@ -17,8 +17,8 @@
JoinStr(field_to_field={"test_list": "test_list_str"}, separator=os.linesep),
],
task=Task(
-inputs=["text", "test_list_str"],
-outputs=["test_list", "code"],
+input_fields=["text", "test_list_str"],
+reference_fields=["test_list", "code"],
metrics=["metrics.bleu"],
),
templates=TemplatesList(
4 changes: 2 additions & 2 deletions prepare/cards/mrpc.py
@@ -31,8 +31,8 @@
),
],
task=Task(
-inputs=["choices", "sentence1", "sentence2"],
-outputs=["label"],
+input_fields=["choices", "sentence1", "sentence2"],
+reference_fields=["label"],
metrics=["metrics.accuracy"],
),
templates=TemplatesList(
4 changes: 2 additions & 2 deletions prepare/cards/pop_qa.py
@@ -17,8 +17,8 @@
LoadJson(field="possible_answers"),
],
task=Task(
-inputs=["question", "prop", "subj"],
-outputs=["possible_answers"],
+input_fields=["question", "prop", "subj"],
+reference_fields=["possible_answers"],
metrics=["metrics.accuracy"],
),
templates=TemplatesList(
4 changes: 2 additions & 2 deletions prepare/cards/qqp.py
@@ -24,8 +24,8 @@
),
],
task=Task(
-inputs=["choices", "question1", "question2"],
-outputs=["label"],
+input_fields=["choices", "question1", "question2"],
+reference_fields=["label"],
metrics=["metrics.accuracy"],
),
templates=TemplatesList(
4 changes: 2 additions & 2 deletions prepare/cards/wsc.py
@@ -22,8 +22,8 @@
),
],
task=Task(
-inputs=["choices", "text", "span1_text", "span2_text"],
-outputs=["label"],
+input_fields=["choices", "text", "span1_text", "span2_text"],
+reference_fields=["label"],
metrics=["metrics.accuracy"],
),
templates=TemplatesList(
10 changes: 5 additions & 5 deletions prepare/operators/balancers/per_task.py
@@ -5,27 +5,27 @@
MinimumOneExamplePerLabelRefiner,
)

-balancer = DeterministicBalancer(fields=["outputs/label"])
+balancer = DeterministicBalancer(fields=["reference_fields/label"])

add_to_catalog(balancer, "operators.balancers.classification.by_label", overwrite=True)

-balancer = DeterministicBalancer(fields=["outputs/answer"])
+balancer = DeterministicBalancer(fields=["reference_fields/answer"])

add_to_catalog(balancer, "operators.balancers.qa.by_answer", overwrite=True)

-balancer = LengthBalancer(fields=["outputs/labels"], segments_boundaries=[1])
+balancer = LengthBalancer(fields=["reference_fields/labels"], segments_boundaries=[1])

add_to_catalog(
balancer, "operators.balancers.multi_label.zero_vs_many_labels", overwrite=True
)

-balancer = LengthBalancer(fields=["outputs/labels"], segments_boundaries=[1])
+balancer = LengthBalancer(fields=["reference_fields/labels"], segments_boundaries=[1])

add_to_catalog(
balancer, "operators.balancers.ner.zero_vs_many_entities", overwrite=True
)

-balancer = MinimumOneExamplePerLabelRefiner(fields=["outputs/label"])
+balancer = MinimumOneExamplePerLabelRefiner(fields=["reference_fields/label"])

add_to_catalog(
balancer,
28 changes: 14 additions & 14 deletions prepare/tasks/classification.py
@@ -3,8 +3,8 @@

add_to_catalog(
Task(
-inputs={"text": "str", "text_type": "str", "class": "str"},
-outputs={"class": "str", "label": "List[str]"},
+input_fields={"text": "str", "text_type": "str", "class": "str"},
+reference_fields={"class": "str", "label": "List[str]"},
prediction_type="List[str]",
metrics=[
"metrics.f1_micro_multi_label",
@@ -20,8 +20,8 @@

add_to_catalog(
Task(
-inputs={"text": "str", "text_type": "str", "class": "str"},
-outputs={"class": "str", "label": "int"},
+input_fields={"text": "str", "text_type": "str", "class": "str"},
+reference_fields={"class": "str", "label": "int"},
prediction_type="float",
metrics=[
"metrics.accuracy",
@@ -36,13 +36,13 @@

add_to_catalog(
Task(
-inputs={
+input_fields={
"text": "str",
"text_type": "str",
"classes": "List[str]",
"type_of_classes": "str",
},
-outputs={"labels": "List[str]"},
+reference_fields={"labels": "List[str]"},
prediction_type="List[str]",
metrics=[
"metrics.f1_micro_multi_label",
@@ -58,13 +58,13 @@

add_to_catalog(
Task(
-inputs={
+input_fields={
"text": "str",
"text_type": "str",
"classes": "List[str]",
"type_of_class": "str",
},
-outputs={"label": "str"},
+reference_fields={"label": "str"},
prediction_type="str",
metrics=["metrics.f1_micro", "metrics.accuracy", "metrics.f1_macro"],
augmentable_inputs=["text"],
@@ -76,15 +76,15 @@

add_to_catalog(
Task(
-inputs={
+input_fields={
"text_a": "str",
"text_a_type": "str",
"text_b": "str",
"text_b_type": "str",
"classes": "List[str]",
"type_of_relation": "str",
},
-outputs={"label": "str"},
+reference_fields={"label": "str"},
prediction_type="str",
metrics=["metrics.f1_micro", "metrics.accuracy", "metrics.f1_macro"],
augmentable_inputs=["text_a", "text_b"],
@@ -97,14 +97,14 @@

add_to_catalog(
Task(
-inputs={
+input_fields={
"text": "str",
"text_type": "str",
"classes": "List[str]",
"type_of_class": "str",
"classes_descriptions": "str",
},
-outputs={"label": "str"},
+reference_fields={"label": "str"},
prediction_type="str",
metrics=["metrics.f1_micro", "metrics.accuracy", "metrics.f1_macro"],
augmentable_inputs=["text"],
@@ -116,13 +116,13 @@

add_to_catalog(
Task(
-inputs={
+input_fields={
"text": "str",
"text_type": "str",
"classes": "List[str]",
"type_of_class": "str",
},
-outputs={"label": "str"},
+reference_fields={"label": "str"},
prediction_type="str",
metrics=["metrics.f1_micro", "metrics.accuracy", "metrics.f1_macro"],
augmentable_inputs=["text"],
20 changes: 14 additions & 6 deletions prepare/tasks/completion/multiple_choice.py
@@ -3,8 +3,8 @@

add_to_catalog(
Task(
-inputs={"context": "str", "context_type": "str", "choices": "List[str]"},
-outputs={"answer": "int", "choices": "List[str]"},
+input_fields={"context": "str", "context_type": "str", "choices": "List[str]"},
+reference_fields={"answer": "int", "choices": "List[str]"},
prediction_type="Any",
metrics=["metrics.accuracy"],
),
@@ -14,8 +14,12 @@

add_to_catalog(
Task(
-inputs={"context": "str", "context_type": "str", "completion_type": "str"},
-outputs={"completion": "str"},
+input_fields={
+"context": "str",
+"context_type": "str",
+"completion_type": "str",
+},
+reference_fields={"completion": "str"},
prediction_type="str",
metrics=["metrics.rouge"],
),
@@ -25,8 +29,12 @@

add_to_catalog(
Task(
-inputs={"context": "str", "context_type": "str", "completion_type": "str"},
-outputs={"completion": "str"},
+input_fields={
+"context": "str",
+"context_type": "str",
+"completion_type": "str",
+},
+reference_fields={"completion": "str"},
prediction_type="Dict[str,Any]",
metrics=["metrics.squad"],
),
4 changes: 2 additions & 2 deletions prepare/tasks/evaluation.py
@@ -3,8 +3,8 @@

add_to_catalog(
Task(
-inputs=["input", "input_type", "output_type", "choices", "instruction"],
-outputs=["choices", "output_choice"],
+input_fields=["input", "input_type", "output_type", "choices", "instruction"],
+reference_fields=["choices", "output_choice"],
metrics=[
"metrics.accuracy",
],
4 changes: 2 additions & 2 deletions prepare/tasks/generation.py
@@ -3,8 +3,8 @@

add_to_catalog(
Task(
-inputs={"input": "str", "type_of_input": "str", "type_of_output": "str"},
-outputs={"output": "str"},
+input_fields={"input": "str", "type_of_input": "str", "type_of_output": "str"},
+reference_fields={"output": "str"},
prediction_type="str",
metrics=["metrics.normalized_sacrebleu"],
augmentable_inputs=["input"],
4 changes: 2 additions & 2 deletions prepare/tasks/grammatical_error_correction.py
@@ -3,8 +3,8 @@

add_to_catalog(
Task(
-inputs=["original_text"],
-outputs=["corrected_texts"],
+input_fields=["original_text"],
+reference_fields=["corrected_texts"],
metrics=[
"metrics.char_edit_dist_accuracy",
"metrics.rouge",
4 changes: 2 additions & 2 deletions prepare/tasks/language_identification.py
@@ -3,8 +3,8 @@

add_to_catalog(
Task(
-inputs={"text": "str"},
-outputs={"label": "str"},
+input_fields={"text": "str"},
+reference_fields={"label": "str"},
prediction_type="str",
metrics=["metrics.accuracy"],
),
(Diffs for the remaining changed files are not shown in this view.)

