diff --git a/.github/workflows/catalog_consistency.yml b/.github/workflows/catalog_consistency.yml index aeae12681..3c4f92225 100644 --- a/.github/workflows/catalog_consistency.yml +++ b/.github/workflows/catalog_consistency.yml @@ -12,13 +12,15 @@ jobs: runs-on: ubuntu-latest env: OS: ubuntu-latest + GENAI_KEY: "dummy" + UNITXT_ALLOW_PASSING_DATA_TO_REMOTE_API: "True" steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v5 with: - python-version: '3.8' + python-version: '3.9' cache: 'pip' # caching pip dependencies - run: pip install -r requirements/base.rqr - run: pip install -r requirements/tests.rqr diff --git a/.github/workflows/catalog_preparation.yml b/.github/workflows/catalog_preparation.yml index 95271dd6c..657fc4ead 100644 --- a/.github/workflows/catalog_preparation.yml +++ b/.github/workflows/catalog_preparation.yml @@ -12,13 +12,15 @@ jobs: runs-on: ubuntu-latest env: OS: ubuntu-latest + GENAI_KEY: "dummy" + UNITXT_ALLOW_PASSING_DATA_TO_REMOTE_API: "True" steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v5 with: - python-version: '3.8' + python-version: '3.9' cache: 'pip' # caching pip dependencies - run: pip install -r requirements/base.rqr - run: pip install -r requirements/tests.rqr diff --git a/.github/workflows/library_tests.yml b/.github/workflows/library_tests.yml index 464eb317c..df6d43077 100644 --- a/.github/workflows/library_tests.yml +++ b/.github/workflows/library_tests.yml @@ -18,7 +18,7 @@ jobs: - uses: actions/setup-python@v5 with: - python-version: '3.8' + python-version: '3.9' cache: 'pip' # caching pip dependencies - run: pip install -r requirements/base.rqr - run: pip install -r requirements/tests.rqr diff --git a/docs/docs/llm_as_judge.rst b/docs/docs/llm_as_judge.rst index 91c4ed35b..858adc48c 100644 --- a/docs/docs/llm_as_judge.rst +++ b/docs/docs/llm_as_judge.rst @@ -2,128 +2,367 @@ .. note:: - To use this tutorial, you need to :ref:`install unitxt `. + To follow this tutorial, ensure you have :ref:`unitxt installed `. ===================================== -LLM As a Judge Metrics ✨ +LLM as a Judge Metrics Guide 📊 ===================================== -In this section you learn how to use LLM as judge metric by unitxt. LLM as a judge is a method for evaluation the -performance of a model based on the output of another model. +This section will walk you through harnessing the power of LLM as judge (LLMaJ) metrics using the Unitxt package. LLM as a judge +provides a method to assess the performance of a model based on the judgments of another model. -Using LLM As a Judge in unitxt ----------------------------- -Using LLM as a judge is extremely simple in unitxt. You should simply choose llm as a judge metric, and unitxt will do the rest... +In this guide, we'll explore three key aspects of LLMaJ: + 1. Utilizing LLM as judge as a metric in Unitxt. + 2. Incorporating a new LLM as a judge metric into Unitxt. + 3. Assessing the quality of an LLM as a judge metric. -The Unitxt catalog includes a collection of preexisting LLM as judges that can be used like any other -metric. +But first, let's start with an overview: -To specify an LLM as judge metric, you can specify it in the dataset or in the recipe. For example: +Overview +--------- -.. code-block:: python +An LLM as a Judge metric consists of several essential components: + +1. The judge model, such as *Llama-3-8B-Instruct* or *gpt-3.5-turbo*, which evaluates the performance of other models. +2. 
The platform responsible for executing the judge model, such as Huggingface or the OpenAI API.
+3. The template used to construct prompts for the judge model. This template should be reflective of the judgment needed and usually incorporates both the input and output of the evaluated model. For instance:
+
+    .. code-block:: text
+
+        Please rate the clarity, coherence, and informativeness of the following summary on a scale of 1 to 10\\n Full text: {model_input}\\nSummary: {model_output}
+
+4. The format in which the judge model expects to receive prompts. For example:
-    card=cards.almost_evil,template=templates.qa.open.simple,metrics=[metrics.rag.model_response_assessment.llm_as_judge_by_flan_t5_large_on_hf_pipeline_using_mt_bench_template]",
+    .. code-block:: text
+        {input}
+5. Optionally, a system prompt to pass to the judge model. This can provide additional context for evaluation.
-Adding new LLM As a Judge metric:
-----------------------------
+Understanding these components is crucial for effectively leveraging LLM as a judge metrics. With this foundation, let's examine how to utilize and create these metrics in the Unitxt package.
-For a classical code-based metric (like F1, Rouge), the general evaluation flow is the following:
-    1. load the dataset using a unitxt recipe (e.g. "cards.sst2")
+Using LLM as a Judge in Unitxt
+-------------------------------
+Employing a pre-defined LLM as a judge metric is effortlessly achieved within Unitxt.
-    2. use inference module to infer based on the dataset inputs.
+The Unitxt catalog boasts a variety of preexisting LLM as judges that seamlessly integrate into your workflow.
-    3. create a metric and evaluate the results.
+Let's consider an example of evaluating a *flan-t5-small* model on the MT-Bench benchmark, specifically utilizing the single model rating evaluation part of the benchmark. In this part, we provide the LLM as a Judge with the input given to the evaluated model and the output it generated. The LLM as a Judge is asked to rate how well the output of the model addresses the request in the input.
-In LLM as judge metric, we should feed a judge model by the predictions of the model we want to test, and ask it to judge
-these prediction. The evaluation scores should be the predictions of the judge model.
+To accomplish this evaluation, we require the following:
-Therefore, LLM as a judge flow:
-    1. create dataset
+1. A Unitxt dataset card containing MT-Bench inputs, which will serve as the input for our evaluated model.
+2. A Unitxt template to be paired with the card. As the MT-Bench dataset already includes full prompts, there is no need to construct one using a template; hence, we'll opt for the *empty* template, which just passes the input prompt from the dataset to the model.
+3. A Unitxt format to be utilized with the card. Given that *flan* models do not demand special formatting of the inputs, we'll utilize the *empty* format here as well.
+4. An LLM as a judge metric leveraging the MT-Bench evaluation prompt.
-    2. use inference module to infer based on the dataset inputs.
+Fortunately, all these components are readily available in the Unitxt catalog, including a judge model based on *Mistral* from Huggingface that employs the MT-Bench format.
+From here, constructing the full Unitxt recipe string is standard and straightforward:
-    3. create a metric and evaluate the results.
-        3.1 create judging dataset, based on a desired specification (e.g. the desired template and format), and the prediction generated in (2.)
+
+.. code-block:: text
-
-        3.2 getting a judge model, and infer it by the dataset generated in (3.1)
+
+    card=cards.mt_bench.generation.english_single_turn,
+    template=templates.empty,
+    format=formats.empty,
+    metrics=[metrics.llm_as_judge.rating.mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn]
-
-        3.3 extract the results from the judge predictions
+
+.. note::
+
+    Pay attention!
+    We are using the mistralai/Mistral-7B-Instruct-v0.2 model from Huggingface. Using this model requires you to agree to the Terms of Use on the model page and to set the HUGGINGFACE_TOKEN environment variable. Other platforms might have different requirements. For example, if you are using an LLM as a judge based on the OpenAI platform, you will need to set your OpenAI API key.
-In order to create new LLM as a judge metric, one should decide which model should be the judge, and
-how to create it input text based on the prediction of the tested model.
-Lets review an example of adding a LLM by judge metric:
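One way to make the token available to the evaluation script below is to set it in the process environment before anything is downloaded (a minimal sketch, assuming you have already created a Hugging Face access token; the value shown is an illustrative placeholder, not a real token):

.. code-block:: python

    import os

    # Placeholder for illustration only - substitute your own Hugging Face access token.
    os.environ["HUGGINGFACE_TOKEN"] = "hf_xxx"

Exporting the variable in your shell before launching Python works just as well.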
+The following code performs the desired evaluation:

.. code-block:: python

-    import evaluate
    from datasets import load_dataset
    from unitxt.inference import HFPipelineBasedInferenceEngine
+    from unitxt import evaluate

    # 1. Create the dataset
-    dataset = load_dataset("unitxt/data", "card=cards.almost_evil,template=templates.qa.open.simple,"
-                           "metrics=[metrics.rag.model_response_assessment.llm_as_judge_by_flan_t5_large_on_hf_pipeline_using_mt_bench_template]",
+    card = ("card=cards.mt_bench.generation.english_single_turn,"
+            "template=templates.empty,"
+            "format=formats.empty,"
+            "metrics=[metrics.llm_as_judge.rating.mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn]"
+    )
+
+    dataset = load_dataset("unitxt/data",
+                           card,
                           split='test')
    # 2. use inference module to infer based on the dataset inputs.
-    inference_model = HFPipelineBasedInferenceEngine(model_name="google/flan-t5-small", max_new_tokens=32)
+    inference_model = HFPipelineBasedInferenceEngine(model_name="google/flan-t5-small", max_new_tokens=32, use_fp16=True)
    predictions = inference_model.infer(dataset)

+    # 3. create a metric and evaluate the results.
-    metric = evaluate.load("unitxt/metric")
-    scores = metric.compute(predictions=predictions, references=dataset)
+    scores = evaluate(predictions=predictions, data=dataset)

    [print(item) for item in scores[0]["score"]["global"].items()]

-In this case, we used the metric metrics.rag.model_response_assessment.llm_as_judge_by_flan_t5_large_on_hf_pipeline_using_mt_bench_template, which uses flan t5
-as a judge, and it use mt_bench recipe for creating the judging dataset.
-In order to create new LLM as a judge metric, you should simply use the LLMAsJudge class. For example, lets see the definition
-of metrics.rag.model_response_assessment.llm_as_judge_by_flan_t5_large_on_hf_pipeline_using_mt_bench_template:
+
+Creating a new LLM as a Judge Metric
+-------------------------------------
+
+To construct a new LLM as a Judge metric, several key components must be defined:
+
+1. **Judge Model**: Select a model that will assess the performance of other models.
+2. **Execution Platform**: Choose the platform responsible for executing the judge model, such as Huggingface or the OpenAI API.
+3. **The Judging Task**: This defines the inputs the judge model expects to receive and its output. This is coupled with the template. Two common tasks are the single model rating we saw above and pairwise model comparison, in which the outputs of two models are compared to see which better addresses the given input.
+4. **Template**: Develop a template reflecting the criteria for judgment, usually incorporating both the input and output of the evaluated model.
+5. **Format**: Specify the format in which the judge model expects to receive prompts.
+6. **System Prompt (Optional)**: Optionally, include a system prompt to provide additional context for evaluation.
+
+Let's walk through an example of creating a new LLM as a Judge metric, specifically recreating the MT-Bench judge metric single-model-rating evaluation:
+
+1. **Selecting a Judge Model**: We will utilize the *mistralai/Mistral-7B-Instruct-v0.2* model from Huggingface as our judge model.
+2. **Selecting an Execution Platform**: We will opt to execute the model locally using Huggingface.
+
+   For this example, we will use the *HFPipelineBasedInferenceEngine* class:
+
+   .. code-block:: python
+
+       from unitxt.inference import HFPipelineBasedInferenceEngine
+       from unitxt.llm_as_judge import LLMAsJudge
+
+       model_id = "mistralai/Mistral-7B-Instruct-v0.2"
+       inference_model = HFPipelineBasedInferenceEngine(model_name=model_id, max_new_tokens=256)
+
+
+   .. note::
+
+       If you wish to use a different platform for running your judge model, you can implement
+       a new `InferenceEngine` class and use it in place of the `HFPipelineBasedInferenceEngine`.
+       You can find the definition of the `InferenceEngine` abstract class and pre-built inference engines
+       (e.g., `OpenAiInferenceEngine`) in `src/unitxt/inference.py`.
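   For instance, the Llama-3 judge metrics added elsewhere in this change run on the IBM GenAI platform instead of a local Huggingface pipeline. A sketch of that swap, mirroring `prepare/metrics/llm_as_judge/rating/llama_3_ibm_genai_mt_bench_template.py` (it assumes a GENAI_KEY is configured in your environment, and the format should be switched to a Llama-3 chat format such as *formats.llama3_chat*):

   .. code-block:: python

       from unitxt.inference import (
           IbmGenAiInferenceEngine,
           IbmGenAiInferenceEngineParams,
       )

       # Requires credentials for the IBM GenAI service (the GENAI_KEY environment variable).
       gen_params = IbmGenAiInferenceEngineParams(max_new_tokens=256)
       inference_model = IbmGenAiInferenceEngine(
           model_name="meta-llama/llama-3-8b-instruct", parameters=gen_params
       )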
+3. **Selecting the Judging Task**: This is a standard Unitxt task that defines the API of the judge model. The task specifies the input fields expected by the judge model, such as "question" and "answer" in the example below, which are utilized in the subsequent template. Additionally, it defines the expected output field as a float type. Another significant field is "metrics", which is utilized for the (meta) evaluation of the judge, as explained in the following section. Currently supported tasks are "rating.single_turn" and "rating.single_turn_with_reference".
+
+   .. code-block:: python
+
+       from unitxt.blocks import FormTask
+       from unitxt.catalog import add_to_catalog
+
+       add_to_catalog(
+           FormTask(
+               inputs={"question": "str", "answer": "str"},
+               outputs={"rating": "float"},
+               metrics=["metrics.spearman"],
+           ),
+           "tasks.response_assessment.rating.single_turn",
+           overwrite=True,
+       )
+
+4. **Define the Template**: We want to construct a template that is identical to the MT-Bench judge metric. Note that this template has fields that are compatible with the task we chose ("question", "answer", and "rating").
+
+   .. code-block:: python
+
+       from unitxt import add_to_catalog
+       from unitxt.templates import InputOutputTemplate
+
+       add_to_catalog(
+           InputOutputTemplate(
+               instruction="Please act as an impartial judge and evaluate the quality of the response provided"
+               " by an AI assistant to the user question displayed below. Your evaluation should consider"
+               " factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of"
+               " detail of the response. Begin your evaluation by providing a short explanation. Be as"
+               " objective as possible. After providing your explanation, you must rate the response"
+               ' on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example:'
+               ' "Rating: [[5]]".\n\n',
+               input_format="[Question]\n{question}\n\n"
+               "[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]",
+               output_format="[[{rating}]]",
+               postprocessors=[
+                   r"processors.extract_mt_bench_rating_judgment",
+               ],
+           ),
+           "templates.response_assessment.rating.mt_bench_single_turn",
+           overwrite=True,
+       )
+
+   .. note::
+
+       Ensure the template includes a postprocessor for extracting the judgment from the judge model output and
+       passing it as a metric score. In our example, the template specifies for the judge the expected judgment format
+       ("you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]""),
+       and as such, it also defines the processor for extracting the judgment (postprocessors=[r"processors.extract_mt_bench_rating_judgment"],).
+       This processor simply extracts the number within [[ ]] and divides it by 10 in order to scale it to [0, 1].
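   To make that extraction step concrete, the logic of such a postprocessor is roughly equivalent to the following sketch (an illustration only, not the actual catalog implementation behind *processors.extract_mt_bench_rating_judgment*):

   .. code-block:: python

       import re


       def extract_rating(judge_output: str) -> float:
           """Extract the first [[<number>]] verdict and rescale it from 1-10 to [0, 1]."""
           match = re.search(r"\[\[(\d+\.?\d*)\]\]", judge_output)
           if match is None:
               return 0.0  # no well-formed rating found in the judge's output
           return float(match.group(1)) / 10


       print(extract_rating("The answer is helpful and accurate. Rating: [[8]]"))  # prints 0.8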
+5. **Define Format**: Define the format expected by the judge model for receiving prompts. For Mistral models, you can use the format already available in the Unitxt catalog under *formats.models.mistral.instruction*.
+
+6. **Define System Prompt**: We will not use a system prompt in this example.
+
+With these components defined, creating a new LLM as a Judge metric is straightforward:
+
 .. code-block:: python
+
    from unitxt import add_to_catalog
    from unitxt.inference import HFPipelineBasedInferenceEngine
    from unitxt.llm_as_judge import LLMAsJudge

+    model_id = "mistralai/Mistral-7B-Instruct-v0.2"
+    format = "formats.models.mistral.instruction"
+    template = "templates.response_assessment.rating.mt_bench_single_turn"
+    task = "rating.single_turn"
+
    inference_model = HFPipelineBasedInferenceEngine(
-        model_name="google/flan-t5-large", max_new_tokens=32
+        model_name=model_id, max_new_tokens=256, use_fp16=True
    )
-    recipe = (
-        "card=cards.rag.model_response_assessment.llm_as_judge_using_mt_bench_template,"
-        "template=templates.rag.model_response_assessment.llm_as_judge_using_mt_bench_template,"
-        "demos_pool_size=0,"
-        "num_demos=0"
+    model_label = model_id.split("/")[1].replace("-", "_").replace(".", "_").lower()
+    model_label = f"{model_label}_huggingface"
+    template_label = template.split(".")[-1]
+    metric_label = f"{model_label}_template_{template_label}"
+    metric = LLMAsJudge(
+        inference_model=inference_model,
+        template=template,
+        task=task,
+        format=format,
+        main_score=metric_label,
    )
-    metric = LLMAsJudge(inference_model=inference_model, recipe=recipe)
-
    add_to_catalog(
        metric,
-        "metrics.rag.model_response_assessment.llm_as_judge_by_flan_t5_large_on_hf_pipeline_using_mt_bench_template",
+        f"metrics.llm_as_judge.rating.{model_label}_template_{template_label}",
+        overwrite=True,
+    )
+
+
+
+.. note::
+
+    The *LLMAsJudge* class can receive the boolean argument *strip_system_prompt_and_format_from_inputs*
+    (defaulting to *True*). When set to *True*, any system prompts or formatting in the inputs received by
+    the evaluated model will be stripped.
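Once the metric is registered in the catalog, it can be referenced in a recipe string exactly like the pre-built judge we used earlier. A short sketch, assuming the catalog name produced by the registration code above:

.. code-block:: python

    from datasets import load_dataset

    # The metric name below is the one generated by the registration code above.
    card = ("card=cards.mt_bench.generation.english_single_turn,"
            "template=templates.empty,"
            "format=formats.empty,"
            "metrics=[metrics.llm_as_judge.rating.mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn]")

    dataset = load_dataset("unitxt/data", card, split="test")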
+Evaluating a LLMaJ metric (Meta-evaluation)
+--------------------------------------------
+But wait, we missed a step! How do we know whether the LLM as a judge we created is worth anything?
+The answer is: You evaluate it like any other model in Unitxt.
+Remember the task we defined in the previous section?
+
+ .. code-block:: python
+
+    from unitxt.blocks import FormTask
+    from unitxt.catalog import add_to_catalog
+
+    add_to_catalog(
+        FormTask(
+            inputs={"question": "str", "answer": "str"},
+            outputs={"rating": "float"},
+            metrics=["metrics.spearman"],
+        ),
+        "tasks.response_assessment.rating.single_turn",
+        overwrite=True,
+    )
+
+This task defines the (meta) evaluation of our LLMaJ model.
+We will fetch a dataset of MT-Bench inputs and model outputs, together with scores judged by GPT-4.
+We will consider these GPT-4 scores as our gold labels and evaluate our LLMaJ model by comparing its scores on the model outputs
+to the scores of GPT-4, using Spearman correlation as defined in the task.
+
+We will create a card, as we do for every other Unitxt scenario:
+
+.. code-block:: python
+
+    from unitxt.blocks import (
+        TaskCard,
+    )
+    from unitxt.catalog import add_to_catalog
+    from unitxt.loaders import LoadHF
+    from unitxt.operators import (
+        CopyFields,
+        FilterByCondition,
+        RenameFields,
+    )
+    from unitxt.processors import LiteralEval
+    from unitxt.splitters import RenameSplits
+    from unitxt.test_utils.card import test_card
+
+    card = TaskCard(
+        loader=LoadHF(path="OfirArviv/mt_bench_single_score_gpt4_judgement", split="train"),
+        preprocess_steps=[
+            RenameSplits({"train": "test"}),
+            FilterByCondition(values={"turn": 1}, condition="eq"),
+            FilterByCondition(values={"reference": "[]"}, condition="eq"),
+            RenameFields(
+                field_to_field={
+                    "model_input": "question",
+                    "score": "rating",
+                    "category": "group",
+                    "model_output": "answer",
+                }
+            ),
+            LiteralEval("question", to_field="question"),
+            CopyFields(field_to_field={"question/0": "question"}),
+            LiteralEval("answer", to_field="answer"),
+            CopyFields(field_to_field={"answer/0": "answer"}),
+        ],
+        task="tasks.response_assessment.rating.single_turn",
+        templates=["templates.response_assessment.rating.mt_bench_single_turn"],
+    )
+
+    test_card(card, demos_taken_from="test", strict=False)
+    add_to_catalog(
+        card,
+        "cards.mt_bench.response_assessment.rating.single_turn_gpt4_judgement",
        overwrite=True,
    )
-We can see, that each LLM as a judge metric needs two specifications:
-    1. Inference engine with a model for judging (You can use any inference engine that implements InferenceEngine, and any desired model).
-
-    2. Unitxt recipe for creating the judgment inputs.
-
-Please note, that since the metric performs nested inference, there should be a consistency between the main recipe, and the judgment recipe.
-    1. Since the judgment recipe uses the main recipe inputs and output, the names should match. In our example,
-    card.almost_evil uses tasks.qa.open task, which specify the input field "question" and the output field "answers".
-    On the other hand, cards.rag.model_response_assessment.llm_as_judge_using_mt_bench_template uses the task
-    tasks.rag.model_response_assessment. This task defined as input the fields "question" - which is consistent
-    with the main recipe field, and "model_output" - which is the standard name for the inference result. This task defines the
-    output field "rating_label" - which is a standard name.
-
-    2. Since LLM as a judge metric last step is extracting the judgment and passed it as a metric score, the template of the
-    recipe should define postprocessor for the extraction. Since the unitxt scores are in scase of [0, 1], the postprocessor
-    should convert the judgment to this scale. In our example, the card in the metric recipe -
-    cards.rag.model_response_assessment.llm_as_judge_using_mt_bench_template, uses the template "templates.rag.model_response_assessment.llm_as_judge_using_mt_bench_template".
-    This template specify for the judge how it expect the judgment format ("you must rate the response on a scale of 1
-    to 10 by strictly following this format: "[[rating]]""), and on the other hand, it defines the processor for extracting
-    the judgment. (postprocessors=[r"processors.extract_mt_bench_judgment"],). This processor simply extract the number within
-    [[ ]] and divide it by 10 in order to scale to to [0, 1].
\ No newline at end of file
+This is a card for the first-turn inputs of the MT-Bench benchmark (without reference),
+together with the outputs of multiple models for those inputs and the scores GPT-4 gave to those outputs.
+
+Now all we need to do is load the card, with the template and format the judge model is expected to use,
+and run it.
+
+.. code-block:: python
+
+    from datasets import load_dataset
+    from unitxt.inference import HFPipelineBasedInferenceEngine
+    from unitxt import evaluate
+
+    # 1. Create the dataset
+    card = ("card=cards.mt_bench.response_assessment.rating.single_turn_gpt4_judgement,"
+            "template=templates.response_assessment.rating.mt_bench_single_turn,"
+            "format=formats.models.mistral.instruction")
+
+    dataset = load_dataset("unitxt/data",
+                           card,
+                           split='test')
+    # 2. use inference module to infer based on the dataset inputs.
+    inference_model = HFPipelineBasedInferenceEngine(model_name="mistralai/Mistral-7B-Instruct-v0.2",
+                                                     max_new_tokens=256,
+                                                     use_fp16=True)
+    predictions = inference_model.infer(dataset)
+    # 3. create a metric and evaluate the results.
+    scores = evaluate(predictions=predictions, data=dataset)
+
+    [print(item) for item in scores[0]["score"]["global"].items()]
+
+The output of this code is:
+
+.. code-block:: text
+
+    ('spearmanr', 0.18328402960291354)
+    ('score', 0.18328402960291354)
+    ('score_name', 'spearmanr')
+    ('score_ci_low', 0.14680574316651868)
+    ('score_ci_high', 0.23030798909064645)
+    ('spearmanr_ci_low', 0.14680574316651868)
+    ('spearmanr_ci_high', 0.23030798909064645)
+
+We can see the Spearman correlation is *0.18*, which is considered low.
+This means *"mistralai/Mistral-7B-Instruct-v0.2"* is not a good model to act as an LLM as a Judge,
+at least when using the MT-Bench template.
+
+In order to understand precisely why this is so, we need to examine the outputs of the model.
+In this case, it seems Mistral is having difficulties outputting the scores in the double square brackets format.
+An example of the model output is:
+
+.. code-block:: text
+
+    Rating: 9
+
+    The assistant's response is engaging and provides a good balance between cultural experiences and must-see attractions in Hawaii. The description of the Polynesian Cultural Center and the Na Pali Coast are vivid and evoke a sense of wonder and excitement. The inclusion of traditional Hawaiian dishes adds depth and authenticity to the post. The response is also well-structured and easy to follow. However, the response could benefit from a few more specific details or anecdotes to make it even more engaging and memorable.
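A quick, informal way to quantify this is to count how many of the judge's raw outputs contain a well-formed double-square-bracket verdict at all (a rough sketch that reuses the ``predictions`` list from the evaluation code above):

.. code-block:: python

    import re

    rating_pattern = re.compile(r"\[\[\d+\.?\d*\]\]")
    well_formed = sum(1 for prediction in predictions if rating_pattern.search(prediction))
    print(f"{well_formed}/{len(predictions)} judge outputs contain a [[rating]] verdict")

A low ratio here suggests that the template's format instructions, rather than the judge's underlying quality assessments, are the main source of the weak correlation.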
\ No newline at end of file diff --git a/prepare/cards/dynamic_cards_for_llm_judges/llm_as_judge_metrics.py b/prepare/cards/dynamic_cards_for_llm_judges/llm_as_judge_metrics.py new file mode 100644 index 000000000..ace96f2d7 --- /dev/null +++ b/prepare/cards/dynamic_cards_for_llm_judges/llm_as_judge_metrics.py @@ -0,0 +1,15 @@ +from unitxt.blocks import TaskCard +from unitxt.catalog import add_to_catalog + +tasks = [ + "tasks.response_assessment.rating.single_turn", + "tasks.response_assessment.rating.single_turn_with_reference", +] +for task in tasks: + card = TaskCard(loader=None, preprocess_steps=[], task=task) + sub_task = ".".join(task.split(".")[-2:]) + add_to_catalog( + card, + f"cards.dynamic_cards_for_llm_judges.{sub_task}", + overwrite=True, + ) diff --git a/prepare/cards/mt_bench/generation/english_single_turn.py b/prepare/cards/mt_bench/generation/english_single_turn.py new file mode 100644 index 000000000..b06c25a29 --- /dev/null +++ b/prepare/cards/mt_bench/generation/english_single_turn.py @@ -0,0 +1,42 @@ +from unitxt.blocks import ( + TaskCard, +) +from unitxt.catalog import add_to_catalog +from unitxt.loaders import LoadHF +from unitxt.operators import ( + AddFields, + CopyFields, + RenameFields, +) +from unitxt.splitters import RenameSplits +from unitxt.test_utils.card import test_card + +card = TaskCard( + loader=LoadHF(path="dim/mt_bench_en", split="train"), + preprocess_steps=[ + RenameSplits({"train": "test"}), + CopyFields(field_to_field={"turns/0": "turns"}), + RenameFields( + field_to_field={ + "turns": "input", + "category": "group", + } + ), + AddFields( + fields={ + "output": "None", + "type_of_input": "question", + "type_of_output": "answer", + } + ), + ], + task="tasks.generation", + templates=["templates.empty"], +) + +test_card(card, demos_taken_from="test", strict=False) +add_to_catalog( + card, + "cards.mt_bench.generation.english_single_turn", + overwrite=True, +) diff --git a/prepare/cards/mt_bench/generation/japanese_single_turn.py b/prepare/cards/mt_bench/generation/japanese_single_turn.py new file mode 100644 index 000000000..bcb882602 --- /dev/null +++ b/prepare/cards/mt_bench/generation/japanese_single_turn.py @@ -0,0 +1,42 @@ +from unitxt.blocks import ( + TaskCard, +) +from unitxt.catalog import add_to_catalog +from unitxt.loaders import LoadHF +from unitxt.operators import ( + AddFields, + CopyFields, + RenameFields, +) +from unitxt.splitters import RenameSplits +from unitxt.test_utils.card import test_card + +card = TaskCard( + loader=LoadHF(path="shi3z/MTbenchJapanese", split="train"), + preprocess_steps=[ + RenameSplits({"train": "test"}), + CopyFields(field_to_field={"turns/0": "turns"}), + RenameFields( + field_to_field={ + "turns": "input", + "category": "group", + } + ), + AddFields( + fields={ + "output": "None", + "type_of_input": "question", + "type_of_output": "answer", + } + ), + ], + task="tasks.generation", + templates=["templates.empty"], +) + +test_card(card, demos_taken_from="test", strict=False) +add_to_catalog( + card, + "cards.mt_bench.generation.japanese_single_turn", + overwrite=True, +) diff --git a/prepare/cards/mt_bench/response_assessment/pairwise_comparison/multi_turn_gpt4_judgement.py b/prepare/cards/mt_bench/response_assessment/pairwise_comparison/multi_turn_gpt4_judgement.py new file mode 100644 index 000000000..e105d89f2 --- /dev/null +++ b/prepare/cards/mt_bench/response_assessment/pairwise_comparison/multi_turn_gpt4_judgement.py @@ -0,0 +1,62 @@ +from unitxt.blocks import ( + TaskCard, +) +from unitxt.catalog 
import add_to_catalog +from unitxt.loaders import LoadHF +from unitxt.operators import ( + FilterByCondition, + InterleaveListsToDialogOperator, + MapInstanceValues, + RenameFields, +) +from unitxt.processors import LiteralEval +from unitxt.splitters import RenameSplits +from unitxt.test_utils.card import test_card + +card = TaskCard( + loader=LoadHF( + path="OfirArviv/mt_bench_pairwise_comparison_gpt4_judgments", split="train" + ), + preprocess_steps=[ + RenameSplits({"train": "test"}), + FilterByCondition(values={"turn": 2}, condition="eq"), + FilterByCondition(values={"reference": "[]"}, condition="eq"), + FilterByCondition( + values={"winner": ["model_1", "tie", "model_2"]}, condition="in" + ), + MapInstanceValues( + mappers={ + "winner": {"model_1": "choice_a", "model_2": "choice_b", "tie": "tie"} + } + ), + RenameFields( + field_to_field={ + "category": "group", + } + ), + LiteralEval("model_input", to_field="model_input"), + LiteralEval("model_1_output", to_field="model_1_output"), + LiteralEval("model_2_output", to_field="model_2_output"), + InterleaveListsToDialogOperator( + user_turns_field="model_input", + assistant_turns_field="model_1_output", + to_field="dialog_a", + ), + InterleaveListsToDialogOperator( + user_turns_field="model_input", + assistant_turns_field="model_2_output", + to_field="dialog_b", + ), + ], + task="tasks.response_assessment.pairwise_comparison.multi_turn", + templates=[ + "templates.response_assessment.pairwise_comparison.mt_bench_multi_turn_with_shuffle" + ], +) + +test_card(card, demos_taken_from="test", strict=False) +add_to_catalog( + card, + "cards.mt_bench.response_assessment.pairwise_comparison.multi_turn_gpt4_judgement", + overwrite=True, +) diff --git a/prepare/cards/mt_bench/response_assessment/pairwise_comparison/multi_turn_with_reference_gpt4_judgement.py b/prepare/cards/mt_bench/response_assessment/pairwise_comparison/multi_turn_with_reference_gpt4_judgement.py new file mode 100644 index 000000000..d74affc31 --- /dev/null +++ b/prepare/cards/mt_bench/response_assessment/pairwise_comparison/multi_turn_with_reference_gpt4_judgement.py @@ -0,0 +1,64 @@ +from unitxt.blocks import ( + TaskCard, +) +from unitxt.catalog import add_to_catalog +from unitxt.loaders import LoadHF +from unitxt.operators import ( + FilterByCondition, + InterleaveListsToDialogOperator, + MapInstanceValues, + RenameFields, +) +from unitxt.processors import LiteralEval +from unitxt.splitters import RenameSplits +from unitxt.test_utils.card import test_card + +card = TaskCard( + loader=LoadHF( + path="OfirArviv/mt_bench_pairwise_comparison_gpt4_judgments", split="train" + ), + preprocess_steps=[ + RenameSplits({"train": "test"}), + FilterByCondition(values={"turn": 2}, condition="eq"), + FilterByCondition(values={"reference": "[]"}, condition="ne"), + FilterByCondition( + values={"winner": ["model_1", "tie", "model_2"]}, condition="in" + ), + MapInstanceValues( + mappers={ + "winner": {"model_1": "choice_a", "model_2": "choice_b", "tie": "tie"} + } + ), + RenameFields(field_to_field={"category": "group"}), + LiteralEval("model_input", to_field="model_input"), + LiteralEval("model_1_output", to_field="model_1_output"), + LiteralEval("model_2_output", to_field="model_2_output"), + LiteralEval("reference", to_field="reference"), + InterleaveListsToDialogOperator( + user_turns_field="model_input", + assistant_turns_field="model_1_output", + to_field="dialog_a", + ), + InterleaveListsToDialogOperator( + user_turns_field="model_input", + assistant_turns_field="model_2_output", + 
to_field="dialog_b", + ), + InterleaveListsToDialogOperator( + user_turns_field="model_input", + assistant_turns_field="reference", + to_field="reference_dialog", + ), + ], + task="tasks.response_assessment.pairwise_comparison.multi_turn_with_reference", + templates=[ + "templates.response_assessment.pairwise_comparison.mt_bench_multi_turn_with_reference_with_shuffle" + ], +) + +test_card(card, demos_taken_from="test", strict=False, loader_limit=1000) +add_to_catalog( + card, + "cards.mt_bench.response_assessment.pairwise_comparison.multi_turn_with_reference_gpt4_judgement", + overwrite=True, +) diff --git a/prepare/cards/mt_bench/response_assessment/pairwise_comparison/single_turn_gpt4_judgement.py b/prepare/cards/mt_bench/response_assessment/pairwise_comparison/single_turn_gpt4_judgement.py new file mode 100644 index 000000000..4002a4edd --- /dev/null +++ b/prepare/cards/mt_bench/response_assessment/pairwise_comparison/single_turn_gpt4_judgement.py @@ -0,0 +1,58 @@ +from unitxt.blocks import ( + TaskCard, +) +from unitxt.catalog import add_to_catalog +from unitxt.loaders import LoadHF +from unitxt.operators import ( + CopyFields, + FilterByCondition, + MapInstanceValues, + RenameFields, +) +from unitxt.processors import LiteralEval +from unitxt.splitters import RenameSplits +from unitxt.test_utils.card import test_card + +card = TaskCard( + loader=LoadHF( + path="OfirArviv/mt_bench_pairwise_comparison_gpt4_judgments", split="train" + ), + preprocess_steps=[ + RenameSplits({"train": "test"}), + FilterByCondition(values={"turn": 1}, condition="eq"), + FilterByCondition(values={"reference": "[]"}, condition="eq"), + FilterByCondition( + values={"winner": ["model_1", "tie", "model_2"]}, condition="in" + ), + MapInstanceValues( + mappers={ + "winner": {"model_1": "choice_a", "model_2": "choice_b", "tie": "tie"} + } + ), + RenameFields( + field_to_field={ + "model_input": "question", + "model_1_output": "answer_a", + "model_2_output": "answer_b", + "category": "group", + } + ), + LiteralEval("question", to_field="question"), + CopyFields(field_to_field={"question/0": "question"}), + LiteralEval("answer_a", to_field="answer_a"), + CopyFields(field_to_field={"answer_a/0": "answer_a"}), + LiteralEval("answer_b", to_field="answer_b"), + CopyFields(field_to_field={"answer_b/0": "answer_b"}), + ], + task="tasks.response_assessment.pairwise_comparison.single_turn", + templates=[ + "templates.response_assessment.pairwise_comparison.mt_bench_single_turn_with_shuffle" + ], +) + +test_card(card, demos_taken_from="test", strict=False) +add_to_catalog( + card, + "cards.mt_bench.response_assessment.pairwise_comparison.single_turn_gpt4_judgement", + overwrite=True, +) diff --git a/prepare/cards/mt_bench/response_assessment/pairwise_comparison/single_turn_with_reference_gpt4_judgement.py b/prepare/cards/mt_bench/response_assessment/pairwise_comparison/single_turn_with_reference_gpt4_judgement.py new file mode 100644 index 000000000..20b577bb9 --- /dev/null +++ b/prepare/cards/mt_bench/response_assessment/pairwise_comparison/single_turn_with_reference_gpt4_judgement.py @@ -0,0 +1,61 @@ +from unitxt.blocks import ( + TaskCard, +) +from unitxt.catalog import add_to_catalog +from unitxt.loaders import LoadHF +from unitxt.operators import ( + CopyFields, + FilterByCondition, + MapInstanceValues, + RenameFields, +) +from unitxt.processors import LiteralEval +from unitxt.splitters import RenameSplits +from unitxt.test_utils.card import test_card + +card = TaskCard( + loader=LoadHF( + 
path="OfirArviv/mt_bench_pairwise_comparison_gpt4_judgments", split="train" + ), + preprocess_steps=[ + RenameSplits({"train": "test"}), + FilterByCondition(values={"turn": 1}, condition="eq"), + FilterByCondition(values={"reference": "[]"}, condition="ne"), + FilterByCondition( + values={"winner": ["model_1", "tie", "model_2"]}, condition="in" + ), + MapInstanceValues( + mappers={ + "winner": {"model_1": "choice_a", "model_2": "choice_b", "tie": "tie"} + } + ), + RenameFields( + field_to_field={ + "model_input": "question", + "model_1_output": "answer_a", + "model_2_output": "answer_b", + "reference": "reference_answer", + "category": "group", + } + ), + LiteralEval("question", to_field="question"), + CopyFields(field_to_field={"question/0": "question"}), + LiteralEval("answer_a", to_field="answer_a"), + CopyFields(field_to_field={"answer_a/0": "answer_a"}), + LiteralEval("answer_b", to_field="answer_b"), + CopyFields(field_to_field={"answer_b/0": "answer_b"}), + LiteralEval("reference_answer", to_field="reference_answer"), + CopyFields(field_to_field={"reference_answer/0": "reference_answer"}), + ], + task="tasks.response_assessment.pairwise_comparison.single_turn_with_reference", + templates=[ + "templates.response_assessment.pairwise_comparison.mt_bench_single_turn_with_reference_with_shuffle" + ], +) + +test_card(card, demos_taken_from="test", strict=False, loader_limit=1000) +add_to_catalog( + card, + "cards.mt_bench.response_assessment.pairwise_comparison.single_turn_with_reference_gpt4_judgement", + overwrite=True, +) diff --git a/prepare/cards/mt_bench/response_assessment/rating/multi_turn_gpt4_judgement.py b/prepare/cards/mt_bench/response_assessment/rating/multi_turn_gpt4_judgement.py new file mode 100644 index 000000000..320c32ac2 --- /dev/null +++ b/prepare/cards/mt_bench/response_assessment/rating/multi_turn_gpt4_judgement.py @@ -0,0 +1,39 @@ +from unitxt.blocks import ( + TaskCard, +) +from unitxt.catalog import add_to_catalog +from unitxt.loaders import LoadHF +from unitxt.operators import ( + FilterByCondition, + InterleaveListsToDialogOperator, + RenameFields, +) +from unitxt.processors import LiteralEval +from unitxt.splitters import RenameSplits +from unitxt.test_utils.card import test_card + +card = TaskCard( + loader=LoadHF(path="OfirArviv/mt_bench_single_score_gpt4_judgement", split="train"), + preprocess_steps=[ + RenameSplits({"train": "test"}), + FilterByCondition(values={"turn": 2}, condition="eq"), + FilterByCondition(values={"reference": "[]"}, condition="eq"), + RenameFields(field_to_field={"score": "rating", "category": "group"}), + LiteralEval("model_input", to_field="model_input"), + LiteralEval("model_output", to_field="model_output"), + InterleaveListsToDialogOperator( + user_turns_field="model_input", + assistant_turns_field="model_output", + to_field="dialog", + ), + ], + task="tasks.response_assessment.rating.multi_turn", + templates=["templates.response_assessment.rating.mt_bench_multi_turn"], +) + +test_card(card, demos_taken_from="test", strict=False) +add_to_catalog( + card, + "cards.mt_bench.response_assessment.rating.multi_turn_gpt4_judgement", + overwrite=True, +) diff --git a/prepare/cards/mt_bench/response_assessment/rating/multi_turn_with_reference_gpt4_judgement.py b/prepare/cards/mt_bench/response_assessment/rating/multi_turn_with_reference_gpt4_judgement.py new file mode 100644 index 000000000..d8f16541e --- /dev/null +++ b/prepare/cards/mt_bench/response_assessment/rating/multi_turn_with_reference_gpt4_judgement.py @@ -0,0 +1,47 @@ +from 
unitxt.blocks import ( + TaskCard, +) +from unitxt.catalog import add_to_catalog +from unitxt.loaders import LoadHF +from unitxt.operators import ( + FilterByCondition, + InterleaveListsToDialogOperator, + RenameFields, +) +from unitxt.processors import LiteralEval +from unitxt.splitters import RenameSplits +from unitxt.test_utils.card import test_card + +card = TaskCard( + loader=LoadHF(path="OfirArviv/mt_bench_single_score_gpt4_judgement", split="train"), + preprocess_steps=[ + RenameSplits({"train": "test"}), + FilterByCondition(values={"turn": 2}, condition="eq"), + FilterByCondition(values={"reference": "[]"}, condition="ne"), + RenameFields(field_to_field={"score": "rating", "category": "group"}), + LiteralEval("model_input", to_field="model_input"), + LiteralEval("model_output", to_field="model_output"), + LiteralEval("reference", to_field="reference"), + InterleaveListsToDialogOperator( + user_turns_field="model_input", + assistant_turns_field="model_output", + to_field="dialog", + ), + InterleaveListsToDialogOperator( + user_turns_field="model_input", + assistant_turns_field="reference", + to_field="reference_dialog", + ), + ], + task="tasks.response_assessment.rating.multi_turn_with_reference", + templates=[ + "templates.response_assessment.rating.mt_bench_multi_turn_with_reference" + ], +) + +test_card(card, demos_taken_from="test", strict=False, loader_limit=1000) +add_to_catalog( + card, + "cards.mt_bench.response_assessment.rating.multi_turn_with_reference_gpt4_judgement", + overwrite=True, +) diff --git a/prepare/cards/mt_bench/response_assessment/rating/single_turn_gpt4_judgement.py b/prepare/cards/mt_bench/response_assessment/rating/single_turn_gpt4_judgement.py new file mode 100644 index 000000000..f0eba1c47 --- /dev/null +++ b/prepare/cards/mt_bench/response_assessment/rating/single_turn_gpt4_judgement.py @@ -0,0 +1,43 @@ +from unitxt.blocks import ( + TaskCard, +) +from unitxt.catalog import add_to_catalog +from unitxt.loaders import LoadHF +from unitxt.operators import ( + CopyFields, + FilterByCondition, + RenameFields, +) +from unitxt.processors import LiteralEval +from unitxt.splitters import RenameSplits +from unitxt.test_utils.card import test_card + +card = TaskCard( + loader=LoadHF(path="OfirArviv/mt_bench_single_score_gpt4_judgement", split="train"), + preprocess_steps=[ + RenameSplits({"train": "test"}), + FilterByCondition(values={"turn": 1}, condition="eq"), + FilterByCondition(values={"reference": "[]"}, condition="eq"), + RenameFields( + field_to_field={ + "model_input": "question", + "score": "rating", + "category": "group", + "model_output": "answer", + } + ), + LiteralEval("question", to_field="question"), + CopyFields(field_to_field={"question/0": "question"}), + LiteralEval("answer", to_field="answer"), + CopyFields(field_to_field={"answer/0": "answer"}), + ], + task="tasks.response_assessment.rating.single_turn", + templates=["templates.response_assessment.rating.mt_bench_single_turn"], +) + +test_card(card, demos_taken_from="test", strict=False) +add_to_catalog( + card, + "cards.mt_bench.response_assessment.rating.single_turn_gpt4_judgement", + overwrite=True, +) diff --git a/prepare/cards/mt_bench/response_assessment/rating/single_turn_with_reference_gpt4_judgement.py b/prepare/cards/mt_bench/response_assessment/rating/single_turn_with_reference_gpt4_judgement.py new file mode 100644 index 000000000..d56fd7491 --- /dev/null +++ b/prepare/cards/mt_bench/response_assessment/rating/single_turn_with_reference_gpt4_judgement.py @@ -0,0 +1,48 @@ +from 
unitxt.blocks import ( + TaskCard, +) +from unitxt.catalog import add_to_catalog +from unitxt.loaders import LoadHF +from unitxt.operators import ( + CopyFields, + FilterByCondition, + RenameFields, +) +from unitxt.processors import LiteralEval +from unitxt.splitters import RenameSplits +from unitxt.test_utils.card import test_card + +card = TaskCard( + loader=LoadHF(path="OfirArviv/mt_bench_single_score_gpt4_judgement", split="train"), + preprocess_steps=[ + RenameSplits({"train": "test"}), + FilterByCondition(values={"turn": 1}, condition="eq"), + FilterByCondition(values={"reference": "[]"}, condition="ne"), + RenameFields( + field_to_field={ + "model_input": "question", + "score": "rating", + "category": "group", + "reference": "reference_answer", + "model_output": "answer", + } + ), + LiteralEval("question", to_field="question"), + CopyFields(field_to_field={"question/0": "question"}), + LiteralEval("answer", to_field="answer"), + CopyFields(field_to_field={"answer/0": "answer"}), + LiteralEval("reference_answer", to_field="reference_answer"), + CopyFields(field_to_field={"reference_answer/0": "reference_answer"}), + ], + task="tasks.response_assessment.rating.single_turn_with_reference", + templates=[ + "templates.response_assessment.rating.mt_bench_single_turn_with_reference" + ], +) + +test_card(card, demos_taken_from="test", strict=False, loader_limit=1000) +add_to_catalog( + card, + "cards.mt_bench.response_assessment.rating.single_turn_with_reference_gpt4_judgement", + overwrite=True, +) diff --git a/prepare/cards/mt_bench_judge.py b/prepare/cards/mt_bench_judge.py deleted file mode 100644 index 0e6975715..000000000 --- a/prepare/cards/mt_bench_judge.py +++ /dev/null @@ -1,17 +0,0 @@ -from unitxt.blocks import TaskCard -from unitxt.catalog import add_to_catalog - -card = TaskCard( - loader=None, - preprocess_steps=[], - task="tasks.rag.model_response_assessment", - templates=[ - "templates.rag.model_response_assessment.llm_as_judge_using_mt_bench_template" - ], -) - -add_to_catalog( - card, - "cards.rag.model_response_assessment.llm_as_judge_using_mt_bench_template", - overwrite=True, -) diff --git a/prepare/formats/models/llama3.py b/prepare/formats/models/llama3.py index 5bad2f80f..b56f6fd54 100644 --- a/prepare/formats/models/llama3.py +++ b/prepare/formats/models/llama3.py @@ -1,13 +1,26 @@ from unitxt.catalog import add_to_catalog from unitxt.formats import SystemFormat -# see: https://huggingface.co/blog/llama3#how-to-prompt-llama-3 +# see: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/ +# <|begin_of_text|><|start_header_id|>system<|end_header_id|> +# {{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|> +# {{ user_message }}<|eot_id|><|start_header_id|>assistant<|end_header_id|> format = SystemFormat( - model_input_format="<|start_header_id|>system<|end_header_id|>\n" + demo_format="{source}\n\n{target_prefix}{target}\n\n", + model_input_format="<|begin_of_text|><|eot_id|><|start_header_id|>user<|end_header_id|>\n" + "{instruction}{demos}{source}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n" + "{target_prefix}", +) + +add_to_catalog(format, "formats.llama3_chat", overwrite=True) + +format = SystemFormat( + demo_format="{source}\n\n{target_prefix}{target}\n\n", + model_input_format="<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n" "{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n" - "{instruction}\\N{source}\\N{target_prefix}<|eot_id|>" - 
"<|start_header_id|>assistant<|end_header_id|>\n", + "{instruction}{demos}{source}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n" + "{target_prefix}", ) -add_to_catalog(format, "formats.models.llama3", overwrite=True) +add_to_catalog(format, "formats.llama3_chat_with_system_prompt", overwrite=True) diff --git a/prepare/formats/models/phi_3.py b/prepare/formats/models/phi_3.py new file mode 100644 index 000000000..87cba2ae6 --- /dev/null +++ b/prepare/formats/models/phi_3.py @@ -0,0 +1,13 @@ +from unitxt.catalog import add_to_catalog +from unitxt.formats import SystemFormat + +format = SystemFormat( + demo_format="<|user|>\n{instruction}{source}<|end|>\n" + "<|assistant|>\n{target_prefix}{target}<|end|>\n", + model_input_format="<|user|>\n{system_prompt}<|end|>\n" + "{demos}" + "<|user|>\n{instruction}{source}<|end|>\n" + "<|assistant|>\n{target_prefix}", +) + +add_to_catalog(format, "formats.models.phi_3", overwrite=True) diff --git a/prepare/metrics/llm_as_judge/rating/llama_3_ibm_genai_mt_bench_template.py b/prepare/metrics/llm_as_judge/rating/llama_3_ibm_genai_mt_bench_template.py new file mode 100644 index 000000000..c53fcc1a5 --- /dev/null +++ b/prepare/metrics/llm_as_judge/rating/llama_3_ibm_genai_mt_bench_template.py @@ -0,0 +1,34 @@ +from unitxt import add_to_catalog +from unitxt.inference import ( + IbmGenAiInferenceEngine, + IbmGenAiInferenceEngineParams, +) +from unitxt.llm_as_judge import LLMAsJudge + +model_list = ["meta-llama/llama-3-8b-instruct", "meta-llama/llama-3-70b-instruct"] +format = "formats.llama3_chat" +template = "templates.response_assessment.rating.mt_bench_single_turn" +task = "rating.single_turn" + +gen_params = IbmGenAiInferenceEngineParams(max_new_tokens=252) +for model_id in model_list: + inference_model = IbmGenAiInferenceEngine( + model_name=model_id, parameters=gen_params + ) + model_label = model_id.split("/")[1].replace("-", "_").replace(".", ",").lower() + model_label = f"{model_label}_ibm_genai" + template_label = template.split(".")[-1] + metric_label = f"{model_label}_template_{template_label}" + metric = LLMAsJudge( + inference_model=inference_model, + template=template, + task=task, + format=format, + main_score=metric_label, + ) + + add_to_catalog( + metric, + f"metrics.llm_as_judge.rating.{model_label}_template_{template_label}", + overwrite=True, + ) diff --git a/prepare/metrics/llm_as_judge/rating/mistral_huggingface_mt_bench_template.py b/prepare/metrics/llm_as_judge/rating/mistral_huggingface_mt_bench_template.py new file mode 100644 index 000000000..4d7b2ec9e --- /dev/null +++ b/prepare/metrics/llm_as_judge/rating/mistral_huggingface_mt_bench_template.py @@ -0,0 +1,30 @@ +from unitxt import add_to_catalog +from unitxt.inference import HFPipelineBasedInferenceEngine +from unitxt.llm_as_judge import LLMAsJudge + +model_list = ["mistralai/Mistral-7B-Instruct-v0.2"] +format = "formats.models.mistral.instruction" +template = "templates.response_assessment.rating.mt_bench_single_turn" +task = "rating.single_turn" + +for model_id in model_list: + inference_model = HFPipelineBasedInferenceEngine( + model_name=model_id, max_new_tokens=256, use_fp16=True + ) + model_label = model_id.split("/")[1].replace("-", "_").replace(".", "_").lower() + model_label = f"{model_label}_huggingface" + template_label = template.split(".")[-1] + metric_label = f"{model_label}_template_{template_label}" + metric = LLMAsJudge( + inference_model=inference_model, + template=template, + task=task, + format=format, + main_score=metric_label, + ) + + 
add_to_catalog( + metric, + f"metrics.llm_as_judge.rating.{model_label}_template_{template_label}", + overwrite=True, + ) diff --git a/prepare/metrics/llm_as_judge_response_assessment.py b/prepare/metrics/llm_as_judge_response_assessment.py deleted file mode 100644 index 800daf57c..000000000 --- a/prepare/metrics/llm_as_judge_response_assessment.py +++ /dev/null @@ -1,23 +0,0 @@ -from unitxt import add_to_catalog -from unitxt.inference import HFPipelineBasedInferenceEngine -from unitxt.llm_as_judge import LLMAsJudge - -inference_model = HFPipelineBasedInferenceEngine( - model_name="google/flan-t5-large", max_new_tokens=32 -) -recipe = ( - "card=cards.rag.model_response_assessment.llm_as_judge_using_mt_bench_template," - "template=templates.rag.model_response_assessment.llm_as_judge_using_mt_bench_template," - "demos_pool_size=0," - "num_demos=0" -) -main_score = "llm_as_judge_by_flan_t5_large_on_hf_pipeline_using_mt_bench_template" -metric = LLMAsJudge( - inference_model=inference_model, recipe=recipe, main_score=main_score -) - -add_to_catalog( - metric, - "metrics.rag.model_response_assessment.llm_as_judge_by_flan_t5_large_on_hf_pipeline_using_mt_bench_template", - overwrite=True, -) diff --git a/prepare/processors/processors.py b/prepare/processors/processors.py index 3a08e7da3..92d574965 100644 --- a/prepare/processors/processors.py +++ b/prepare/processors/processors.py @@ -6,7 +6,8 @@ from unitxt.processors import ( Capitalize, ConvertToBoolean, - ExtractMtBenchJudgment, + ExtractMtBenchLabelJudgment, + ExtractMtBenchRatingJudgment, ExtractWithRegex, FirstCharacter, GetStringAfter, @@ -319,16 +320,32 @@ add_to_catalog( SequentialOperator( steps=[ - ExtractMtBenchJudgment( + ExtractMtBenchRatingJudgment( field="prediction", ), - ExtractMtBenchJudgment( + ExtractMtBenchRatingJudgment( field="references", process_every_value=True, ), ] ), - "processors.extract_mt_bench_judgment", + "processors.extract_mt_bench_rating_judgment", + overwrite=True, +) + +add_to_catalog( + SequentialOperator( + steps=[ + ExtractMtBenchLabelJudgment( + field="prediction", + ), + ExtractMtBenchLabelJudgment( + field="references", + process_every_value=True, + ), + ] + ), + "processors.extract_mt_bench_label_judgment", overwrite=True, ) diff --git a/prepare/tasks/response_assessment/pairwise_comparison/multi_turn.py b/prepare/tasks/response_assessment/pairwise_comparison/multi_turn.py new file mode 100644 index 000000000..898ac6e6a --- /dev/null +++ b/prepare/tasks/response_assessment/pairwise_comparison/multi_turn.py @@ -0,0 +1,17 @@ +from unitxt.blocks import FormTask +from unitxt.catalog import add_to_catalog + +add_to_catalog( + FormTask( + inputs={ + "dialog_a": "List[Tuple[str, str]]", + "dialog_b": "List[Tuple[str, str]]", + }, + outputs={ + "winner": "str" + }, # TODO: Support and change to "Literal['choice_a', 'choice_b', 'tie']"}, + metrics=["metrics.accuracy"], + ), + "tasks.response_assessment.pairwise_comparison.multi_turn", + overwrite=True, +) diff --git a/prepare/tasks/response_assessment/pairwise_comparison/multi_turn_with_reference.py b/prepare/tasks/response_assessment/pairwise_comparison/multi_turn_with_reference.py new file mode 100644 index 000000000..ddb206ad3 --- /dev/null +++ b/prepare/tasks/response_assessment/pairwise_comparison/multi_turn_with_reference.py @@ -0,0 +1,18 @@ +from unitxt.blocks import FormTask +from unitxt.catalog import add_to_catalog + +add_to_catalog( + FormTask( + inputs={ + "dialog_a": "List[Tuple[str, str]]", + "dialog_b": "List[Tuple[str, str]]", + 
"reference_dialog": "List[Tuple[str, str]]", + }, + outputs={ + "winner": "str" + }, # TODO: Support and change to "Literal['choice_a', 'choice_b', 'tie']"}, + metrics=["metrics.accuracy"], + ), + "tasks.response_assessment.pairwise_comparison.multi_turn_with_reference", + overwrite=True, +) diff --git a/prepare/tasks/response_assessment/pairwise_comparison/single_turn.py b/prepare/tasks/response_assessment/pairwise_comparison/single_turn.py new file mode 100644 index 000000000..c1912cacf --- /dev/null +++ b/prepare/tasks/response_assessment/pairwise_comparison/single_turn.py @@ -0,0 +1,18 @@ +from unitxt.blocks import FormTask +from unitxt.catalog import add_to_catalog + +add_to_catalog( + FormTask( + inputs={ + "question": "str", + "answer_a": "str", + "answer_b": "str", + }, + outputs={ + "winner": "str" + }, # TODO: Support and change to "Literal['choice_a', 'choice_b', 'tie']" + metrics=["metrics.accuracy"], + ), + "tasks.response_assessment.pairwise_comparison.single_turn", + overwrite=True, +) diff --git a/prepare/tasks/response_assessment/pairwise_comparison/single_turn_with_reference.py b/prepare/tasks/response_assessment/pairwise_comparison/single_turn_with_reference.py new file mode 100644 index 000000000..cdf9d38ba --- /dev/null +++ b/prepare/tasks/response_assessment/pairwise_comparison/single_turn_with_reference.py @@ -0,0 +1,19 @@ +from unitxt.blocks import FormTask +from unitxt.catalog import add_to_catalog + +add_to_catalog( + FormTask( + inputs={ + "question": "str", + "answer_a": "str", + "answer_b": "str", + "reference_answer": "str", + }, + outputs={ + "winner": "str" + }, # TODO: Support and change to "Literal['choice_a', 'choice_b', 'tie']"}, + metrics=["metrics.accuracy"], + ), + "tasks.response_assessment.pairwise_comparison.single_turn_with_reference", + overwrite=True, +) diff --git a/prepare/tasks/llm_as_judge/model_response_assessment.py b/prepare/tasks/response_assessment/rating/multi_turn.py similarity index 55% rename from prepare/tasks/llm_as_judge/model_response_assessment.py rename to prepare/tasks/response_assessment/rating/multi_turn.py index d649794b1..a0d95258c 100644 --- a/prepare/tasks/llm_as_judge/model_response_assessment.py +++ b/prepare/tasks/response_assessment/rating/multi_turn.py @@ -3,10 +3,10 @@ add_to_catalog( FormTask( - inputs=["question", "model_output"], - outputs=["rating_label"], + inputs={"dialog": "List[Tuple[str, str]]"}, + outputs={"rating": "float"}, metrics=["metrics.spearman"], ), - "tasks.rag.model_response_assessment", + "tasks.response_assessment.rating.multi_turn", overwrite=True, ) diff --git a/prepare/tasks/response_assessment/rating/multi_turn_with_reference.py b/prepare/tasks/response_assessment/rating/multi_turn_with_reference.py new file mode 100644 index 000000000..ea2f40165 --- /dev/null +++ b/prepare/tasks/response_assessment/rating/multi_turn_with_reference.py @@ -0,0 +1,15 @@ +from unitxt.blocks import FormTask +from unitxt.catalog import add_to_catalog + +add_to_catalog( + FormTask( + inputs={ + "dialog": "List[Tuple[str, str]]", + "reference_dialog": "List[Tuple[str, str]]", + }, + outputs={"rating": "float"}, + metrics=["metrics.spearman"], + ), + "tasks.response_assessment.rating.multi_turn_with_reference", + overwrite=True, +) diff --git a/prepare/tasks/response_assessment/rating/single_turn.py b/prepare/tasks/response_assessment/rating/single_turn.py new file mode 100644 index 000000000..277081f64 --- /dev/null +++ b/prepare/tasks/response_assessment/rating/single_turn.py @@ -0,0 +1,12 @@ +from 
unitxt.blocks import FormTask +from unitxt.catalog import add_to_catalog + +add_to_catalog( + FormTask( + inputs={"question": "str", "answer": "str"}, + outputs={"rating": "float"}, + metrics=["metrics.spearman"], + ), + "tasks.response_assessment.rating.single_turn", + overwrite=True, +) diff --git a/prepare/tasks/response_assessment/rating/single_turn_with_reference.py b/prepare/tasks/response_assessment/rating/single_turn_with_reference.py new file mode 100644 index 000000000..b77f573cd --- /dev/null +++ b/prepare/tasks/response_assessment/rating/single_turn_with_reference.py @@ -0,0 +1,12 @@ +from unitxt.blocks import FormTask +from unitxt.catalog import add_to_catalog + +add_to_catalog( + FormTask( + inputs={"question": "str", "answer": "str", "reference_answer": "str"}, + outputs={"rating": "float"}, + metrics=["metrics.spearman"], + ), + "tasks.response_assessment.rating.single_turn_with_reference", + overwrite=True, +) diff --git a/prepare/templates/llm_as_judge/mt_bench.py b/prepare/templates/llm_as_judge/mt_bench.py deleted file mode 100644 index d43eb7edc..000000000 --- a/prepare/templates/llm_as_judge/mt_bench.py +++ /dev/null @@ -1,22 +0,0 @@ -from unitxt.catalog import add_to_catalog -from unitxt.templates import InputOutputTemplate - -add_to_catalog( - InputOutputTemplate( - instruction="Please act as an impartial judge and evaluate the quality of the response" - " provided by an AI assistant to the user question displayed below. Your evaluation should" - " consider factors such as the helpfulness, relevance, accuracy, depth, creativity, " - " and level of detail of the response. Begin your evaluation by providing a short" - " explanation. Be as objective as possible. After providing your explanation, you must rate" - ' the response on a scale of 1 to 10 by strictly following this format: "[[rating]]",' - ' for example: "[[5]]".\n\n', - input_format="[Question]\n{question}\n\n[The Start of Assistant's Answer]" - "\n{model_output}\n[The End of Assistant's Answer]", - output_format="{rating_label}", - postprocessors=[ - r"processors.extract_mt_bench_judgment", - ], - ), - "templates.rag.model_response_assessment.llm_as_judge_using_mt_bench_template", - overwrite=True, -) diff --git a/prepare/templates/response_assessment/pairwise_comparison/mt_bench_multi_turn_with_reference_with_shuffle.py b/prepare/templates/response_assessment/pairwise_comparison/mt_bench_multi_turn_with_reference_with_shuffle.py new file mode 100644 index 000000000..33e4f9b43 --- /dev/null +++ b/prepare/templates/response_assessment/pairwise_comparison/mt_bench_multi_turn_with_reference_with_shuffle.py @@ -0,0 +1,62 @@ +from unitxt.catalog import add_to_catalog +from unitxt.templates import DialogFieldsData, DialogPairwiseChoiceTemplate + +add_to_catalog( + DialogPairwiseChoiceTemplate( + dialog_fields=[ + DialogFieldsData( + dialog_field="reference_dialog", + assistant_role_label="### Reference answer:", + user_role_label="### User:", + system_role_label="### System:", + ), + DialogFieldsData( + dialog_field="dialog_a", + assistant_role_label="### Assistant A:", + user_role_label="### User:", + system_role_label="### System:", + ), + DialogFieldsData( + dialog_field="dialog_b", + assistant_role_label="### Assistant B:", + user_role_label="### User:", + system_role_label="### System:", + ), + ], + turns_separator="\n\n", + label_separator="\n", + choice_a_field="dialog_a", + choice_b_field="dialog_b", + answer_field="winner", + choice_a_label="A", + choice_b_label="B", + choice_tie_label="C", + 
shuffle=True, + instruction="Please act as an impartial judge and evaluate the quality of the responses provided by two AI" + " assistants to the user questions. Your evaluation should consider correctness and helpfulness." + " You will be given reference answers, the assistant A's answers, the assistant B's answers." + " Your job is to determine which assistant provides correct and helpful answers to the second" + " user question. Begin your evaluation by comparing both assistants' answers with the reference" + " answers. Identify and correct any mistakes. Avoid any position biases and ensure that the order" + " in which the responses were presented does not influence your decision. Do not allow the length" + " of the responses to influence your evaluation. Do not favor certain names of the assistants." + " Be as objective as possible. After providing your explanation, output your final verdict by" + ' strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is' + ' better, and "[[C]]" for a tie.\n\n', + input_format="<|The Start of Reference Answer|>\n\n" + "{reference_dialog}\n\n" + "<|The End of Reference Answer|>\n\n\n" + "<|The Start of Assistant A's Conversation with User|>\n\n" + "{dialog_a}\n\n" + "<|The End of Assistant A's Conversation with User|>\n\n\n" + "<|The Start of Assistant B's Conversation with User|>\n\n" + "{dialog_b}\n\n" + "<|The End of Assistant B's Conversation with User|>", + output_format="[[{winner}]]", + postprocessors=[ + r"processors.extract_mt_bench_label_judgment", + ], + ), + "templates.response_assessment.pairwise_comparison.mt_bench_multi_turn_with_reference_with_shuffle", + overwrite=True, +) diff --git a/prepare/templates/response_assessment/pairwise_comparison/mt_bench_multi_turn_with_shuffle.py b/prepare/templates/response_assessment/pairwise_comparison/mt_bench_multi_turn_with_shuffle.py new file mode 100644 index 000000000..66bd6fc49 --- /dev/null +++ b/prepare/templates/response_assessment/pairwise_comparison/mt_bench_multi_turn_with_shuffle.py @@ -0,0 +1,54 @@ +from unitxt import add_to_catalog +from unitxt.templates import DialogFieldsData, DialogPairwiseChoiceTemplate + +add_to_catalog( + DialogPairwiseChoiceTemplate( + dialog_fields=[ + DialogFieldsData( + dialog_field="dialog_a", + assistant_role_label="### Assistant A:", + user_role_label="### User:", + system_role_label="### System:", + ), + DialogFieldsData( + dialog_field="dialog_b", + assistant_role_label="### Assistant B:", + user_role_label="### User:", + system_role_label="### System:", + ), + ], + turns_separator="\n\n", + label_separator="\n", + choice_a_field="dialog_a", + choice_b_field="dialog_b", + answer_field="winner", + choice_a_label="A", + choice_b_label="B", + choice_tie_label="C", + shuffle=True, + instruction="Please act as an impartial judge and evaluate the quality of the responses provided by two AI" + " assistants to the user questions. You should choose the assistant that follows the user's" + " instructions and answers the user's questions better. Your evaluation should consider factors" + " such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their" + " responses. You should focus on who provides a better answer to the second user question. " + "Begin your evaluation by comparing the responses of the two assistants and provide a short" + " explanation. Avoid any position biases and ensure that the order in which the responses were" + " presented does not influence your decision. 
Do not allow the length of the responses to" + " influence your evaluation. Do not favor certain names of the assistants. Be as objective as" + " possible. After providing your explanation, output your final verdict by strictly" + ' following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better,' + ' and "[[C]]" for a tie.\n\n', + input_format="<|The Start of Assistant A's Conversation with User|>\n\n" + "{dialog_a}\n\n" + "<|The End of Assistant A's Conversation with User|>\n\n\n" + "<|The Start of Assistant B's Conversation with User|>\n\n" + "{dialog_b}\n\n" + "<|The End of Assistant B's Conversation with User|>", + output_format="[[{winner}]]", + postprocessors=[ + r"processors.extract_mt_bench_label_judgment", + ], + ), + "templates.response_assessment.pairwise_comparison.mt_bench_multi_turn_with_shuffle", + overwrite=True, +) diff --git a/prepare/templates/response_assessment/pairwise_comparison/mt_bench_single_turn_with_reference_with_shuffle.py b/prepare/templates/response_assessment/pairwise_comparison/mt_bench_single_turn_with_reference_with_shuffle.py new file mode 100644 index 000000000..55e1712d4 --- /dev/null +++ b/prepare/templates/response_assessment/pairwise_comparison/mt_bench_single_turn_with_reference_with_shuffle.py @@ -0,0 +1,35 @@ +from unitxt import add_to_catalog +from unitxt.templates import PairwiseChoiceTemplate + +add_to_catalog( + PairwiseChoiceTemplate( + choice_a_field="answer_a", + choice_b_field="answer_b", + answer_field="winner", + choice_a_label="A", + choice_b_label="B", + choice_tie_label="C", + shuffle=True, + instruction="Please act as an impartial judge and evaluate the quality of the responses provided by two AI" + " assistants to the user question displayed below. Your evaluation should consider correctness" + " and helpfulness. You will be given a reference answer, assistant A's answer, and assistant" + " B's answer. Your job is to evaluate which assistant's answer is better. Begin your evaluation" + " by comparing both assistants' answers with the reference answer. Identify and correct any" + " mistakes. Avoid any position biases and ensure that the order in which the responses were" + " presented does not influence your decision. Do not allow the length of the responses to" + " influence your evaluation. Do not favor certain names of the assistants. Be as objective" + " as possible. 
After providing your explanation, output your final verdict by strictly" + ' following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better,' + ' and "[[C]]" for a tie.\n\n', + input_format="[User Question]\n{question}\n\n" + "[The Start of Reference Answer]\n{reference_answer}\n[The End of Reference Answer]\n\n" + "[The Start of Assistant A's Answer]\n{answer_a}\n[The End of Assistant A's Answer]\n\n" + "[The Start of Assistant B's Answer]\n{answer_b}\n[The End of Assistant B's Answer]", + output_format="[[{winner}]]", + postprocessors=[ + r"processors.extract_mt_bench_label_judgment", + ], + ), + "templates.response_assessment.pairwise_comparison.mt_bench_single_turn_with_reference_with_shuffle", + overwrite=True, +) diff --git a/prepare/templates/response_assessment/pairwise_comparison/mt_bench_single_turn_with_shuffle.py b/prepare/templates/response_assessment/pairwise_comparison/mt_bench_single_turn_with_shuffle.py new file mode 100644 index 000000000..1a062f5e4 --- /dev/null +++ b/prepare/templates/response_assessment/pairwise_comparison/mt_bench_single_turn_with_shuffle.py @@ -0,0 +1,34 @@ +from unitxt import add_to_catalog +from unitxt.templates import PairwiseChoiceTemplate + +add_to_catalog( + PairwiseChoiceTemplate( + choice_a_field="answer_a", + choice_b_field="answer_b", + answer_field="winner", + choice_a_label="A", + choice_b_label="B", + choice_tie_label="C", + shuffle=True, + instruction="Please act as an impartial judge and evaluate the quality of the responses provided by two" + " AI assistants to the user question displayed below. You should choose the assistant that" + " follows the user's instructions and answers the user's question better. Your evaluation should" + " consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of" + " detail of their responses. Begin your evaluation by comparing the two responses and provide a" + " short explanation. Avoid any position biases and ensure that the order in which the responses" + " were presented does not influence your decision. Do not allow the length of the responses to" + " influence your evaluation. Do not favor certain names of the assistants. Be as objective as" + " possible. 
After providing your explanation, output your final verdict by strictly following" + ' this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better,' + ' and "[[C]]" for a tie.\n\n', + input_format="[User Question]\n{question}\n\n" + "[The Start of Assistant A's Answer]\n{answer_a}\n[The End of Assistant A's Answer]\n\n" + "[The Start of Assistant B's Answer]\n{answer_b}\n[The End of Assistant B's Answer]", + output_format="[[{winner}]]", + postprocessors=[ + r"processors.extract_mt_bench_label_judgment", + ], + ), + "templates.response_assessment.pairwise_comparison.mt_bench_single_turn_with_shuffle", + overwrite=True, +) diff --git a/prepare/templates/response_assessment/rating/mt_bench_multi_turn.py b/prepare/templates/response_assessment/rating/mt_bench_multi_turn.py new file mode 100644 index 000000000..545c48b7f --- /dev/null +++ b/prepare/templates/response_assessment/rating/mt_bench_multi_turn.py @@ -0,0 +1,33 @@ +from unitxt.catalog import add_to_catalog +from unitxt.templates import DialogFieldsData, DialogTemplate + +add_to_catalog( + DialogTemplate( + dialog_fields=[ + DialogFieldsData( + dialog_field="dialog", + assistant_role_label="### Assistant A:", + user_role_label="### User:", + system_role_label="### System:", + ), + ], + turns_separator="\n\n", + label_separator="\n", + instruction="Please act as an impartial judge and evaluate the quality of the response provided by an" + " AI assistant to the user question displayed below. Your evaluation should consider factors" + " such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail" + " of the response. You evaluation should focus on the assistant's answer to the second user" + " question. Begin your evaluation by providing a short explanation. Be as objective as possible." + " After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly" + ' following this format: "[[rating]]", for example: "Rating: [[5]]".\n\n', + input_format="<|The Start of Assistant A's Conversation with User|>\n\n" + "{dialog}\n\n" + "<|The End of Assistant A's Conversation with User|>", + output_format="[[{rating}]]", + postprocessors=[ + r"processors.extract_mt_bench_rating_judgment", + ], + ), + "templates.response_assessment.rating.mt_bench_multi_turn", + overwrite=True, +) diff --git a/prepare/templates/response_assessment/rating/mt_bench_multi_turn_with_reference.py b/prepare/templates/response_assessment/rating/mt_bench_multi_turn_with_reference.py new file mode 100644 index 000000000..8f0d23978 --- /dev/null +++ b/prepare/templates/response_assessment/rating/mt_bench_multi_turn_with_reference.py @@ -0,0 +1,43 @@ +from unitxt.catalog import add_to_catalog +from unitxt.templates import DialogFieldsData, DialogTemplate + +add_to_catalog( + DialogTemplate( + dialog_fields=[ + DialogFieldsData( + dialog_field="reference_dialog", + assistant_role_label="### Reference answer:", + user_role_label="### User:", + system_role_label="### System:", + ), + DialogFieldsData( + dialog_field="dialog", + assistant_role_label="### Assistant A:", + user_role_label="### User:", + system_role_label="### System:", + ), + ], + turns_separator="\n\n", + label_separator="\n", + instruction="Please act as an impartial judge and evaluate the quality of the response provided by an AI" + " assistant to the user question. Your evaluation should consider correctness and helpfulness." + " You will be given a reference answer and the assistant's answer. 
You evaluation should focus" + " on the assistant's answer to the second question. Begin your evaluation by comparing the" + " assistant's answer with the reference answer. Identify and correct any mistakes." + " Be as objective as possible. After providing your explanation, you must rate the response on" + ' a scale of 1 to 10 by strictly following this format: "[[rating]]",' + ' for example: "Rating: [[5]]".\n\n', + input_format="<|The Start of Reference Answer|>\n\n" + "{reference_dialog}\n\n" + "<|The End of Reference Answer|>\n\n\n" + "<|The Start of Assistant A's Conversation with User|>\n\n" + "{dialog}\n\n" + "<|The End of Assistant A's Conversation with User|>", + output_format="[[{rating}]]", + postprocessors=[ + r"processors.extract_mt_bench_rating_judgment", + ], + ), + "templates.response_assessment.rating.mt_bench_multi_turn_with_reference", + overwrite=True, +) diff --git a/prepare/templates/response_assessment/rating/mt_bench_single_turn.py b/prepare/templates/response_assessment/rating/mt_bench_single_turn.py new file mode 100644 index 000000000..e6024c8c9 --- /dev/null +++ b/prepare/templates/response_assessment/rating/mt_bench_single_turn.py @@ -0,0 +1,22 @@ +from unitxt import add_to_catalog +from unitxt.templates import InputOutputTemplate + +add_to_catalog( + InputOutputTemplate( + instruction="Please act as an impartial judge and evaluate the quality of the response provided" + " by an AI assistant to the user question displayed below. Your evaluation should consider" + " factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of" + " detail of the response. Begin your evaluation by providing a short explanation. Be as" + " objective as possible. After providing your explanation, you must rate the response" + ' on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example:' + ' "Rating: [[5]]".\n\n', + input_format="[Question]\n{question}\n\n" + "[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", + output_format="[[{rating}]]", + postprocessors=[ + r"processors.extract_mt_bench_rating_judgment", + ], + ), + "templates.response_assessment.rating.mt_bench_single_turn", + overwrite=True, +) diff --git a/prepare/templates/response_assessment/rating/mt_bench_single_turn_with_reference.py b/prepare/templates/response_assessment/rating/mt_bench_single_turn_with_reference.py new file mode 100644 index 000000000..3888a8bdd --- /dev/null +++ b/prepare/templates/response_assessment/rating/mt_bench_single_turn_with_reference.py @@ -0,0 +1,23 @@ +from unitxt.catalog import add_to_catalog +from unitxt.templates import InputOutputTemplate + +add_to_catalog( + InputOutputTemplate( + instruction="Please act as an impartial judge and evaluate the quality of the response provided" + " by an AI assistant to the user question displayed below. Your evaluation should consider" + " correctness and helpfulness. You will be given a reference answer and the assistant's answer." + " Begin your evaluation by comparing the assistant's answer with the reference answer." + " Identify and correct any mistakes. Be as objective as possible. 
After providing your explanation," + " you must rate the response on a scale of 1 to 10 by strictly following this format:" + ' "[[rating]]", for example: "Rating: [[5]]".\n\n', + input_format="[Question]\n{question}\n\n" + "[The Start of Reference Answer]\n{reference_answer}\n[The End of Reference Answer]\n\n" + "[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", + output_format="[[{rating}]]", + postprocessors=[ + r"processors.extract_mt_bench_rating_judgment", + ], + ), + "templates.response_assessment.rating.mt_bench_single_turn_with_reference", + overwrite=True, +) diff --git a/requirements/tests.rqr b/requirements/tests.rqr index ca6730c33..4a3f0f86d 100644 --- a/requirements/tests.rqr +++ b/requirements/tests.rqr @@ -16,3 +16,5 @@ llama-index-core llama-index-llms-openai pytrec-eval SentencePiece +openai +ibm-generative-ai \ No newline at end of file diff --git a/src/unitxt/catalog/cards/dynamic_cards_for_llm_judges/rating/single_turn.json b/src/unitxt/catalog/cards/dynamic_cards_for_llm_judges/rating/single_turn.json new file mode 100644 index 000000000..add45855e --- /dev/null +++ b/src/unitxt/catalog/cards/dynamic_cards_for_llm_judges/rating/single_turn.json @@ -0,0 +1,6 @@ +{ + "type": "task_card", + "loader": null, + "preprocess_steps": [], + "task": "tasks.response_assessment.rating.single_turn" +} diff --git a/src/unitxt/catalog/cards/dynamic_cards_for_llm_judges/rating/single_turn_with_reference.json b/src/unitxt/catalog/cards/dynamic_cards_for_llm_judges/rating/single_turn_with_reference.json new file mode 100644 index 000000000..673925217 --- /dev/null +++ b/src/unitxt/catalog/cards/dynamic_cards_for_llm_judges/rating/single_turn_with_reference.json @@ -0,0 +1,6 @@ +{ + "type": "task_card", + "loader": null, + "preprocess_steps": [], + "task": "tasks.response_assessment.rating.single_turn_with_reference" +} diff --git a/src/unitxt/catalog/cards/mt_bench/generation/english_single_turn.json b/src/unitxt/catalog/cards/mt_bench/generation/english_single_turn.json new file mode 100644 index 000000000..7466583b6 --- /dev/null +++ b/src/unitxt/catalog/cards/mt_bench/generation/english_single_turn.json @@ -0,0 +1,41 @@ +{ + "type": "task_card", + "loader": { + "type": "load_hf", + "path": "dim/mt_bench_en", + "split": "train" + }, + "preprocess_steps": [ + { + "type": "rename_splits", + "mapper": { + "train": "test" + } + }, + { + "type": "copy_fields", + "field_to_field": { + "turns/0": "turns" + } + }, + { + "type": "rename_fields", + "field_to_field": { + "turns": "input", + "category": "group" + } + }, + { + "type": "add_fields", + "fields": { + "output": "None", + "type_of_input": "question", + "type_of_output": "answer" + } + } + ], + "task": "tasks.generation", + "templates": [ + "templates.empty" + ] +} diff --git a/src/unitxt/catalog/cards/mt_bench/generation/japanese_single_turn.json b/src/unitxt/catalog/cards/mt_bench/generation/japanese_single_turn.json new file mode 100644 index 000000000..7d6c4b448 --- /dev/null +++ b/src/unitxt/catalog/cards/mt_bench/generation/japanese_single_turn.json @@ -0,0 +1,41 @@ +{ + "type": "task_card", + "loader": { + "type": "load_hf", + "path": "shi3z/MTbenchJapanese", + "split": "train" + }, + "preprocess_steps": [ + { + "type": "rename_splits", + "mapper": { + "train": "test" + } + }, + { + "type": "copy_fields", + "field_to_field": { + "turns/0": "turns" + } + }, + { + "type": "rename_fields", + "field_to_field": { + "turns": "input", + "category": "group" + } + }, + { + "type": "add_fields", + "fields": { 
+ "output": "None", + "type_of_input": "question", + "type_of_output": "answer" + } + } + ], + "task": "tasks.generation", + "templates": [ + "templates.empty" + ] +} diff --git a/src/unitxt/catalog/cards/mt_bench/response_assessment/pairwise_comparison/multi_turn_gpt4_judgement.json b/src/unitxt/catalog/cards/mt_bench/response_assessment/pairwise_comparison/multi_turn_gpt4_judgement.json new file mode 100644 index 000000000..9d304e9c7 --- /dev/null +++ b/src/unitxt/catalog/cards/mt_bench/response_assessment/pairwise_comparison/multi_turn_gpt4_judgement.json @@ -0,0 +1,88 @@ +{ + "type": "task_card", + "loader": { + "type": "load_hf", + "path": "OfirArviv/mt_bench_pairwise_comparison_gpt4_judgments", + "split": "train" + }, + "preprocess_steps": [ + { + "type": "rename_splits", + "mapper": { + "train": "test" + } + }, + { + "type": "filter_by_condition", + "values": { + "turn": 2 + }, + "condition": "eq" + }, + { + "type": "filter_by_condition", + "values": { + "reference": "[]" + }, + "condition": "eq" + }, + { + "type": "filter_by_condition", + "values": { + "winner": [ + "model_1", + "tie", + "model_2" + ] + }, + "condition": "in" + }, + { + "type": "map_instance_values", + "mappers": { + "winner": { + "model_1": "choice_a", + "model_2": "choice_b", + "tie": "tie" + } + } + }, + { + "type": "rename_fields", + "field_to_field": { + "category": "group" + } + }, + { + "type": "literal_eval", + "to_field": "model_input", + "field": "model_input" + }, + { + "type": "literal_eval", + "to_field": "model_1_output", + "field": "model_1_output" + }, + { + "type": "literal_eval", + "to_field": "model_2_output", + "field": "model_2_output" + }, + { + "type": "interleave_lists_to_dialog_operator", + "user_turns_field": "model_input", + "assistant_turns_field": "model_1_output", + "to_field": "dialog_a" + }, + { + "type": "interleave_lists_to_dialog_operator", + "user_turns_field": "model_input", + "assistant_turns_field": "model_2_output", + "to_field": "dialog_b" + } + ], + "task": "tasks.response_assessment.pairwise_comparison.multi_turn", + "templates": [ + "templates.response_assessment.pairwise_comparison.mt_bench_multi_turn_with_shuffle" + ] +} diff --git a/src/unitxt/catalog/cards/mt_bench/response_assessment/pairwise_comparison/multi_turn_with_reference_gpt4_judgement.json b/src/unitxt/catalog/cards/mt_bench/response_assessment/pairwise_comparison/multi_turn_with_reference_gpt4_judgement.json new file mode 100644 index 000000000..3ed076bac --- /dev/null +++ b/src/unitxt/catalog/cards/mt_bench/response_assessment/pairwise_comparison/multi_turn_with_reference_gpt4_judgement.json @@ -0,0 +1,99 @@ +{ + "type": "task_card", + "loader": { + "type": "load_hf", + "path": "OfirArviv/mt_bench_pairwise_comparison_gpt4_judgments", + "split": "train" + }, + "preprocess_steps": [ + { + "type": "rename_splits", + "mapper": { + "train": "test" + } + }, + { + "type": "filter_by_condition", + "values": { + "turn": 2 + }, + "condition": "eq" + }, + { + "type": "filter_by_condition", + "values": { + "reference": "[]" + }, + "condition": "ne" + }, + { + "type": "filter_by_condition", + "values": { + "winner": [ + "model_1", + "tie", + "model_2" + ] + }, + "condition": "in" + }, + { + "type": "map_instance_values", + "mappers": { + "winner": { + "model_1": "choice_a", + "model_2": "choice_b", + "tie": "tie" + } + } + }, + { + "type": "rename_fields", + "field_to_field": { + "category": "group" + } + }, + { + "type": "literal_eval", + "to_field": "model_input", + "field": "model_input" + }, + { + "type": 
"literal_eval", + "to_field": "model_1_output", + "field": "model_1_output" + }, + { + "type": "literal_eval", + "to_field": "model_2_output", + "field": "model_2_output" + }, + { + "type": "literal_eval", + "to_field": "reference", + "field": "reference" + }, + { + "type": "interleave_lists_to_dialog_operator", + "user_turns_field": "model_input", + "assistant_turns_field": "model_1_output", + "to_field": "dialog_a" + }, + { + "type": "interleave_lists_to_dialog_operator", + "user_turns_field": "model_input", + "assistant_turns_field": "model_2_output", + "to_field": "dialog_b" + }, + { + "type": "interleave_lists_to_dialog_operator", + "user_turns_field": "model_input", + "assistant_turns_field": "reference", + "to_field": "reference_dialog" + } + ], + "task": "tasks.response_assessment.pairwise_comparison.multi_turn_with_reference", + "templates": [ + "templates.response_assessment.pairwise_comparison.mt_bench_multi_turn_with_reference_with_shuffle" + ] +} diff --git a/src/unitxt/catalog/cards/mt_bench/response_assessment/pairwise_comparison/single_turn_gpt4_judgement.json b/src/unitxt/catalog/cards/mt_bench/response_assessment/pairwise_comparison/single_turn_gpt4_judgement.json new file mode 100644 index 000000000..c2c5d5213 --- /dev/null +++ b/src/unitxt/catalog/cards/mt_bench/response_assessment/pairwise_comparison/single_turn_gpt4_judgement.json @@ -0,0 +1,97 @@ +{ + "type": "task_card", + "loader": { + "type": "load_hf", + "path": "OfirArviv/mt_bench_pairwise_comparison_gpt4_judgments", + "split": "train" + }, + "preprocess_steps": [ + { + "type": "rename_splits", + "mapper": { + "train": "test" + } + }, + { + "type": "filter_by_condition", + "values": { + "turn": 1 + }, + "condition": "eq" + }, + { + "type": "filter_by_condition", + "values": { + "reference": "[]" + }, + "condition": "eq" + }, + { + "type": "filter_by_condition", + "values": { + "winner": [ + "model_1", + "tie", + "model_2" + ] + }, + "condition": "in" + }, + { + "type": "map_instance_values", + "mappers": { + "winner": { + "model_1": "choice_a", + "model_2": "choice_b", + "tie": "tie" + } + } + }, + { + "type": "rename_fields", + "field_to_field": { + "model_input": "question", + "model_1_output": "answer_a", + "model_2_output": "answer_b", + "category": "group" + } + }, + { + "type": "literal_eval", + "to_field": "question", + "field": "question" + }, + { + "type": "copy_fields", + "field_to_field": { + "question/0": "question" + } + }, + { + "type": "literal_eval", + "to_field": "answer_a", + "field": "answer_a" + }, + { + "type": "copy_fields", + "field_to_field": { + "answer_a/0": "answer_a" + } + }, + { + "type": "literal_eval", + "to_field": "answer_b", + "field": "answer_b" + }, + { + "type": "copy_fields", + "field_to_field": { + "answer_b/0": "answer_b" + } + } + ], + "task": "tasks.response_assessment.pairwise_comparison.single_turn", + "templates": [ + "templates.response_assessment.pairwise_comparison.mt_bench_single_turn_with_shuffle" + ] +} diff --git a/src/unitxt/catalog/cards/mt_bench/response_assessment/pairwise_comparison/single_turn_with_reference_gpt4_judgement.json b/src/unitxt/catalog/cards/mt_bench/response_assessment/pairwise_comparison/single_turn_with_reference_gpt4_judgement.json new file mode 100644 index 000000000..51b5bf574 --- /dev/null +++ b/src/unitxt/catalog/cards/mt_bench/response_assessment/pairwise_comparison/single_turn_with_reference_gpt4_judgement.json @@ -0,0 +1,109 @@ +{ + "type": "task_card", + "loader": { + "type": "load_hf", + "path": 
"OfirArviv/mt_bench_pairwise_comparison_gpt4_judgments", + "split": "train" + }, + "preprocess_steps": [ + { + "type": "rename_splits", + "mapper": { + "train": "test" + } + }, + { + "type": "filter_by_condition", + "values": { + "turn": 1 + }, + "condition": "eq" + }, + { + "type": "filter_by_condition", + "values": { + "reference": "[]" + }, + "condition": "ne" + }, + { + "type": "filter_by_condition", + "values": { + "winner": [ + "model_1", + "tie", + "model_2" + ] + }, + "condition": "in" + }, + { + "type": "map_instance_values", + "mappers": { + "winner": { + "model_1": "choice_a", + "model_2": "choice_b", + "tie": "tie" + } + } + }, + { + "type": "rename_fields", + "field_to_field": { + "model_input": "question", + "model_1_output": "answer_a", + "model_2_output": "answer_b", + "reference": "reference_answer", + "category": "group" + } + }, + { + "type": "literal_eval", + "to_field": "question", + "field": "question" + }, + { + "type": "copy_fields", + "field_to_field": { + "question/0": "question" + } + }, + { + "type": "literal_eval", + "to_field": "answer_a", + "field": "answer_a" + }, + { + "type": "copy_fields", + "field_to_field": { + "answer_a/0": "answer_a" + } + }, + { + "type": "literal_eval", + "to_field": "answer_b", + "field": "answer_b" + }, + { + "type": "copy_fields", + "field_to_field": { + "answer_b/0": "answer_b" + } + }, + { + "type": "literal_eval", + "to_field": "reference_answer", + "field": "reference_answer" + }, + { + "type": "copy_fields", + "field_to_field": { + "reference_answer/0": "reference_answer" + } + } + ], + "task": "tasks.response_assessment.pairwise_comparison.single_turn_with_reference", + "templates": [ + "templates.response_assessment.pairwise_comparison.mt_bench_single_turn_with_reference_with_shuffle" + ] +} diff --git a/src/unitxt/catalog/cards/mt_bench/response_assessment/rating/multi_turn_gpt4_judgement.json b/src/unitxt/catalog/cards/mt_bench/response_assessment/rating/multi_turn_gpt4_judgement.json new file mode 100644 index 000000000..baed8bf69 --- /dev/null +++ b/src/unitxt/catalog/cards/mt_bench/response_assessment/rating/multi_turn_gpt4_judgement.json @@ -0,0 +1,57 @@ +{ + "type": "task_card", + "loader": { + "type": "load_hf", + "path": "OfirArviv/mt_bench_single_score_gpt4_judgement", + "split": "train" + }, + "preprocess_steps": [ + { + "type": "rename_splits", + "mapper": { + "train": "test" + } + }, + { + "type": "filter_by_condition", + "values": { + "turn": 2 + }, + "condition": "eq" + }, + { + "type": "filter_by_condition", + "values": { + "reference": "[]" + }, + "condition": "eq" + }, + { + "type": "rename_fields", + "field_to_field": { + "score": "rating", + "category": "group" + } + }, + { + "type": "literal_eval", + "to_field": "model_input", + "field": "model_input" + }, + { + "type": "literal_eval", + "to_field": "model_output", + "field": "model_output" + }, + { + "type": "interleave_lists_to_dialog_operator", + "user_turns_field": "model_input", + "assistant_turns_field": "model_output", + "to_field": "dialog" + } + ], + "task": "tasks.response_assessment.rating.multi_turn", + "templates": [ + "templates.response_assessment.rating.mt_bench_multi_turn" + ] +} diff --git a/src/unitxt/catalog/cards/mt_bench/response_assessment/rating/multi_turn_with_reference_gpt4_judgement.json b/src/unitxt/catalog/cards/mt_bench/response_assessment/rating/multi_turn_with_reference_gpt4_judgement.json new file mode 100644 index 000000000..db73ee548 --- /dev/null +++ 
b/src/unitxt/catalog/cards/mt_bench/response_assessment/rating/multi_turn_with_reference_gpt4_judgement.json @@ -0,0 +1,68 @@ +{ + "type": "task_card", + "loader": { + "type": "load_hf", + "path": "OfirArviv/mt_bench_single_score_gpt4_judgement", + "split": "train" + }, + "preprocess_steps": [ + { + "type": "rename_splits", + "mapper": { + "train": "test" + } + }, + { + "type": "filter_by_condition", + "values": { + "turn": 2 + }, + "condition": "eq" + }, + { + "type": "filter_by_condition", + "values": { + "reference": "[]" + }, + "condition": "ne" + }, + { + "type": "rename_fields", + "field_to_field": { + "score": "rating", + "category": "group" + } + }, + { + "type": "literal_eval", + "to_field": "model_input", + "field": "model_input" + }, + { + "type": "literal_eval", + "to_field": "model_output", + "field": "model_output" + }, + { + "type": "literal_eval", + "to_field": "reference", + "field": "reference" + }, + { + "type": "interleave_lists_to_dialog_operator", + "user_turns_field": "model_input", + "assistant_turns_field": "model_output", + "to_field": "dialog" + }, + { + "type": "interleave_lists_to_dialog_operator", + "user_turns_field": "model_input", + "assistant_turns_field": "reference", + "to_field": "reference_dialog" + } + ], + "task": "tasks.response_assessment.rating.multi_turn_with_reference", + "templates": [ + "templates.response_assessment.rating.mt_bench_multi_turn_with_reference" + ] +} diff --git a/src/unitxt/catalog/cards/mt_bench/response_assessment/rating/single_turn_gpt4_judgement.json b/src/unitxt/catalog/cards/mt_bench/response_assessment/rating/single_turn_gpt4_judgement.json new file mode 100644 index 000000000..b6bd1f89c --- /dev/null +++ b/src/unitxt/catalog/cards/mt_bench/response_assessment/rating/single_turn_gpt4_judgement.json @@ -0,0 +1,65 @@ +{ + "type": "task_card", + "loader": { + "type": "load_hf", + "path": "OfirArviv/mt_bench_single_score_gpt4_judgement", + "split": "train" + }, + "preprocess_steps": [ + { + "type": "rename_splits", + "mapper": { + "train": "test" + } + }, + { + "type": "filter_by_condition", + "values": { + "turn": 1 + }, + "condition": "eq" + }, + { + "type": "filter_by_condition", + "values": { + "reference": "[]" + }, + "condition": "eq" + }, + { + "type": "rename_fields", + "field_to_field": { + "model_input": "question", + "score": "rating", + "category": "group", + "model_output": "answer" + } + }, + { + "type": "literal_eval", + "to_field": "question", + "field": "question" + }, + { + "type": "copy_fields", + "field_to_field": { + "question/0": "question" + } + }, + { + "type": "literal_eval", + "to_field": "answer", + "field": "answer" + }, + { + "type": "copy_fields", + "field_to_field": { + "answer/0": "answer" + } + } + ], + "task": "tasks.response_assessment.rating.single_turn", + "templates": [ + "templates.response_assessment.rating.mt_bench_single_turn" + ] +} diff --git a/src/unitxt/catalog/cards/mt_bench/response_assessment/rating/single_turn_with_reference_gpt4_judgement.json b/src/unitxt/catalog/cards/mt_bench/response_assessment/rating/single_turn_with_reference_gpt4_judgement.json new file mode 100644 index 000000000..0a866e999 --- /dev/null +++ b/src/unitxt/catalog/cards/mt_bench/response_assessment/rating/single_turn_with_reference_gpt4_judgement.json @@ -0,0 +1,77 @@ +{ + "type": "task_card", + "loader": { + "type": "load_hf", + "path": "OfirArviv/mt_bench_single_score_gpt4_judgement", + "split": "train" + }, + "preprocess_steps": [ + { + "type": "rename_splits", + "mapper": { + "train": "test" + } + 
}, + { + "type": "filter_by_condition", + "values": { + "turn": 1 + }, + "condition": "eq" + }, + { + "type": "filter_by_condition", + "values": { + "reference": "[]" + }, + "condition": "ne" + }, + { + "type": "rename_fields", + "field_to_field": { + "model_input": "question", + "score": "rating", + "category": "group", + "reference": "reference_answer", + "model_output": "answer" + } + }, + { + "type": "literal_eval", + "to_field": "question", + "field": "question" + }, + { + "type": "copy_fields", + "field_to_field": { + "question/0": "question" + } + }, + { + "type": "literal_eval", + "to_field": "answer", + "field": "answer" + }, + { + "type": "copy_fields", + "field_to_field": { + "answer/0": "answer" + } + }, + { + "type": "literal_eval", + "to_field": "reference_answer", + "field": "reference_answer" + }, + { + "type": "copy_fields", + "field_to_field": { + "reference_answer/0": "reference_answer" + } + } + ], + "task": "tasks.response_assessment.rating.single_turn_with_reference", + "templates": [ + "templates.response_assessment.rating.mt_bench_single_turn_with_reference" + ] +} diff --git a/src/unitxt/catalog/cards/rag/model_response_assessment/llm_as_judge_using_mt_bench_template.json b/src/unitxt/catalog/cards/rag/model_response_assessment/llm_as_judge_using_mt_bench_template.json deleted file mode 100644 index ee69f267f..000000000 --- a/src/unitxt/catalog/cards/rag/model_response_assessment/llm_as_judge_using_mt_bench_template.json +++ /dev/null @@ -1,9 +0,0 @@ -{ - "type": "task_card", - "loader": null, - "preprocess_steps": [], - "task": "tasks.rag.model_response_assessment", - "templates": [ - "templates.rag.model_response_assessment.llm_as_judge_using_mt_bench_template" - ] -} diff --git a/src/unitxt/catalog/formats/llama3_chat.json b/src/unitxt/catalog/formats/llama3_chat.json new file mode 100644 index 000000000..031c4a2d4 --- /dev/null +++ b/src/unitxt/catalog/formats/llama3_chat.json @@ -0,0 +1,5 @@ +{ + "type": "system_format", + "demo_format": "{source}\n\n{target_prefix}{target}\n\n", + "model_input_format": "<|begin_of_text|><|eot_id|><|start_header_id|>user<|end_header_id|>\n{instruction}{demos}{source}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n{target_prefix}" +} diff --git a/src/unitxt/catalog/formats/llama3_chat_with_system_prompt.json b/src/unitxt/catalog/formats/llama3_chat_with_system_prompt.json new file mode 100644 index 000000000..02b312f7d --- /dev/null +++ b/src/unitxt/catalog/formats/llama3_chat_with_system_prompt.json @@ -0,0 +1,5 @@ +{ + "type": "system_format", + "demo_format": "{source}\n\n{target_prefix}{target}\n\n", + "model_input_format": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n{instruction}{demos}{source}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n{target_prefix}" +} diff --git a/src/unitxt/catalog/formats/models/llama3.json b/src/unitxt/catalog/formats/models/llama3.json deleted file mode 100644 index 9d0dc1944..000000000 --- a/src/unitxt/catalog/formats/models/llama3.json +++ /dev/null @@ -1,4 +0,0 @@ -{ - "type": "system_format", - "model_input_format": "<|start_header_id|>system<|end_header_id|>\n{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n{instruction}\\N{source}\\N{target_prefix}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n" -} diff --git a/src/unitxt/catalog/formats/models/phi_3.json b/src/unitxt/catalog/formats/models/phi_3.json new file mode 100644 index 000000000..bc4ecaea2 --- 
/dev/null +++ b/src/unitxt/catalog/formats/models/phi_3.json @@ -0,0 +1,5 @@ +{ + "type": "system_format", + "demo_format": "<|user|>\n{instruction}{source}<|end|>\n<|assistant|>\n{target_prefix}{target}<|end|>\n", + "model_input_format": "<|user|>\n{system_prompt}<|end|>\n{demos}<|user|>\n{instruction}{source}<|end|>\n<|assistant|>\n{target_prefix}" +} diff --git a/src/unitxt/catalog/metrics/llm_as_judge/rating/llama_3_70b_instruct_ibm_genai_template_mt_bench_single_turn.json b/src/unitxt/catalog/metrics/llm_as_judge/rating/llama_3_70b_instruct_ibm_genai_template_mt_bench_single_turn.json new file mode 100644 index 000000000..0a9c933cc --- /dev/null +++ b/src/unitxt/catalog/metrics/llm_as_judge/rating/llama_3_70b_instruct_ibm_genai_template_mt_bench_single_turn.json @@ -0,0 +1,15 @@ +{ + "type": "llm_as_judge", + "inference_model": { + "type": "ibm_gen_ai_inference_engine", + "model_name": "meta-llama/llama-3-70b-instruct", + "parameters": { + "type": "ibm_gen_ai_inference_engine_params", + "max_new_tokens": 252 + } + }, + "template": "templates.response_assessment.rating.mt_bench_single_turn", + "task": "rating.single_turn", + "format": "formats.llama3_chat", + "main_score": "llama_3_70b_instruct_ibm_genai_template_mt_bench_single_turn" +} diff --git a/src/unitxt/catalog/metrics/llm_as_judge/rating/llama_3_8b_instruct_ibm_genai_template_mt_bench_single_turn.json b/src/unitxt/catalog/metrics/llm_as_judge/rating/llama_3_8b_instruct_ibm_genai_template_mt_bench_single_turn.json new file mode 100644 index 000000000..1511fb850 --- /dev/null +++ b/src/unitxt/catalog/metrics/llm_as_judge/rating/llama_3_8b_instruct_ibm_genai_template_mt_bench_single_turn.json @@ -0,0 +1,15 @@ +{ + "type": "llm_as_judge", + "inference_model": { + "type": "ibm_gen_ai_inference_engine", + "model_name": "meta-llama/llama-3-8b-instruct", + "parameters": { + "type": "ibm_gen_ai_inference_engine_params", + "max_new_tokens": 252 + } + }, + "template": "templates.response_assessment.rating.mt_bench_single_turn", + "task": "rating.single_turn", + "format": "formats.llama3_chat", + "main_score": "llama_3_8b_instruct_ibm_genai_template_mt_bench_single_turn" +} diff --git a/src/unitxt/catalog/metrics/llm_as_judge/rating/mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn.json b/src/unitxt/catalog/metrics/llm_as_judge/rating/mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn.json new file mode 100644 index 000000000..019e1f378 --- /dev/null +++ b/src/unitxt/catalog/metrics/llm_as_judge/rating/mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn.json @@ -0,0 +1,13 @@ +{ + "type": "llm_as_judge", + "inference_model": { + "type": "hf_pipeline_based_inference_engine", + "model_name": "mistralai/Mistral-7B-Instruct-v0.2", + "max_new_tokens": 256, + "use_fp16": true + }, + "template": "templates.response_assessment.rating.mt_bench_single_turn", + "task": "rating.single_turn", + "format": "formats.models.mistral.instruction", + "main_score": "mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn" +} diff --git a/src/unitxt/catalog/metrics/rag/model_response_assessment/llm_as_judge_by_flan_t5_large_on_hf_pipeline_using_mt_bench_template.json b/src/unitxt/catalog/metrics/rag/model_response_assessment/llm_as_judge_by_flan_t5_large_on_hf_pipeline_using_mt_bench_template.json deleted file mode 100644 index d335df475..000000000 --- a/src/unitxt/catalog/metrics/rag/model_response_assessment/llm_as_judge_by_flan_t5_large_on_hf_pipeline_using_mt_bench_template.json +++ /dev/null @@ 
-1,10 +0,0 @@ -{ - "type": "llm_as_judge", - "inference_model": { - "type": "hf_pipeline_based_inference_engine", - "model_name": "google/flan-t5-large", - "max_new_tokens": 32 - }, - "recipe": "card=cards.rag.model_response_assessment.llm_as_judge_using_mt_bench_template,template=templates.rag.model_response_assessment.llm_as_judge_using_mt_bench_template,demos_pool_size=0,num_demos=0", - "main_score": "llm_as_judge_by_flan_t5_large_on_hf_pipeline_using_mt_bench_template" -} diff --git a/src/unitxt/catalog/processors/extract_mt_bench_judgment.json b/src/unitxt/catalog/processors/extract_mt_bench_label_judgment.json similarity index 65% rename from src/unitxt/catalog/processors/extract_mt_bench_judgment.json rename to src/unitxt/catalog/processors/extract_mt_bench_label_judgment.json index 24b99709b..0b4492220 100644 --- a/src/unitxt/catalog/processors/extract_mt_bench_judgment.json +++ b/src/unitxt/catalog/processors/extract_mt_bench_label_judgment.json @@ -2,11 +2,11 @@ "type": "sequential_operator", "steps": [ { - "type": "extract_mt_bench_judgment", + "type": "extract_mt_bench_label_judgment", "field": "prediction" }, { - "type": "extract_mt_bench_judgment", + "type": "extract_mt_bench_label_judgment", "field": "references", "process_every_value": true } diff --git a/src/unitxt/catalog/processors/extract_mt_bench_rating_judgment.json b/src/unitxt/catalog/processors/extract_mt_bench_rating_judgment.json new file mode 100644 index 000000000..51a5e94a6 --- /dev/null +++ b/src/unitxt/catalog/processors/extract_mt_bench_rating_judgment.json @@ -0,0 +1,14 @@ +{ + "type": "sequential_operator", + "steps": [ + { + "type": "extract_mt_bench_rating_judgment", + "field": "prediction" + }, + { + "type": "extract_mt_bench_rating_judgment", + "field": "references", + "process_every_value": true + } + ] +} diff --git a/src/unitxt/catalog/tasks/rag/model_response_assessment.json b/src/unitxt/catalog/tasks/rag/model_response_assessment.json deleted file mode 100644 index 0403cfeab..000000000 --- a/src/unitxt/catalog/tasks/rag/model_response_assessment.json +++ /dev/null @@ -1,13 +0,0 @@ -{ - "type": "form_task", - "inputs": [ - "question", - "model_output" - ], - "outputs": [ - "rating_label" - ], - "metrics": [ - "metrics.spearman" - ] -} diff --git a/src/unitxt/catalog/tasks/response_assessment/pairwise_comparison/multi_turn.json b/src/unitxt/catalog/tasks/response_assessment/pairwise_comparison/multi_turn.json new file mode 100644 index 000000000..0d30cdb40 --- /dev/null +++ b/src/unitxt/catalog/tasks/response_assessment/pairwise_comparison/multi_turn.json @@ -0,0 +1,13 @@ +{ + "type": "form_task", + "inputs": { + "dialog_a": "List[Tuple[str, str]]", + "dialog_b": "List[Tuple[str, str]]" + }, + "outputs": { + "winner": "str" + }, + "metrics": [ + "metrics.accuracy" + ] +} diff --git a/src/unitxt/catalog/tasks/response_assessment/pairwise_comparison/multi_turn_with_reference.json b/src/unitxt/catalog/tasks/response_assessment/pairwise_comparison/multi_turn_with_reference.json new file mode 100644 index 000000000..03c366183 --- /dev/null +++ b/src/unitxt/catalog/tasks/response_assessment/pairwise_comparison/multi_turn_with_reference.json @@ -0,0 +1,14 @@ +{ + "type": "form_task", + "inputs": { + "dialog_a": "List[Tuple[str, str]]", + "dialog_b": "List[Tuple[str, str]]", + "reference_dialog": "List[Tuple[str, str]]" + }, + "outputs": { + "winner": "str" + }, + "metrics": [ + "metrics.accuracy" + ] +} diff --git a/src/unitxt/catalog/tasks/response_assessment/pairwise_comparison/single_turn.json 
b/src/unitxt/catalog/tasks/response_assessment/pairwise_comparison/single_turn.json new file mode 100644 index 000000000..da17d84d8 --- /dev/null +++ b/src/unitxt/catalog/tasks/response_assessment/pairwise_comparison/single_turn.json @@ -0,0 +1,14 @@ +{ + "type": "form_task", + "inputs": { + "question": "str", + "answer_a": "str", + "answer_b": "str" + }, + "outputs": { + "winner": "str" + }, + "metrics": [ + "metrics.accuracy" + ] +} diff --git a/src/unitxt/catalog/tasks/response_assessment/pairwise_comparison/single_turn_with_reference.json b/src/unitxt/catalog/tasks/response_assessment/pairwise_comparison/single_turn_with_reference.json new file mode 100644 index 000000000..0194487a2 --- /dev/null +++ b/src/unitxt/catalog/tasks/response_assessment/pairwise_comparison/single_turn_with_reference.json @@ -0,0 +1,15 @@ +{ + "type": "form_task", + "inputs": { + "question": "str", + "answer_a": "str", + "answer_b": "str", + "reference_answer": "str" + }, + "outputs": { + "winner": "str" + }, + "metrics": [ + "metrics.accuracy" + ] +} diff --git a/src/unitxt/catalog/tasks/response_assessment/rating/multi_turn.json b/src/unitxt/catalog/tasks/response_assessment/rating/multi_turn.json new file mode 100644 index 000000000..86201c519 --- /dev/null +++ b/src/unitxt/catalog/tasks/response_assessment/rating/multi_turn.json @@ -0,0 +1,12 @@ +{ + "type": "form_task", + "inputs": { + "dialog": "List[Tuple[str, str]]" + }, + "outputs": { + "rating": "float" + }, + "metrics": [ + "metrics.spearman" + ] +} diff --git a/src/unitxt/catalog/tasks/response_assessment/rating/multi_turn_with_reference.json b/src/unitxt/catalog/tasks/response_assessment/rating/multi_turn_with_reference.json new file mode 100644 index 000000000..7244a3d1c --- /dev/null +++ b/src/unitxt/catalog/tasks/response_assessment/rating/multi_turn_with_reference.json @@ -0,0 +1,13 @@ +{ + "type": "form_task", + "inputs": { + "dialog": "List[Tuple[str, str]]", + "reference_dialog": "List[Tuple[str, str]]" + }, + "outputs": { + "rating": "float" + }, + "metrics": [ + "metrics.spearman" + ] +} diff --git a/src/unitxt/catalog/tasks/response_assessment/rating/single_turn.json b/src/unitxt/catalog/tasks/response_assessment/rating/single_turn.json new file mode 100644 index 000000000..aa9b9df49 --- /dev/null +++ b/src/unitxt/catalog/tasks/response_assessment/rating/single_turn.json @@ -0,0 +1,13 @@ +{ + "type": "form_task", + "inputs": { + "question": "str", + "answer": "str" + }, + "outputs": { + "rating": "float" + }, + "metrics": [ + "metrics.spearman" + ] +} diff --git a/src/unitxt/catalog/tasks/response_assessment/rating/single_turn_with_reference.json b/src/unitxt/catalog/tasks/response_assessment/rating/single_turn_with_reference.json new file mode 100644 index 000000000..e2cef9449 --- /dev/null +++ b/src/unitxt/catalog/tasks/response_assessment/rating/single_turn_with_reference.json @@ -0,0 +1,14 @@ +{ + "type": "form_task", + "inputs": { + "question": "str", + "answer": "str", + "reference_answer": "str" + }, + "outputs": { + "rating": "float" + }, + "metrics": [ + "metrics.spearman" + ] +} diff --git a/src/unitxt/catalog/templates/rag/model_response_assessment/llm_as_judge_using_mt_bench_template.json b/src/unitxt/catalog/templates/rag/model_response_assessment/llm_as_judge_using_mt_bench_template.json deleted file mode 100644 index 123d2e38a..000000000 --- a/src/unitxt/catalog/templates/rag/model_response_assessment/llm_as_judge_using_mt_bench_template.json +++ /dev/null @@ -1,9 +0,0 @@ -{ - "type": "input_output_template", - 
"instruction": "Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"[[5]]\".\n\n", - "input_format": "[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{model_output}\n[The End of Assistant's Answer]", - "output_format": "{rating_label}", - "postprocessors": [ - "processors.extract_mt_bench_judgment" - ] -} diff --git a/src/unitxt/catalog/templates/response_assessment/pairwise_comparison/mt_bench_multi_turn_with_reference_with_shuffle.json b/src/unitxt/catalog/templates/response_assessment/pairwise_comparison/mt_bench_multi_turn_with_reference_with_shuffle.json new file mode 100644 index 000000000..d70575c21 --- /dev/null +++ b/src/unitxt/catalog/templates/response_assessment/pairwise_comparison/mt_bench_multi_turn_with_reference_with_shuffle.json @@ -0,0 +1,41 @@ +{ + "type": "dialog_pairwise_choice_template", + "dialog_fields": [ + { + "type": "dialog_fields_data", + "dialog_field": "reference_dialog", + "assistant_role_label": "### Reference answer:", + "user_role_label": "### User:", + "system_role_label": "### System:" + }, + { + "type": "dialog_fields_data", + "dialog_field": "dialog_a", + "assistant_role_label": "### Assistant A:", + "user_role_label": "### User:", + "system_role_label": "### System:" + }, + { + "type": "dialog_fields_data", + "dialog_field": "dialog_b", + "assistant_role_label": "### Assistant B:", + "user_role_label": "### User:", + "system_role_label": "### System:" + } + ], + "turns_separator": "\n\n", + "label_separator": "\n", + "choice_a_field": "dialog_a", + "choice_b_field": "dialog_b", + "answer_field": "winner", + "choice_a_label": "A", + "choice_b_label": "B", + "choice_tie_label": "C", + "shuffle": true, + "instruction": "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user questions. Your evaluation should consider correctness and helpfulness. You will be given reference answers, the assistant A's answers, the assistant B's answers. Your job is to determine which assistant provides correct and helpful answers to the second user question. Begin your evaluation by comparing both assistants' answers with the reference answers. Identify and correct any mistakes. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. 
After providing your explanation, output your final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\" if assistant B is better, and \"[[C]]\" for a tie.\n\n", + "input_format": "<|The Start of Reference Answer|>\n\n{reference_dialog}\n\n<|The End of Reference Answer|>\n\n\n<|The Start of Assistant A's Conversation with User|>\n\n{dialog_a}\n\n<|The End of Assistant A's Conversation with User|>\n\n\n<|The Start of Assistant B's Conversation with User|>\n\n{dialog_b}\n\n<|The End of Assistant B's Conversation with User|>", + "output_format": "[[{winner}]]", + "postprocessors": [ + "processors.extract_mt_bench_label_judgment" + ] +} diff --git a/src/unitxt/catalog/templates/response_assessment/pairwise_comparison/mt_bench_multi_turn_with_shuffle.json b/src/unitxt/catalog/templates/response_assessment/pairwise_comparison/mt_bench_multi_turn_with_shuffle.json new file mode 100644 index 000000000..f5d7d9603 --- /dev/null +++ b/src/unitxt/catalog/templates/response_assessment/pairwise_comparison/mt_bench_multi_turn_with_shuffle.json @@ -0,0 +1,34 @@ +{ + "type": "dialog_pairwise_choice_template", + "dialog_fields": [ + { + "type": "dialog_fields_data", + "dialog_field": "dialog_a", + "assistant_role_label": "### Assistant A:", + "user_role_label": "### User:", + "system_role_label": "### System:" + }, + { + "type": "dialog_fields_data", + "dialog_field": "dialog_b", + "assistant_role_label": "### Assistant B:", + "user_role_label": "### User:", + "system_role_label": "### System:" + } + ], + "turns_separator": "\n\n", + "label_separator": "\n", + "choice_a_field": "dialog_a", + "choice_b_field": "dialog_b", + "answer_field": "winner", + "choice_a_label": "A", + "choice_b_label": "B", + "choice_tie_label": "C", + "shuffle": true, + "instruction": "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user questions. You should choose the assistant that follows the user's instructions and answers the user's questions better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. You should focus on who provides a better answer to the second user question. Begin your evaluation by comparing the responses of the two assistants and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. 
After providing your explanation, output your final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\" if assistant B is better, and \"[[C]]\" for a tie.\n\n", + "input_format": "<|The Start of Assistant A's Conversation with User|>\n\n{dialog_a}\n\n<|The End of Assistant A's Conversation with User|>\n\n\n<|The Start of Assistant B's Conversation with User|>\n\n{dialog_b}\n\n<|The End of Assistant B's Conversation with User|>", + "output_format": "[[{winner}]]", + "postprocessors": [ + "processors.extract_mt_bench_label_judgment" + ] +} diff --git a/src/unitxt/catalog/templates/response_assessment/pairwise_comparison/mt_bench_single_turn_with_reference_with_shuffle.json b/src/unitxt/catalog/templates/response_assessment/pairwise_comparison/mt_bench_single_turn_with_reference_with_shuffle.json new file mode 100644 index 000000000..8eeadfa72 --- /dev/null +++ b/src/unitxt/catalog/templates/response_assessment/pairwise_comparison/mt_bench_single_turn_with_reference_with_shuffle.json @@ -0,0 +1,16 @@ +{ + "type": "pairwise_choice_template", + "choice_a_field": "answer_a", + "choice_b_field": "answer_b", + "answer_field": "winner", + "choice_a_label": "A", + "choice_b_label": "B", + "choice_tie_label": "C", + "shuffle": true, + "instruction": "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. Your evaluation should consider correctness and helpfulness. You will be given a reference answer, assistant A's answer, and assistant B's answer. Your job is to evaluate which assistant's answer is better. Begin your evaluation by comparing both assistants' answers with the reference answer. Identify and correct any mistakes. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\" if assistant B is better, and \"[[C]]\" for a tie.\n\n", + "input_format": "[User Question]\n{question}\n\n[The Start of Reference Answer]\n{reference_answer}\n[The End of Reference Answer]\n\n[The Start of Assistant A's Answer]\n{answer_a}\n[The End of Assistant A's Answer]\n\n[The Start of Assistant B's Answer]\n{answer_b}\n[The End of Assistant B's Answer]", + "output_format": "[[{winner}]]", + "postprocessors": [ + "processors.extract_mt_bench_label_judgment" + ] +} diff --git a/src/unitxt/catalog/templates/response_assessment/pairwise_comparison/mt_bench_single_turn_with_shuffle.json b/src/unitxt/catalog/templates/response_assessment/pairwise_comparison/mt_bench_single_turn_with_shuffle.json new file mode 100644 index 000000000..b713589e9 --- /dev/null +++ b/src/unitxt/catalog/templates/response_assessment/pairwise_comparison/mt_bench_single_turn_with_shuffle.json @@ -0,0 +1,16 @@ +{ + "type": "pairwise_choice_template", + "choice_a_field": "answer_a", + "choice_b_field": "answer_b", + "answer_field": "winner", + "choice_a_label": "A", + "choice_b_label": "B", + "choice_tie_label": "C", + "shuffle": true, + "instruction": "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. 
You should choose the assistant that follows the user's instructions and answers the user's question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\" if assistant B is better, and \"[[C]]\" for a tie.\n\n", + "input_format": "[User Question]\n{question}\n\n[The Start of Assistant A's Answer]\n{answer_a}\n[The End of Assistant A's Answer]\n\n[The Start of Assistant B's Answer]\n{answer_b}\n[The End of Assistant B's Answer]", + "output_format": "[[{winner}]]", + "postprocessors": [ + "processors.extract_mt_bench_label_judgment" + ] +} diff --git a/src/unitxt/catalog/templates/response_assessment/rating/mt_bench_multi_turn.json b/src/unitxt/catalog/templates/response_assessment/rating/mt_bench_multi_turn.json new file mode 100644 index 000000000..557a80e49 --- /dev/null +++ b/src/unitxt/catalog/templates/response_assessment/rating/mt_bench_multi_turn.json @@ -0,0 +1,20 @@ +{ + "type": "dialog_template", + "dialog_fields": [ + { + "type": "dialog_fields_data", + "dialog_field": "dialog", + "assistant_role_label": "### Assistant A:", + "user_role_label": "### User:", + "system_role_label": "### System:" + } + ], + "turns_separator": "\n\n", + "label_separator": "\n", + "instruction": "Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. You evaluation should focus on the assistant's answer to the second user question. Begin your evaluation by providing a short explanation. Be as objective as possible. 
After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n", + "input_format": "<|The Start of Assistant A's Conversation with User|>\n\n{dialog}\n\n<|The End of Assistant A's Conversation with User|>", + "output_format": "[[{rating}]]", + "postprocessors": [ + "processors.extract_mt_bench_rating_judgment" + ] +} diff --git a/src/unitxt/catalog/templates/response_assessment/rating/mt_bench_multi_turn_with_reference.json b/src/unitxt/catalog/templates/response_assessment/rating/mt_bench_multi_turn_with_reference.json new file mode 100644 index 000000000..5fe2113f6 --- /dev/null +++ b/src/unitxt/catalog/templates/response_assessment/rating/mt_bench_multi_turn_with_reference.json @@ -0,0 +1,27 @@ +{ + "type": "dialog_template", + "dialog_fields": [ + { + "type": "dialog_fields_data", + "dialog_field": "reference_dialog", + "assistant_role_label": "### Reference answer:", + "user_role_label": "### User:", + "system_role_label": "### System:" + }, + { + "type": "dialog_fields_data", + "dialog_field": "dialog", + "assistant_role_label": "### Assistant A:", + "user_role_label": "### User:", + "system_role_label": "### System:" + } + ], + "turns_separator": "\n\n", + "label_separator": "\n", + "instruction": "Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question. Your evaluation should consider correctness and helpfulness. You will be given a reference answer and the assistant's answer. You evaluation should focus on the assistant's answer to the second question. Begin your evaluation by comparing the assistant's answer with the reference answer. Identify and correct any mistakes. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n", + "input_format": "<|The Start of Reference Answer|>\n\n{reference_dialog}\n\n<|The End of Reference Answer|>\n\n\n<|The Start of Assistant A's Conversation with User|>\n\n{dialog}\n\n<|The End of Assistant A's Conversation with User|>", + "output_format": "[[{rating}]]", + "postprocessors": [ + "processors.extract_mt_bench_rating_judgment" + ] +} diff --git a/src/unitxt/catalog/templates/response_assessment/rating/mt_bench_single_turn.json b/src/unitxt/catalog/templates/response_assessment/rating/mt_bench_single_turn.json new file mode 100644 index 000000000..de436705a --- /dev/null +++ b/src/unitxt/catalog/templates/response_assessment/rating/mt_bench_single_turn.json @@ -0,0 +1,9 @@ +{ + "type": "input_output_template", + "instruction": "Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. 
After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n", + "input_format": "[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", + "output_format": "[[{rating}]]", + "postprocessors": [ + "processors.extract_mt_bench_rating_judgment" + ] +} diff --git a/src/unitxt/catalog/templates/response_assessment/rating/mt_bench_single_turn_with_reference.json b/src/unitxt/catalog/templates/response_assessment/rating/mt_bench_single_turn_with_reference.json new file mode 100644 index 000000000..37c5966a8 --- /dev/null +++ b/src/unitxt/catalog/templates/response_assessment/rating/mt_bench_single_turn_with_reference.json @@ -0,0 +1,9 @@ +{ + "type": "input_output_template", + "instruction": "Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider correctness and helpfulness. You will be given a reference answer and the assistant's answer. Begin your evaluation by comparing the assistant's answer with the reference answer. Identify and correct any mistakes. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n", + "input_format": "[Question]\n{question}\n\n[The Start of Reference Answer]\n{reference_answer}\n[The End of Reference Answer]\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", + "output_format": "[[{rating}]]", + "postprocessors": [ + "processors.extract_mt_bench_rating_judgment" + ] +} diff --git a/src/unitxt/inference.py b/src/unitxt/inference.py index f7d1a64d1..8d80681bb 100644 --- a/src/unitxt/inference.py +++ b/src/unitxt/inference.py @@ -1,7 +1,7 @@ import abc import os -from dataclasses import dataclass -from typing import List, Optional, Union +from dataclasses import field +from typing import Any, Dict, List, Literal, Optional, Union from .artifact import Artifact from .operator import PackageRequirementsMixin @@ -28,28 +28,72 @@ def _assert_allow_passing_data_to_remote_api(remote_api_label: str): class HFPipelineBasedInferenceEngine(InferenceEngine, PackageRequirementsMixin): model_name: str max_new_tokens: int + use_fp16: bool = True _requirement = { "transformers": "Install huggingface package using 'pip install --upgrade transformers" } def prepare(self): - from transformers import pipeline + import torch + from transformers import AutoConfig, pipeline - self.model = pipeline(model=self.model_name) + model_args: Dict[str, Any] = ( + {"torch_dtype": torch.float16} if self.use_fp16 else {} + ) + model_args.update({"max_new_tokens": self.max_new_tokens}) + + device = torch.device( + "mps" + if torch.backends.mps.is_available() + else 0 + if torch.cuda.is_available() + else "cpu" + ) + # We do this, because in some cases, using device:auto will offload some weights to the cpu + # (even though the model might *just* fit to a single gpu), even if there is a gpu available, and this will + # cause an error because the data is always on the gpu + if torch.cuda.device_count() > 1: + assert device == torch.device(0) + model_args.update({"device_map": "auto"}) + else: + model_args.update({"device": device}) + + task = ( + "text2text-generation" + if AutoConfig.from_pretrained( + self.model_name, trust_remote_code=True + ).is_encoder_decoder 
+ else "text-generation" + ) + + if task == "text-generation": + model_args.update({"return_full_text": False}) + + self.model = pipeline( + model=self.model_name, trust_remote_code=True, **model_args + ) def infer(self, dataset): - return [ - output["generated_text"] - for output in self.model( - [instance["source"] for instance in dataset], - max_new_tokens=self.max_new_tokens, - ) - ] + outputs = [] + for output in self.model([instance["source"] for instance in dataset]): + if isinstance(output, list): + output = output[0] + outputs.append(output["generated_text"]) + return outputs -@dataclass() -class IbmGenAiInferenceEngineParams: - decoding_method: str = None +class MockInferenceEngine(InferenceEngine): + model_name: str + + def prepare(self): + return + + def infer(self, dataset): + return ["[[10]]" for instance in dataset] + + +class IbmGenAiInferenceEngineParams(Artifact): + decoding_method: Optional[Literal["greedy", "sample"]] = None max_new_tokens: Optional[int] = None min_new_tokens: Optional[int] = None random_seed: Optional[int] = None @@ -64,7 +108,9 @@ class IbmGenAiInferenceEngineParams: class IbmGenAiInferenceEngine(InferenceEngine, PackageRequirementsMixin): label: str = "ibm_genai" model_name: str - parameters: IbmGenAiInferenceEngineParams = IbmGenAiInferenceEngineParams() + parameters: IbmGenAiInferenceEngineParams = field( + default_factory=IbmGenAiInferenceEngineParams + ) _requirement = { "genai": "Install ibm-genai package using 'pip install --upgrade ibm-generative-ai" } @@ -87,7 +133,19 @@ def prepare(self): def infer(self, dataset): from genai.schema import TextGenerationParameters - genai_params = TextGenerationParameters(**self.parameters.__dict__) + genai_params = TextGenerationParameters( + max_new_tokens=self.parameters.max_new_tokens, + min_new_tokens=self.parameters.min_new_tokens, + random_seed=self.parameters.random_seed, + repetition_penalty=self.parameters.repetition_penalty, + stop_sequences=self.parameters.stop_sequences, + temperature=self.parameters.temperature, + top_p=self.parameters.top_p, + top_k=self.parameters.top_k, + typical_p=self.parameters.typical_p, + decoding_method=self.parameters.decoding_method, + ) + return list( self.client.text.generation.create( model_id=self.model_name, @@ -97,8 +155,7 @@ def infer(self, dataset): ) -@dataclass -class OpenAiInferenceEngineParams: +class OpenAiInferenceEngineParams(Artifact): frequency_penalty: Optional[float] = None presence_penalty: Optional[float] = None max_tokens: Optional[int] = None @@ -111,7 +168,9 @@ class OpenAiInferenceEngineParams: class OpenAiInferenceEngine(InferenceEngine, PackageRequirementsMixin): label: str = "openai" model_name: str - parameters: OpenAiInferenceEngineParams = OpenAiInferenceEngineParams() + parameters: OpenAiInferenceEngineParams = field( + default_factory=OpenAiInferenceEngineParams + ) _requirement = { "openai": "Install openai package using 'pip install --upgrade openai" } diff --git a/src/unitxt/llm_as_judge.py b/src/unitxt/llm_as_judge.py index 6fa14acdc..9bc5d8677 100644 --- a/src/unitxt/llm_as_judge.py +++ b/src/unitxt/llm_as_judge.py @@ -1,58 +1,138 @@ -from typing import Any, Dict, List +from typing import Any, Dict, List, Literal, Optional -import evaluate - -from .api import produce -from .inference import InferenceEngine +from .api import evaluate, produce +from .inference import InferenceEngine, OpenAiInferenceEngine from .metrics import BulkInstanceMetric +from .operator import SequentialOperator class LLMAsJudge(BulkInstanceMetric): """LLM as 
judge based metric class for evaluating correctness. Attributes: - main_score (str): The main score used for evaluation. + main_score (str): The main score label used for evaluation. + task (Literal["rating.single_turn", "rating.single_turn_with_reference"]): The type of task the llm-as-judge runs. This defines the output and input + format of the judge model. + template (str): The template used when generating inputs for the judge llm. + format (str): The format used when generating inputs for the judge llm. + system_prompt (str): The system prompt used when generating inputs for the judge llm. + strip_system_prompt_and_format_from_inputs (bool): Whether to strip the system prompt and formatting from the + inputs that the model being judged received, when they are inserted into the llm-as-judge prompt. + inference_model (InferenceEngine): The inference module used to run the judge llm. reduction_map (dict): A dictionary specifying the reduction method for the metric. - betch_size (int): The size of the bulk. - recipe (str): The unitxt recipe that will be used to create the judge dataset. - inference (InferenceEngine): the module that creates the inference. - - Methods: - prepare(self): Initialization method for the metric. - compute(self, references, predictions, additional_inputs): Method to compute the metric. - - Usage: - metric = LlamaIndexCorrectnessMetric() - scores = metric.compute(references, prediction, additional_inputs) + batch_size (int): The size of the bulk. """ main_score: str = "llm_as_judge" - reduction_map: Dict[str, List[str]] = None - batch_size: int = 32 - recipe: str + task: Literal["rating.single_turn", "rating.single_turn_with_reference"] + template: str + format: Optional[str] = None + system_prompt: Optional[str] = None + strip_system_prompt_and_format_from_inputs: bool = True inference_model: InferenceEngine + reduction_map: Optional[Dict[str, List[str]]] = None + batch_size: int = 32 + + def _get_input_instances(self, task_data: List[Dict]) -> List: + if self.strip_system_prompt_and_format_from_inputs: + instances = [] + for task_data_instance in task_data: + template = task_data_instance["metadata"]["template"] + instance = SequentialOperator( + steps=[template, "formats.empty"] + ).process_instance( + {"inputs": task_data_instance, "outputs": task_data_instance} + ) + instances.append(instance["source"]) + """ + We also have access to: instance["target"] + instance["references"] + """ + return instances + return [t["source"] for t in task_data] + + def _get_instance_for_judge_model( + self, input_instances: List[str], predictions: List, references: List + ) -> List[Dict]: + if self.task == "rating.single_turn": + instances = [ + { + "question": input_instance, + "answer": prediction, + "rating": 5.0, # This is a dummy value that is not used in practice + } + for input_instance, prediction, reference in zip( + input_instances, predictions, references + ) + ] + elif self.task == "rating.single_turn_with_reference": + instances = [ + { + "question": input_instance, + "answer": prediction, + "reference_answer": reference, + "rating": 5.0, # This is a dummy value that is not used in practice + } + for input_instance, prediction, reference in zip( + input_instances, predictions, references + ) + ] + else: + raise NotImplementedError( + f"Error in 'LLMAsJudge' metric. {self.task} is not a supported task type."
+ ) + return instances def prepare(self): super().prepare() if self.reduction_map is None: self.reduction_map = {"mean": [self.main_score]} + supported_tasks = ["rating.single_turn", "rating.single_turn_with_reference"] + assert self.task in supported_tasks, ( + f"Error in 'LLMAsJudge' metric. {self.task} is not a supported task type. " + f"The supported task types are: {', '.join(supported_tasks)}." + ) + + if isinstance(self.inference_model, OpenAiInferenceEngine): + if self.format: + raise ValueError( + "Error in 'LLMAsJudge' metric. Inference model 'OpenAiInferenceEngine' does " + "not support formatting. Please remove the format definition from the recipe" + " (the OpenAI Chat API takes care of the formatting automatically)." + ) + if self.system_prompt: + raise ValueError( + "Error in 'LLMAsJudge' metric. Inference model 'OpenAiInferenceEngine' does " + "not support a system prompt. Please remove the system_prompt definition from the recipe" + " (Current implementation of Unitxt does not support this." + " Support will be added in future updates)." + ) + def compute( self, references: List[List[Any]], predictions: List[Any], task_data: List[Dict], ) -> List[Dict[str, Any]]: - instances = [ - { - **task_data_instance, - **{"model_output": prediction, "rating_label": "[[5]]"}, - } - for task_data_instance, prediction in zip(task_data, predictions) - ] - - dataset = produce(instances, self.recipe) + input_instances = self._get_input_instances(task_data) + instances = self._get_instance_for_judge_model( + input_instances, predictions, references + ) + + card = f"cards.dynamic_cards_for_llm_judges.{self.task}" + recipe = ( + f"card={card}," + f"template={self.template}," + "demos_pool_size=0," + "num_demos=0" + ) + if self.system_prompt: + recipe = f"{recipe},system_prompt={self.system_prompt}" + if self.format: + recipe = f"{recipe},format={self.format}" + + dataset = produce(instances, recipe) verdicts = self.inference_model.infer(dataset) - meta_metric = evaluate.load("unitxt/metric") - meta_scores = meta_metric.compute(predictions=verdicts, references=dataset) + meta_scores = evaluate(predictions=verdicts, data=dataset) return [{self.main_score: instance["prediction"]} for instance in meta_scores] diff --git a/src/unitxt/metric_utils.py b/src/unitxt/metric_utils.py index 20068fa27..a7f142811 100644 --- a/src/unitxt/metric_utils.py +++ b/src/unitxt/metric_utils.py @@ -15,6 +15,7 @@ from .operators import ( ApplyMetric, ApplyOperatorsField, + CopyFields, FlattenInstances, MergeStreams, SplitByNestedGroup, @@ -154,6 +155,11 @@ def prepare(self): self.steps = [ FromPredictionsAndOriginalData(), LoadJson(field="task_data"), + CopyFields( + field_to_field={ + "source": "task_data/source", + } + ), ApplyOperatorsField( operators_field="postprocessors", ), diff --git a/src/unitxt/operators.py b/src/unitxt/operators.py index b6c503f3a..636b00848 100644 --- a/src/unitxt/operators.py +++ b/src/unitxt/operators.py @@ -860,6 +860,51 @@ def process( return instance +class InterleaveListsToDialogOperator(StreamInstanceOperator): + """Interleaves two lists, one of user dialog turns and one of assistant dialog turns, into a single list of tuples, alternating between "user" and "assistant". + + The list of tuples is of the format (role, turn_content), where the role label is specified by + the 'user_role_label' and 'assistant_role_label' fields (default to "user" and "assistant"). + + The user turns and assistant turns fields are specified in the arguments. 
+ The value of each of the 'fields' is assumed to be a list. + + """ + + user_turns_field: str + assistant_turns_field: str + user_role_label: str = "user" + assistant_role_label: str = "assistant" + to_field: str + + def process( + self, instance: Dict[str, Any], stream_name: Optional[str] = None + ) -> Dict[str, Any]: + user_turns = instance[self.user_turns_field] + assistant_turns = instance[self.assistant_turns_field] + + assert ( + len(user_turns) == len(assistant_turns) + or (len(user_turns) - len(assistant_turns) == 1) + ), "user_turns must have either the same length as assistant_turns or one more turn." + + interleaved_dialog = [] + i, j = 0, 0 # Indices for the user and assistant lists + # While either list has elements left, continue interleaving + while i < len(user_turns) or j < len(assistant_turns): + if i < len(user_turns): + interleaved_dialog.append((self.user_role_label, user_turns[i])) + i += 1 + if j < len(assistant_turns): + interleaved_dialog.append( + (self.assistant_role_label, assistant_turns[j]) + ) + j += 1 + + instance[self.to_field] = interleaved_dialog + return instance + + class IndexOf(StreamInstanceOperator): """For a given instance, finds the offset of value of field 'index_of', within the value of field 'search_in'.""" diff --git a/src/unitxt/processors.py b/src/unitxt/processors.py index caa691d30..d6d8c1506 100644 --- a/src/unitxt/processors.py +++ b/src/unitxt/processors.py @@ -218,7 +218,7 @@ def process_value(self, text: Any) -> Any: return text -class ExtractMtBenchJudgment(FieldOperator): +class ExtractMtBenchRatingJudgment(FieldOperator): def process_value(self, text: Any) -> Any: match = re.search(r"\[\[([\d]+\.?[\d]*)\]\]", text) try: @@ -227,6 +227,15 @@ def process_value(self, text: Any) -> Any: return 0.0 +class ExtractMtBenchLabelJudgment(FieldOperator): + def process_value(self, text: Any) -> Any: + match = re.search(r"\[\[([^\]]+)\]\]", text) + try: + return str(match.group(1)) + except: + return "None" + + class LiteralEval(FieldOperator): def process_value(self, text: Any) -> Any: if text is not None and not isinstance(text, str): diff --git a/src/unitxt/schema.py b/src/unitxt/schema.py index 227f547c4..38e16b206 100644 --- a/src/unitxt/schema.py +++ b/src/unitxt/schema.py @@ -34,16 +34,24 @@ class ToUnitxtGroup(StreamInstanceOperatorValidator): postprocessors: List[str] = field(default_factory=lambda: ["to_string_stripped"]) remove_unnecessary_fields: bool = True - def _to_lists_of_keys_and_values(self, dict: Dict[str, str]): - return { - "key": [key for key, _ in dict.items()], - "value": [str(value) for _, value in dict.items()], - } + @staticmethod + def artifact_to_jsonable(artifact): + if artifact.__id__ is None: + return artifact.to_dict() + return artifact.__id__ def process( self, instance: Dict[str, Any], stream_name: Optional[str] = None ) -> Dict[str, Any]: - task_data = {**instance["inputs"], **instance["outputs"]} + task_data = { + **instance["inputs"], + **instance["outputs"], + "metadata": { + "template": self.artifact_to_jsonable( + instance["recipe_metadata"]["template"] + ) + }, + } instance["task_data"] = json.dumps(task_data) if self.remove_unnecessary_fields: diff --git a/src/unitxt/standard.py b/src/unitxt/standard.py index 5f7cfba4d..b3f471a75 100644 --- a/src/unitxt/standard.py +++ b/src/unitxt/standard.py @@ -135,6 +135,7 @@ def set_pipelines(self): self.metadata, self.standardization, self.processing, + self.metadata, self.verblization, self.finalize, ] @@ -144,6 +145,7 @@ def set_pipelines(self): 
self.inference_instance.steps = [ self.metadata, self.processing, + self.metadata, ] self.inference_demos = SourceSequentialOperator() @@ -153,6 +155,7 @@ def set_pipelines(self): self.metadata, self.standardization, self.processing, + self.metadata, ] self.inference = SequentialOperator() diff --git a/src/unitxt/templates.py b/src/unitxt/templates.py index bd9e8e7e4..0ecb1270c 100644 --- a/src/unitxt/templates.py +++ b/src/unitxt/templates.py @@ -1,7 +1,9 @@ import json from abc import abstractmethod +from random import random from typing import Any, Dict, List, Optional, Tuple, Union +from .artifact import Artifact from .collections import ListCollection from .dataclass import NonPositionalField from .operator import StreamInstanceOperator @@ -48,6 +50,11 @@ def inputs_to_instruction_and_target_prefix(self, inputs): ) return instruction, target_prefix + def preprocess_inputs_and_outputs( + self, inputs: Dict[str, Any], outputs: Dict[str, Any] + ) -> Tuple[Dict[str, Any], Dict[str, Any]]: + return inputs, outputs + def process( self, instance: Dict[str, Any], stream_name: Optional[str] = None ) -> Dict[str, Any]: @@ -61,9 +68,9 @@ def process( inputs = instance.get("inputs") outputs = instance.get("outputs") + inputs, outputs = self.preprocess_inputs_and_outputs(inputs, outputs) self.set_titles(inputs) - source = self.inputs_to_source(inputs) instruction, target_prefix = self.inputs_to_instruction_and_target_prefix( inputs @@ -150,6 +157,135 @@ def outputs_to_target_and_references(self, outputs: Dict[str, object]) -> str: return target, [reference] +class PairwiseChoiceTemplate(InputOutputTemplate): + """PairwiseChoiceTemplate. + + Requirements: + The answer field value should be of type Literal["choice_a", "choice_b", "tie"] + + Args: + choice_a_field (str): The field which contains choice_a value + choice_b_field (str): The field which contains choice_b value + answer_field (str): The field which contains the answer value. + Should be of type Literal["choice_a", "choice_b", "tie"] + choice_a_label (str): The label of choice A answer as it is verbalized in the template. + choice_b_label (str): The label of choice B answer as it is verbalized in the template. + choice_tie_label (str): The label of a tie answer as it should be verbalized in the template. + shuffle (bool): Whether to shuffle the choices or not. This is done to mitigate position bias. + + shuffle: 50% of the time: + 1) The values of choice_a_field and choice_b_field will be swapped. + 2) If the value of answer_field is choice_a_label, set it to choice_b_label. + Else if the value of answer_field is choice_b_label, set it to choice_a_label. + Else if the value of answer_field is choice_tie_label, do nothing. 
+ + """ + + choice_a_field: str + choice_b_field: str + answer_field: str + choice_a_label: str + choice_b_label: str + choice_tie_label: str + shuffle: bool + + def verbalize_answer_field(self, outputs: Dict[str, object]): + answer = outputs[self.answer_field] + assert answer in ["choice_a", "choice_b", "tie"] + if answer == "choice_a": + outputs[self.answer_field] = self.choice_a_label + elif answer == "choice_b": + outputs[self.answer_field] = self.choice_b_label + else: + outputs[self.answer_field] = self.choice_tie_label + + return outputs + + def shuffle_values(self, inputs: Dict[str, object], outputs: Dict[str, object]): + outcome = random() # A float between 0 and 1 + if outcome <= 0.5: + choice_a_value = inputs[self.choice_a_field] + choice_b_value = inputs[self.choice_b_field] + + inputs[self.choice_a_field] = choice_b_value + inputs[self.choice_b_field] = choice_a_value + + answer = outputs[self.answer_field] + assert answer in [ + self.choice_a_label, + self.choice_b_label, + self.choice_tie_label, + ] + if answer == self.choice_a_label: + outputs[self.answer_field] = self.choice_b_label + elif answer == self.choice_b_label: + outputs[self.answer_field] = self.choice_a_label + + return inputs, outputs + + def preprocess_inputs_and_outputs( + self, inputs: Dict[str, Any], outputs: Dict[str, Any] + ) -> Tuple[Dict[str, Any], Dict[str, Any]]: + outputs = self.verbalize_answer_field(outputs) + inputs, outputs = self.shuffle_values(inputs, outputs) + return inputs, outputs + + +class DialogFieldsData(Artifact): + user_role_label: str + assistant_role_label: str + system_role_label: str + dialog_field: str + + +class DialogTemplate(InputOutputTemplate): + dialog_fields: List[DialogFieldsData] + turns_separator: str = "\n\n" + label_separator: str = " " + + def process_dialog(self, inputs: Dict[str, object]): + for dialog_fields in self.dialog_fields: + dialog = inputs[dialog_fields.dialog_field] + # TODO: update isoftype method to support Literal verification and check + # it's List[Tuple[Literal["user", "assistant", "system"], str]] (Issue #799) + assert isoftype(dialog, List[Tuple[str, str]]) + + user_role_label = dialog_fields.user_role_label + assistant_role_label = dialog_fields.assistant_role_label + system_role_label = dialog_fields.system_role_label + + dialog_str = "" + for i, turn in enumerate(dialog): + (turn_type, turn_text) = turn + turns_separator = "" if i == 0 else self.turns_separator + if turn_type == "user": + dialog_str += f"{turns_separator}{user_role_label}{self.label_separator}{turn_text}" + elif turn_type == "assistant": + dialog_str += f"{turns_separator}{assistant_role_label}{self.label_separator}{turn_text}" + elif turn_type == "system": + dialog_str += f"{turns_separator}{system_role_label}{self.label_separator}{turn_text}" + + inputs[dialog_fields.dialog_field] = dialog_str + return inputs + + def preprocess_inputs_and_outputs( + self, inputs: Dict[str, Any], outputs: Dict[str, Any] + ) -> Tuple[Dict[str, Any], Dict[str, Any]]: + return self.process_dialog(inputs), outputs + + +class DialogPairwiseChoiceTemplate(DialogTemplate, PairwiseChoiceTemplate): + def preprocess_inputs_and_outputs( + self, inputs: Dict[str, Any], outputs: Dict[str, Any] + ) -> Tuple[Dict[str, Any], Dict[str, Any]]: + inputs, outputs = DialogTemplate.preprocess_inputs_and_outputs( + self, inputs, outputs + ) + return PairwiseChoiceTemplate.preprocess_inputs_and_outputs( + self, inputs, outputs + ) + + class MultipleChoiceTemplate(Template): """Formats the input (that specifies the 
question), the multiple choices to select the answer from, and specifies the field with the correct answer.""" diff --git a/tests/catalog/test_preparation.py b/tests/catalog/test_preparation.py index 0b874c56b..4edb4d6e4 100644 --- a/tests/catalog/test_preparation.py +++ b/tests/catalog/test_preparation.py @@ -4,6 +4,7 @@ import time from datetime import timedelta +from huggingface_hub.utils import GatedRepoError from unitxt.loaders import MissingKaggleCredentialsError from unitxt.logging_utils import get_logger from unitxt.settings_utils import get_constants @@ -52,9 +53,14 @@ def test_preparations(self): with self.subTest(file=file): try: import_module_from_file(file) - except MissingKaggleCredentialsError as e: + except (MissingKaggleCredentialsError, GatedRepoError) as e: logger.info(f"Skipping file {file} due to ignored error {e}") continue + except OSError as e: + if "You are trying to access a gated repo" in str(e): + logger.info(f"Skipping file {file} due to ignored error {e}") + continue + raise logger.info(f"Testing preparation file: {file} passed") self.assertTrue(True) diff --git a/tests/library/test_api.py b/tests/library/test_api.py index 0415032e6..f307f4e60 100644 --- a/tests/library/test_api.py +++ b/tests/library/test_api.py @@ -13,7 +13,13 @@ def test_load_dataset(self): "source": "Given this sentence: 'A plane is taking off.', on a scale of 1.0 to 5.0, what is the similarity to this text 'An air plane is taking off.'?\n", "target": "5.0", "references": ["5.0"], - "task_data": '{"text1": "A plane is taking off.", "text2": "An air plane is taking off.", "attribute_name": "similarity", "min_value": 1.0, "max_value": 5.0, "attribute_value": 5.0}', + "task_data": '{"text1": "A plane is taking off.", ' + '"text2": "An air plane is taking off.", ' + '"attribute_name": "similarity", ' + '"min_value": 1.0, ' + '"max_value": 5.0, ' + '"attribute_value": 5.0, ' + '"metadata": {"template": "templates.regression.two_texts.simple"}}', "group": "unitxt", "postprocessors": [ "processors.take_first_non_empty_line", @@ -62,7 +68,14 @@ def test_produce_with_recipe(self): "source": "Given a premise and hypothesis classify the entailment of the hypothesis to one of entailment, not entailment.premise: Steve follows Fred's example in everything. He influences him hugely., hypothesis: Steve influences him hugely.\nThe entailment class is entailment\n\npremise: The police arrested all of the gang members. 
They were trying to stop the drug trade in the neighborhood., hypothesis: The police were trying to stop the drug trade in the neighborhood.\nThe entailment class is not entailment\n\npremise: It works perfectly, hypothesis: It works!\nThe entailment class is ", "target": "?", "references": ["?"], - "task_data": '{"text_a": "It works perfectly", "text_a_type": "premise", "text_b": "It works!", "text_b_type": "hypothesis", "classes": ["entailment", "not entailment"], "type_of_relation": "entailment", "label": "?"}', + "task_data": '{"text_a": "It works perfectly", ' + '"text_a_type": "premise", ' + '"text_b": "It works!", ' + '"text_b_type": "hypothesis", ' + '"classes": ["entailment", "not entailment"], ' + '"type_of_relation": "entailment", ' + '"label": "?", ' + '"metadata": {"template": "templates.classification.multi_class.relation.default"}}', "group": "unitxt", "postprocessors": [ "processors.take_first_non_empty_line", @@ -93,7 +106,14 @@ def test_produce_with_recipe_with_list_of_instances(self): "source": "Given a premise and hypothesis classify the entailment of the hypothesis to one of entailment, not entailment.premise: Steve follows Fred's example in everything. He influences him hugely., hypothesis: Steve influences him hugely.\nThe entailment class is entailment\n\npremise: The police arrested all of the gang members. They were trying to stop the drug trade in the neighborhood., hypothesis: The police were trying to stop the drug trade in the neighborhood.\nThe entailment class is not entailment\n\npremise: It works perfectly, hypothesis: It works!\nThe entailment class is ", "target": "?", "references": ["?"], - "task_data": '{"text_a": "It works perfectly", "text_a_type": "premise", "text_b": "It works!", "text_b_type": "hypothesis", "classes": ["entailment", "not entailment"], "type_of_relation": "entailment", "label": "?"}', + "task_data": '{"text_a": "It works perfectly", ' + '"text_a_type": "premise", ' + '"text_b": "It works!", ' + '"text_b_type": "hypothesis", ' + '"classes": ["entailment", "not entailment"], ' + '"type_of_relation": "entailment", ' + '"label": "?", ' + '"metadata": {"template": "templates.classification.multi_class.relation.default"}}', "group": "unitxt", "postprocessors": [ "processors.take_first_non_empty_line", diff --git a/tests/library/test_metrics.py b/tests/library/test_metrics.py index a23d68642..320222925 100644 --- a/tests/library/test_metrics.py +++ b/tests/library/test_metrics.py @@ -1,6 +1,6 @@ from math import isnan -from unitxt.inference import HFPipelineBasedInferenceEngine +from unitxt.inference import MockInferenceEngine from unitxt.llm_as_judge import LLMAsJudge from unitxt.logging_utils import get_logger from unitxt.metrics import ( @@ -1297,70 +1297,37 @@ def _test_grouped_instance_confidence_interval( ) def test_llm_as_judge_metric(self): - inference_model = HFPipelineBasedInferenceEngine( - model_name="google/flan-t5-small", max_new_tokens=32 - ) - recipe = ( - "card=cards.rag.model_response_assessment.llm_as_judge_using_mt_bench_template," - "template=templates.rag.model_response_assessment.llm_as_judge_using_mt_bench_template," - "demos_pool_size=0," - "num_demos=0" - ) - - metric = LLMAsJudge(inference_model=inference_model, recipe=recipe) - - predictions = [ - "Meditation can help you remember things in a more meaningful way. 
It can also help you remember things that you didn't know.", - "Remove the fan from the fan and wipe it down with a damp cloth.", - "Place a small amount of rubbing alcohol on the bag and rub it over the smell.", - "Place the tank in the ground and place the guppy tank in the ground.", - "Use a hair dryer to remove the chemical burns.", - ] - references = [ - [ - "Meditation has been scientifically proven to increase focus and memory. You don't have to use any one meditation to help your memory. Using any meditation, such as mindfulness meditation, teaches you to focus your mind. When you're able to focus better, you're also better able to solidify concepts in your short-term memory. Therefore, practicing meditation can help you to develop your short-term memory.\n1. **Start today.** You may be surprised that you don't need to practice meditation for that long to start seeing the effects. One scientific study examined how a group of students responded to meditation. With just two weeks of meditation practice (10 minutes a day, plus 4 45-minute classes a week), the students significantly improved their GRE scores (a standardized test given to students trying to get into graduate school).\nIn fact, some studies show as little as four days of meditation can improve your attention span and memory.\n2. **Practice often.** Practicing every day is ideal. Doing so will help you work to increase your memory. In fact, spreading it out throughout the day can be helpful, such as meditating for 10 minutes in the morning, 10 minutes at lunch, and 10 minutes in the evening. However, if you find you can't practice every day, do it as often as you can.\n3. **Cultivate mindfulness.** Mindfulness is a part of meditation, but it's also something you can incorporate in your day-to-day life. Mindfulness, at its most basic, just means paying attention. In other words, place yourself in the moment, rather then letting your mind race elsewhere.\n\nFor instance, when you're in the shower, stop yourself from thinking about the day ahead. Instead, focus on what the shower feels like. Feel the heat of the water on your skin, how the soap feels against your body. Pay attention to the pleasant scent of your soap or shampoo. Let yourself really feel each sensation.\nYou can practice this technique anywhere. For instance, while you're washing dishes, take a moment to really focus on what you're doing. Let yourself feel the warm water on your skin, the weight of a plate in your hands. Put your full attention on getting the plate clean, making sure it's spotless.\n4. **Work your way up.** You may want to jump in with an hour-long meditation every day. However, most people can't sustain that kind of practice when they haven't meditated before. It's best to start small and work up to more time. You can start with as little as three minutes a day.\n5. **Pick a place to meditate.** Really, you can meditate anywhere, but it's good to choose a place that's not distracting, particularly when you're first starting out. Turn off the television, and move away from distractions. You can even set up a little meditation center in a corner of your house, with a candle and something you like to focus on.\n6. **Sit properly.** You can sit in a chair or on the floor. It's up to you. However, make sure you are relatively comfortable. You don't want a lot of pressure on one part of your body, for instance. Try to sit up straight, though not so much that it feels like a strain.\n7. 
**Get settled.** Spend a few minutes just bringing yourself into the right state of mind. Focus on the candle, if that helps. You don't have to be completely focused, but as you feel your mind wander, bring it back to the center, to the moment.\n8. **Focus on your breathing.** Once you've situated yourself, try paying attention to just your breathing. Focus on it going in and out. You don't have to change it up. Rather, just keep your attention on it, focusing all of yourself on breathing in and out. As your mind wanders, bring it back to your breath.\n9. **Keep bringing yourself back.** The longer you sit, the more likely your mind is to wander. That's okay. It's normal in fact. The important thing is to acknowledge that you've wandered and move back to your focus. Try labeling it when your mind wanders, such as saying \"thinking\" in your head, and then refocusing on your breath.\n10. **Try deep breathing.** One simple way to get started with meditation is to try deep breathing. Start by placing a hand on your chest and a hand on your stomach. When you breathe, you should notice your stomach expanding more than your chest, as you are trying to breathe as deeply as possible. It can help to close your eyes. Breathe in slowly through your nose. Hold the breath to the count of seven, then let it slowly out through your mouth to the count of eight (in your head).\n\nTry taking five deep breaths each time you try this practice.\nMake sure you are blowing out fully.\n11. **Consider taking a class.** While classes aren't for everyone, a class can jump start your meditation practice, making it easier for you to make it an everyday practice. Plus, if you have no idea where to begin, a class will help you figure out a good starting point.\n\nLook for meditation centers in your area. Some yoga studios offer meditation classes as well. Also, Buddhist temples or centers in your area will likely offer classes on meditation.\nYou may also find meditation classes through your library or your local parks and recreation department, and some churches offer meditation classes, particularly ones that embrace other traditions, such as the Unitarian Universalists.\n12. **Don't let distraction make you anxious.** Everyone gets distracted when they meditate. When you're first starting out, that may make you anxious or angry at yourself. However, rather than becoming angry, just try to be aware of when your thoughts are drifting, and pull them back to the meditation.\n13. **Realize even a little meditation can help.** That is, you may think you have to meditate every single day at a certain time for it to be helpful. However, if you fall into that thinking, you may find yourself giving up because you miss a few days. Keep in mind that even a little meditation can help improve your memory. Therefore, try to meditate when you can, even if you don't find time to do it every day.\n14. **Try a guided meditation.** If you don't want to take a class, you can still benefit from the wisdom of others. Try doing a guided meditation. You can find many online, or you can download free apps. The person on the other end will walk you through a meditation process, helping you to learn how to do it.\n15. **Change it up.** You don't have to meditate the same way every time. For instance, some people find a walking meditation helpful. Take a ten-minute walk, focusing on different sensations in turn. Start with feeling your body walking, really focusing on what the movements feel like. Move on to the feeling of breathing. 
After that, focus on what the air feels like on your skin, then try thinking about just what you see and then just what you hear." - ], - [ - "One of the most neglected spaces when it comes to cleaning a bathroom is the fan. Having a clean, functional fan can lessen bathroom odors, as well as combat mold and mildew growth. These issues can become a health hazard if left unattended for too long. By cleaning your fan around every 6 months, you will be able to remove built up dirt before it becomes a problem.\n1. **Turn off the power.** Before you do anything else, ensure that the fan is turned off and cannot turn back on until you are finished cleaning it. Most models will have a plug that is located directly behind the cover. You could remove the cover first and unplug the fan, but just to be safe, go and temporarily pull the breaker for your bathroom. The fan is now safe to work on.\n2. **Remove the cover.** Dust will fall when the cover is removed. To avoid the dust, position your stepladder such that you can reach the cover, but are not standing directly below it. Most covers will have 2 prongs on opposite sides holding it in place, others just need to be unscrewed. Remove the cover by pressing these prongs in or removing the screws, then set the cover aside.\n3. **Remove the fan.** Unscrew the assembly that is holding the fan in place, then very gently remove the fan. Be careful not to drop the fan or hit it on the side of the exhaust pipe as that could potentially chip the fan blades. Broken fan blades will cause the fan to be louder and less effective.\n4. **Clean the cover and fan.** Start by vacuuming off the majority of the built up grime on both the cover and the fan. Then dip a rag, preferably a microfiber cloth, in soapy water and use it to wipe up the remaining dust. Be as thorough as you can, you will probably not do this again for a while.\nYou can let the cover soak in a tub of hot soapy water, but the fan should be wiped by hand to avoid getting water on the motor assembly or plug.\n5. **Vacuum the exhaust pipe.** Use a crevice or brush attachment and vacuum off the inside of the exhaust pipe. If you can reach, also use your rag or cloth to wipe off what the vacuum could not get.\n6. **Vacuum the external exhaust port.** This can be done later once the entire process is finished, but at some point you should go outside and find the exterior vent for your bathroom fan. Depending on where the bathroom is located, this vent will either be on the roof or the side of your house. Bring a damp rag to wipe off any dirt that has built up on the other end of your exhaust pipe.\n7. **Wipe and vacuum the fan housing.** If your fan had an accessible plug, be careful not to get any water inside the outlet. Doing so could result in electrocution or short circuit the fan when you plug it back in. Therefore, use a dry rag to wipe off the fan housing, then vacuum up any remaining dust or debris.\n8. **Put the fan back in place.** Before reinstalling the fan, make sure that you cleaned off all the dust from in between each of the blades and dried it thoroughly. Carefully reinsert it into the exhaust pipe and screw the bracing back into place. Use your fingers and spin the fan around a few rotations to make sure that it is not rubbing against anything.\n9. **Turn the power back on.** Plug the fan back into the outlet and reset the breaker for your bathroom. The fan is now dangerous again, so do not touch it or continue to clean it after this point.\n10. 
**Reinstall the cover.** Once the cover has dried, either screw it back in or bend the prongs until the cover snaps back into place.\n11. **Test the fan.** Turn the fan on again to make sure everything works as normal. The fan should be quieter than it was before and provide a higher amount of air flow." - ], - [ - "Musty, stinky, odorous old leather bags aren't much fun and it's probable you're not keen to reuse such a bag. Before you resort to throwing it out, there are various ways that might just restore it to a respectable odor again.\n1. **Try a simple clean first.** If this clean doesn't shift the odor, you can try one of the other suggested methods after.\n\nWipe the leather bag inside and out with a clean, dry, soft cloth. This will pick up dust, loose debris and even some mold or mildew.\nWipe the leather bag down with a damp cloth. This will collect even more of the above items.\n2. **Allow the bag to air out.** Choose somewhere outdoors that is sheltered from direct light and heat, such as a table on the porch. Leave for a day if possible.\n3. **Check the odor.** If the bag still smells bad, choose one of the remaining suggested methods, or a combination of the methods.\n4. **Prepare a solution consisting of equal parts of white vinegar and distilled water.** Sponge the bag with the solution. Work on the inside of the bag and any mildewed outside part of the bag for a few minutes.\nIt's a good idea to test a small spot before trying this method, in case it stains.\n5. **Wipe off the vinegar solution with a clean, damp cloth.** \n6. **Allow to air dry.** Place the bag outside under shelter away from direct light to air dry.\n7. **Check the odor.** If it is still there, repeat. If not, the bag can be used again.\n8. **Use liquid detergent soap to clean the bag.** \n9. **Make a solution of soapy water, using the liquid detergent.** Dip the cleaning cloth or sponge in the solution and wring out before using.\n10. **Wipe the cloth over and inside the bag.** Concentrate in particular on the areas that you think are the smelliest.\n11. **Allow to air dry.** Place outside in a sheltered area away from direct sunlight and heat.\n12. **Once dry, check for the odor.** If it lingers, try again.\n13. **Use baking soda to deodorize the bag.** \n14. **Fill a clean sock with baking soda.** Tie off with a knot.\n15. **Place the leather bag and the baking soda-filled sock inside a large resealable plastic bag.** Alternatively, place both items inside an airtight container.\n16. **Set aside.** Let the baking soda work on the bag for at least 24 hours. The odors from the bag should transfer across to the baking soda.\n17. **Remove from the resealable bag or container.** Check the odor of the leather bag; if it still smells bad, repeat the process for another 24 hours, or longer. If it smells good again, throw away the baking soda, wash the sock and use the leather bag again.\n18. **Find some newspaper.** Scrunch the pages up and stuff them inside a large plastic bag, such as a kitchen waste bag or a garbage bag.\n19. **Slide the smelly leather bag in with the newspapers.** Arrange it so that it sits snugly in the middle of the papers.\n20. **Tie the bag up with a knot.** Alternatively, seal with a twist tie.\n21. **Let sit for at least 48 hours.** A few days more won't hurt it.\n22. **Remove from the bag.** Do a sniff test to see whether the odor has gone. If not, return to the bag for a few more days. Eventually it should start to smell better.\n23. 
**Fill a sock with coffee grounds.** They must be dry grounds, so if you're using grounds from your own coffee making, allow them to fully dry first. Or use the cheap instant coffee granules. Knot it off to keep the coffee intact.\n24. **Place the coffee sock inside your old leather bag.** Leave it there for up to a week. During this time, it should soak up much, if not all, of the cigarette smoke odor.\n25. **Do a smell test.** If all is good, the bag is ready for reuse. If it still smells a little, return the sock for a few more days.\n26. **Make or purchase some potpourri.** Place the potpourri inside a sachet.\n27. **Place the sachet inside the smelly bag.** Leave it there for at least one week.\n28. **Place the bag in an airy place.** Do not leave it in a dark cupboard; instead find somewhere with fresh air and indirect, cool light.\n29. **Check a week later.** It's a good idea to leave the sachet in the bag when using as well, as the scent will continue to improve the bag's own scent." - ], - ["Caring for guppies is relatively easy"], - [ - "Many people suffer from hair that is damaged or burnt by various harsh chemical" - ], - ] + model_id = "meta-llama/llama-3-8b-instruct" + format = "formats.llama3_chat" + task = "rating.single_turn" + template = "templates.response_assessment.rating.mt_bench_single_turn" + + inference_model = MockInferenceEngine(model_name=model_id) + model_label = model_id.split("/")[1].replace("-", "_") + model_label = f"{model_label}_ibm_genai" + template_label = template.split(".")[-1] + metric_label = f"{model_label}_template_{template_label}" + metric = LLMAsJudge( + inference_model=inference_model, + template=template, + task=task, + format=format, + main_score=metric_label, + ) + + predictions = ["[[10]]"] * 3 + references = [["[[10]]"], ["[[10]]"], ["[[10]]"]] task_data = [ { - "question": "How to Improve Your Memory Using Meditation", - "answers": [ - "Meditation has been scientifically proven to increase focus and memory. You don't have to use any one meditation to help your memory. Using any meditation, such as mindfulness meditation, teaches you to focus your mind. When you're able to focus better, you're also better able to solidify concepts in your short-term memory. Therefore, practicing meditation can help you to develop your short-term memory.\n1. **Start today.** You may be surprised that you don't need to practice meditation for that long to start seeing the effects. One scientific study examined how a group of students responded to meditation. With just two weeks of meditation practice (10 minutes a day, plus 4 45-minute classes a week), the students significantly improved their GRE scores (a standardized test given to students trying to get into graduate school).\nIn fact, some studies show as little as four days of meditation can improve your attention span and memory.\n2. **Practice often.** Practicing every day is ideal. Doing so will help you work to increase your memory. In fact, spreading it out throughout the day can be helpful, such as meditating for 10 minutes in the morning, 10 minutes at lunch, and 10 minutes in the evening. However, if you find you can't practice every day, do it as often as you can.\n3. **Cultivate mindfulness.** Mindfulness is a part of meditation, but it's also something you can incorporate in your day-to-day life. Mindfulness, at its most basic, just means paying attention. 
In other words, place yourself in the moment, rather then letting your mind race elsewhere.\n\nFor instance, when you're in the shower, stop yourself from thinking about the day ahead. Instead, focus on what the shower feels like. Feel the heat of the water on your skin, how the soap feels against your body. Pay attention to the pleasant scent of your soap or shampoo. Let yourself really feel each sensation.\nYou can practice this technique anywhere. For instance, while you're washing dishes, take a moment to really focus on what you're doing. Let yourself feel the warm water on your skin, the weight of a plate in your hands. Put your full attention on getting the plate clean, making sure it's spotless.\n4. **Work your way up.** You may want to jump in with an hour-long meditation every day. However, most people can't sustain that kind of practice when they haven't meditated before. It's best to start small and work up to more time. You can start with as little as three minutes a day.\n5. **Pick a place to meditate.** Really, you can meditate anywhere, but it's good to choose a place that's not distracting, particularly when you're first starting out. Turn off the television, and move away from distractions. You can even set up a little meditation center in a corner of your house, with a candle and something you like to focus on.\n6. **Sit properly.** You can sit in a chair or on the floor. It's up to you. However, make sure you are relatively comfortable. You don't want a lot of pressure on one part of your body, for instance. Try to sit up straight, though not so much that it feels like a strain.\n7. **Get settled.** Spend a few minutes just bringing yourself into the right state of mind. Focus on the candle, if that helps. You don't have to be completely focused, but as you feel your mind wander, bring it back to the center, to the moment.\n8. **Focus on your breathing.** Once you've situated yourself, try paying attention to just your breathing. Focus on it going in and out. You don't have to change it up. Rather, just keep your attention on it, focusing all of yourself on breathing in and out. As your mind wanders, bring it back to your breath.\n9. **Keep bringing yourself back.** The longer you sit, the more likely your mind is to wander. That's okay. It's normal in fact. The important thing is to acknowledge that you've wandered and move back to your focus. Try labeling it when your mind wanders, such as saying \"thinking\" in your head, and then refocusing on your breath.\n10. **Try deep breathing.** One simple way to get started with meditation is to try deep breathing. Start by placing a hand on your chest and a hand on your stomach. When you breathe, you should notice your stomach expanding more than your chest, as you are trying to breathe as deeply as possible. It can help to close your eyes. Breathe in slowly through your nose. Hold the breath to the count of seven, then let it slowly out through your mouth to the count of eight (in your head).\n\nTry taking five deep breaths each time you try this practice.\nMake sure you are blowing out fully.\n11. **Consider taking a class.** While classes aren't for everyone, a class can jump start your meditation practice, making it easier for you to make it an everyday practice. Plus, if you have no idea where to begin, a class will help you figure out a good starting point.\n\nLook for meditation centers in your area. Some yoga studios offer meditation classes as well. 
Also, Buddhist temples or centers in your area will likely offer classes on meditation.\nYou may also find meditation classes through your library or your local parks and recreation department, and some churches offer meditation classes, particularly ones that embrace other traditions, such as the Unitarian Universalists.\n12. **Don't let distraction make you anxious.** Everyone gets distracted when they meditate. When you're first starting out, that may make you anxious or angry at yourself. However, rather than becoming angry, just try to be aware of when your thoughts are drifting, and pull them back to the meditation.\n13. **Realize even a little meditation can help.** That is, you may think you have to meditate every single day at a certain time for it to be helpful. However, if you fall into that thinking, you may find yourself giving up because you miss a few days. Keep in mind that even a little meditation can help improve your memory. Therefore, try to meditate when you can, even if you don't find time to do it every day.\n14. **Try a guided meditation.** If you don't want to take a class, you can still benefit from the wisdom of others. Try doing a guided meditation. You can find many online, or you can download free apps. The person on the other end will walk you through a meditation process, helping you to learn how to do it.\n15. **Change it up.** You don't have to meditate the same way every time. For instance, some people find a walking meditation helpful. Take a ten-minute walk, focusing on different sensations in turn. Start with feeling your body walking, really focusing on what the movements feel like. Move on to the feeling of breathing. After that, focus on what the air feels like on your skin, then try thinking about just what you see and then just what you hear." - ], - }, - { - "question": "How to Clean a Bathroom Fan", - "answers": [ - "One of the most neglected spaces when it comes to cleaning a bathroom is the fan. Having a clean, functional fan can lessen bathroom odors, as well as combat mold and mildew growth. These issues can become a health hazard if left unattended for too long. By cleaning your fan around every 6 months, you will be able to remove built up dirt before it becomes a problem.\n1. **Turn off the power.** Before you do anything else, ensure that the fan is turned off and cannot turn back on until you are finished cleaning it. Most models will have a plug that is located directly behind the cover. You could remove the cover first and unplug the fan, but just to be safe, go and temporarily pull the breaker for your bathroom. The fan is now safe to work on.\n2. **Remove the cover.** Dust will fall when the cover is removed. To avoid the dust, position your stepladder such that you can reach the cover, but are not standing directly below it. Most covers will have 2 prongs on opposite sides holding it in place, others just need to be unscrewed. Remove the cover by pressing these prongs in or removing the screws, then set the cover aside.\n3. **Remove the fan.** Unscrew the assembly that is holding the fan in place, then very gently remove the fan. Be careful not to drop the fan or hit it on the side of the exhaust pipe as that could potentially chip the fan blades. Broken fan blades will cause the fan to be louder and less effective.\n4. **Clean the cover and fan.** Start by vacuuming off the majority of the built up grime on both the cover and the fan. 
Then dip a rag, preferably a microfiber cloth, in soapy water and use it to wipe up the remaining dust. Be as thorough as you can, you will probably not do this again for a while.\nYou can let the cover soak in a tub of hot soapy water, but the fan should be wiped by hand to avoid getting water on the motor assembly or plug.\n5. **Vacuum the exhaust pipe.** Use a crevice or brush attachment and vacuum off the inside of the exhaust pipe. If you can reach, also use your rag or cloth to wipe off what the vacuum could not get.\n6. **Vacuum the external exhaust port.** This can be done later once the entire process is finished, but at some point you should go outside and find the exterior vent for your bathroom fan. Depending on where the bathroom is located, this vent will either be on the roof or the side of your house. Bring a damp rag to wipe off any dirt that has built up on the other end of your exhaust pipe.\n7. **Wipe and vacuum the fan housing.** If your fan had an accessible plug, be careful not to get any water inside the outlet. Doing so could result in electrocution or short circuit the fan when you plug it back in. Therefore, use a dry rag to wipe off the fan housing, then vacuum up any remaining dust or debris.\n8. **Put the fan back in place.** Before reinstalling the fan, make sure that you cleaned off all the dust from in between each of the blades and dried it thoroughly. Carefully reinsert it into the exhaust pipe and screw the bracing back into place. Use your fingers and spin the fan around a few rotations to make sure that it is not rubbing against anything.\n9. **Turn the power back on.** Plug the fan back into the outlet and reset the breaker for your bathroom. The fan is now dangerous again, so do not touch it or continue to clean it after this point.\n10. **Reinstall the cover.** Once the cover has dried, either screw it back in or bend the prongs until the cover snaps back into place.\n11. **Test the fan.** Turn the fan on again to make sure everything works as normal. The fan should be quieter than it was before and provide a higher amount of air flow." - ], - }, - { - "question": "How to Remove Smell from an Old Leather Bag", - "answers": [ - "Musty, stinky, odorous old leather bags aren't much fun and it's probable you're not keen to reuse such a bag. Before you resort to throwing it out, there are various ways that might just restore it to a respectable odor again.\n1. **Try a simple clean first.** If this clean doesn't shift the odor, you can try one of the other suggested methods after.\n\nWipe the leather bag inside and out with a clean, dry, soft cloth. This will pick up dust, loose debris and even some mold or mildew.\nWipe the leather bag down with a damp cloth. This will collect even more of the above items.\n2. **Allow the bag to air out.** Choose somewhere outdoors that is sheltered from direct light and heat, such as a table on the porch. Leave for a day if possible.\n3. **Check the odor.** If the bag still smells bad, choose one of the remaining suggested methods, or a combination of the methods.\n4. **Prepare a solution consisting of equal parts of white vinegar and distilled water.** Sponge the bag with the solution. Work on the inside of the bag and any mildewed outside part of the bag for a few minutes.\nIt's a good idea to test a small spot before trying this method, in case it stains.\n5. **Wipe off the vinegar solution with a clean, damp cloth.** \n6. 
**Allow to air dry.** Place the bag outside under shelter away from direct light to air dry.\n7. **Check the odor.** If it is still there, repeat. If not, the bag can be used again.\n8. **Use liquid detergent soap to clean the bag.** \n9. **Make a solution of soapy water, using the liquid detergent.** Dip the cleaning cloth or sponge in the solution and wring out before using.\n10. **Wipe the cloth over and inside the bag.** Concentrate in particular on the areas that you think are the smelliest.\n11. **Allow to air dry.** Place outside in a sheltered area away from direct sunlight and heat.\n12. **Once dry, check for the odor.** If it lingers, try again.\n13. **Use baking soda to deodorize the bag.** \n14. **Fill a clean sock with baking soda.** Tie off with a knot.\n15. **Place the leather bag and the baking soda-filled sock inside a large resealable plastic bag.** Alternatively, place both items inside an airtight container.\n16. **Set aside.** Let the baking soda work on the bag for at least 24 hours. The odors from the bag should transfer across to the baking soda.\n17. **Remove from the resealable bag or container.** Check the odor of the leather bag; if it still smells bad, repeat the process for another 24 hours, or longer. If it smells good again, throw away the baking soda, wash the sock and use the leather bag again.\n18. **Find some newspaper.** Scrunch the pages up and stuff them inside a large plastic bag, such as a kitchen waste bag or a garbage bag.\n19. **Slide the smelly leather bag in with the newspapers.** Arrange it so that it sits snugly in the middle of the papers.\n20. **Tie the bag up with a knot.** Alternatively, seal with a twist tie.\n21. **Let sit for at least 48 hours.** A few days more won't hurt it.\n22. **Remove from the bag.** Do a sniff test to see whether the odor has gone. If not, return to the bag for a few more days. Eventually it should start to smell better.\n23. **Fill a sock with coffee grounds.** They must be dry grounds, so if you're using grounds from your own coffee making, allow them to fully dry first. Or use the cheap instant coffee granules. Knot it off to keep the coffee intact.\n24. **Place the coffee sock inside your old leather bag.** Leave it there for up to a week. During this time, it should soak up much, if not all, of the cigarette smoke odor.\n25. **Do a smell test.** If all is good, the bag is ready for reuse. If it still smells a little, return the sock for a few more days.\n26. **Make or purchase some potpourri.** Place the potpourri inside a sachet.\n27. **Place the sachet inside the smelly bag.** Leave it there for at least one week.\n28. **Place the bag in an airy place.** Do not leave it in a dark cupboard; instead find somewhere with fresh air and indirect, cool light.\n29. **Check a week later.** It's a good idea to leave the sachet in the bag when using as well, as the scent will continue to improve the bag's own scent." 
- ], - }, - { - "question": "How to Set up a Guppy Tank", - "answers": ["Caring for guppies is relatively easy"], - }, - { - "question": "How to Fix Chemically Burnt Hair", - "answers": [ - "Many people suffer from hair that is damaged or burnt by various harsh chemical " - ], - }, - ] + "input": "input", + "type_of_input": "type", + "output": "output", + "type_of_output": "type", + "source": "input", + "metadata": {"template": "templates.generation.default"}, + } + ] * 3 + outputs = apply_metric( metric=metric, predictions=predictions, @@ -1368,67 +1335,21 @@ def test_llm_as_judge_metric(self): task_data=task_data, ) actual_scores = [output["score"] for output in outputs] + instance_targets = [ + {metric_label: 1.0, "score_name": metric_label, "score": 1.0} + ] * 3 + global_target = { + metric_label: 1.0, + "score": 1.0, + "score_name": metric_label, + } + expected_scores = [ { - "global": { - "llm_as_judge": 0.1, - "score": 0.1, - "score_name": "llm_as_judge", - }, - "instance": { - "llm_as_judge": 0.1, - "score": 0.1, - "score_name": "llm_as_judge", - }, - }, - { - "global": { - "llm_as_judge": 0.1, - "score": 0.1, - "score_name": "llm_as_judge", - }, - "instance": { - "llm_as_judge": 0.1, - "score": 0.1, - "score_name": "llm_as_judge", - }, - }, - { - "global": { - "llm_as_judge": 0.1, - "score": 0.1, - "score_name": "llm_as_judge", - }, - "instance": { - "llm_as_judge": 0.1, - "score": 0.1, - "score_name": "llm_as_judge", - }, - }, - { - "global": { - "llm_as_judge": 0.1, - "score": 0.1, - "score_name": "llm_as_judge", - }, - "instance": { - "llm_as_judge": 0.1, - "score": 0.1, - "score_name": "llm_as_judge", - }, - }, - { - "global": { - "llm_as_judge": 0.1, - "score": 0.1, - "score_name": "llm_as_judge", - }, - "instance": { - "llm_as_judge": 0.1, - "score": 0.1, - "score_name": "llm_as_judge", - }, - }, + "global": global_target, + "instance": instance_target, + } + for instance_target in instance_targets ] self.assertListEqual(actual_scores, expected_scores) diff --git a/tests/library/test_postprocessors.py b/tests/library/test_postprocessors.py index b99708e4d..aa4f29d82 100644 --- a/tests/library/test_postprocessors.py +++ b/tests/library/test_postprocessors.py @@ -355,8 +355,8 @@ def test_span_labeling_json_template_errors(self): tester=self, ) - def test_extract_mt_bench_judgment(self): - postprocessor, _ = fetch_artifact("processors.extract_mt_bench_judgment") + def test_extract_mt_bench_rating_judgment(self): + postprocessor, _ = fetch_artifact("processors.extract_mt_bench_rating_judgment") predictions = [ "no reason 3.14 [[3]]", "[[6]]", @@ -374,6 +374,24 @@ def test_extract_mt_bench_judgment(self): tester=self, ) + def test_extract_mt_bench_label_judgment(self): + postprocessor, _ = fetch_artifact("processors.extract_mt_bench_label_judgment") + predictions = [ + "no reason 3.14 [[A]]", + "[[B]]", + "[[A]] because", + "good", + "bad [[C]]", + ] + targets = ["A", "B", "A", "None", "C"] + + check_operator( + operator=postprocessor, + inputs=list_to_stream_with_prediction_and_references(predictions), + targets=list_to_stream_with_prediction_and_references(targets), + tester=self, + ) + def test_literal_eval(self): parser, _ = fetch_artifact("processors.literal_eval") inputs = [ diff --git a/tests/library/test_recipe.py b/tests/library/test_recipe.py index 4210f5816..0b227a9c6 100644 --- a/tests/library/test_recipe.py +++ b/tests/library/test_recipe.py @@ -88,7 +88,13 @@ def test_standard_recipe_production_without_demos(self): "source": "<<SYS>>\nYou are a helpful, respectful and 
honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n\n\n\n\nUser:The following are multiple choice questions (with answers) about testing.\n\nwhat?\nA. yes\nB. not\nC. maybe\nAnswer:\nAgent:", "target": " C", "references": [" C"], - "task_data": '{"topic": "testing", "question": "what?", "choices": ["yes", "not", "maybe"], "answer": "maybe", "options": [" A", " B", " C"]}', + "task_data": '{"topic": "testing", ' + '"question": "what?", ' + '"choices": ["yes", "not", "maybe"], ' + '"answer": "maybe", ' + '"options": [" A", " B", " C"], ' + '"metadata": {"template": "templates.qa.multiple_choice.with_topic.lm_eval_harness"}' + "}", "group": "unitxt", "postprocessors": ["processors.first_character"], } @@ -154,7 +160,13 @@ def test_standard_recipe_production_with_demos(self): "source": "<<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n\n\n\nUser: The following are multiple choice questions (with answers) about marketing.\n\nThe single group within society that is most vulnerable to reference group influence is:\nA. The older consumer who feels somewhat left out of things.\nB. The married women, many of whom feel a need for stability in their lives.\nC. New immigrants who really want to assimilate into their new culture.\nD. Children, who base most of their buying decisions on outside influences.\nAnswer:\nAgent: D\n\nUser: The following are multiple choice questions (with answers) about marketing.\n\n Which of the following is an assumption in Maslow's hierarchy of needs?\nA. Needs are dependent on culture and also on social class.\nB. Lower-level needs must be at least partially satisfied before higher needs can affect behaviour.\nC. Needs are not prioritized or arranged in any particular order.\nD. Satisfied needs are motivators, and new needs emerge when current needs remain unmet.\nAnswer:\nAgent: B\n\nUser: The following are multiple choice questions (with answers) about marketing.\n\nIn an organization, the group of people tasked with buying decisions is referred to as the _______________.\nA. Outsourcing unit.\nB. Procurement centre.\nC. Chief executive unit.\nD. Decision-making unit.\nAnswer:\nAgent: D\n\n\nUser:The following are multiple choice questions (with answers) about testing.\n\nwhat?\nA. yes\nB. not\nC. 
maybe\nAnswer:\nAgent:", "target": " C", "references": [" C"], - "task_data": '{"topic": "testing", "question": "what?", "choices": ["yes", "not", "maybe"], "answer": "maybe", "options": [" A", " B", " C"]}', + "task_data": '{"topic": "testing",' + ' "question": "what?",' + ' "choices": ["yes", "not", "maybe"],' + ' "answer": "maybe",' + ' "options": [" A", " B", " C"],' + ' "metadata": {"template": "templates.qa.multiple_choice.with_topic.lm_eval_harness"}' + "}", "group": "unitxt", "postprocessors": ["processors.first_character"], } diff --git a/tests/utils.py b/tests/utils.py index 1b6558fa3..b8f2bda13 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -27,7 +27,7 @@ def setUpClass(cls): enable_explicit_format() unitxt.settings.allow_unverified_code = True unitxt.settings.use_only_local_catalogs = True - unitxt.settings.global_loader_limit = 300 + # unitxt.settings.global_loader_limit = 300 unitxt.settings.max_log_message_size = 1000 register_local_catalog_for_tests() cls.maxDiff = None