Gauntlet v0.1.0 yaml fixes #748

Merged: 116 commits, merged Nov 21, 2023. Changes shown from 111 commits.

Commits (116)
a4c83d3 update yaml (bmosaicml, Aug 11, 2023)
fbf43b0 adding datasets (bmosaicml, Aug 18, 2023)
7d9c367 adding datasets (bmosaicml, Aug 18, 2023)
5054667 Merge branch 'model_gauntlet_v0.1' of github.com:mosaicml/llm-foundry… (bmosaicml, Aug 24, 2023)
b32224a added agi eval (bmosaicml, Aug 31, 2023)
46b27a3 test CoT eval (bmosaicml, Sep 5, 2023)
4d82180 fix broken eval yaml (bmosaicml, Sep 5, 2023)
b1bbff1 fix broken eval yaml (bmosaicml, Sep 5, 2023)
7554b01 Merge branch 'model_gauntlet_v0.1' of github.com:mosaicml/llm-foundry… (bmosaicml, Sep 7, 2023)
1845ffa merge main (bmosaicml, Sep 11, 2023)
1508a9f debugging (bmosaicml, Sep 11, 2023)
b10c16c debugging (bmosaicml, Sep 11, 2023)
85f081e commit (bmosaicml, Sep 28, 2023)
de28be4 commit (bmosaicml, Sep 28, 2023)
193b4e0 commit (bmosaicml, Sep 29, 2023)
e896c09 commit (bmosaicml, Sep 29, 2023)
a16aea1 commit (bmosaicml, Sep 29, 2023)
6f5fc2c restore mcli (bmosaicml, Oct 2, 2023)
7968ca4 adding simple tasks (mcarbin, Sep 22, 2023)
0f8b160 Merge branch 'main' into human_eval_simple (bmosaicml, Oct 3, 2023)
a84dda0 add simple human_eval (bmosaicml, Oct 3, 2023)
90a2dca fix yaml (bmosaicml, Oct 3, 2023)
4c54e3d Merge branch 'main' into mike/human-eval-simple (bmosaicml, Oct 3, 2023)
bcda9ef Merge branch 'main' into mike/human-eval-simple (bmosaicml, Oct 4, 2023)
48060e1 fix yaml (bmosaicml, Oct 4, 2023)
4eee5c1 Merge branch 'mike/human-eval-simple' of github.com:mcarbin/llm-found… (bmosaicml, Oct 4, 2023)
9534d73 remove breakpoint (bmosaicml, Oct 5, 2023)
51d1b72 remove breakpoint (bmosaicml, Oct 5, 2023)
e11bb34 change bsz (bmosaicml, Oct 6, 2023)
fb750f3 Merge branch 'main' into human_eval_simple (bmosaicml, Oct 9, 2023)
b50da7d merge main (bmosaicml, Oct 9, 2023)
4b29f92 Merge branch 'main' into mike/human-eval-simple (bmosaicml, Oct 9, 2023)
07c2c07 Merge branch 'main' into change_gauntlet_avging (bmosaicml, Oct 9, 2023)
841eac0 eval gauntlet cb (bmosaicml, Oct 9, 2023)
94100fb Merge branch 'main' into change_gauntlet_avging (bmosaicml, Oct 9, 2023)
211def8 Merge branch 'main' into mike/human-eval-simple (bmosaicml, Oct 9, 2023)
5cf9b89 Merge branch 'main' into change_gauntlet_avging (dakinggg, Oct 10, 2023)
9159dc8 Merge branch 'main' into change_gauntlet_avging (dakinggg, Oct 10, 2023)
043f473 Merge branch 'main' into mike/human-eval-simple (dakinggg, Oct 10, 2023)
d1744fe Merge branch 'main' into change_gauntlet_avging (dakinggg, Oct 10, 2023)
cdc2065 add udpated readme (bmosaicml, Oct 10, 2023)
eb09668 Merge branch 'main' into mike/human-eval-simple (bmosaicml, Oct 10, 2023)
cdbd3d1 Merge branch 'mike/human-eval-simple' of github.com:mcarbin/llm-found… (bmosaicml, Oct 10, 2023)
adeb790 Merge branch 'main' into change_gauntlet_avging (bmosaicml, Oct 10, 2023)
26107e8 merge (bmosaicml, Oct 10, 2023)
2fd5c16 Merge branch 'main' into mike/human-eval-simple (bmosaicml, Oct 10, 2023)
c8a21c6 merge2 (bmosaicml, Oct 10, 2023)
71cd420 fix precommit (bmosaicml, Oct 10, 2023)
ff68b53 Merge branch 'mike/human-eval-simple' of github.com:mcarbin/llm-found… (bmosaicml, Oct 10, 2023)
bd8a9f8 Merge branch 'main' into change_gauntlet_avging (bmosaicml, Oct 10, 2023)
47ebdfc add pii (bmosaicml, Oct 11, 2023)
2d17293 restor line (bmosaicml, Oct 11, 2023)
8b7abf1 restor line (bmosaicml, Oct 11, 2023)
ed3f52a finish non gen safety (bmosaicml, Oct 11, 2023)
f96bbce add execution predicrtion (bmosaicml, Oct 11, 2023)
6ff9133 add execution prediction (bmosaicml, Oct 11, 2023)
bb7a847 add execution prediction (bmosaicml, Oct 11, 2023)
6ee0f48 Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Oct 11, 2023)
f332091 change mosaicml reqs (bmosaicml, Oct 11, 2023)
bb89c97 change mosaicml reqs (bmosaicml, Oct 11, 2023)
966b1bb fix error (bmosaicml, Oct 11, 2023)
977b48f Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Oct 11, 2023)
e15246e comment (bmosaicml, Oct 11, 2023)
e40463b test smaller beams (bmosaicml, Oct 12, 2023)
4a4d021 tesT (bmosaicml, Oct 12, 2023)
c5e30ec tesT (bmosaicml, Oct 12, 2023)
19c8b2b tesT (bmosaicml, Oct 13, 2023)
441b004 Merge branch 'change_gauntlet_avging' into execution_prediction (bmosaicml, Oct 13, 2023)
3025517 add coding task (bmosaicml, Oct 13, 2023)
5e444e3 tesT (bmosaicml, Oct 13, 2023)
b55bdda finish eval (bmosaicml, Oct 16, 2023)
2d77a0e finish data (bmosaicml, Oct 19, 2023)
c025ac4 fix (bmosaicml, Oct 19, 2023)
2702e15 fix (bmosaicml, Oct 19, 2023)
d393603 Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Oct 24, 2023)
0350950 remove strategyqa cot (bmosaicml, Oct 24, 2023)
27754e8 merge (bmosaicml, Oct 24, 2023)
b00a806 remove (bmosaicml, Oct 25, 2023)
7b68b55 remove (bmosaicml, Oct 25, 2023)
57e35cd Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Oct 26, 2023)
814a8c9 foo (bmosaicml, Oct 26, 2023)
b142b57 edit (bmosaicml, Oct 27, 2023)
a56b30e Merge branch 'main' into execution_prediction (bmosaicml, Nov 2, 2023)
368d232 Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Nov 2, 2023)
d86ece1 fix (bmosaicml, Nov 2, 2023)
0477773 rm breakpoint (bmosaicml, Nov 2, 2023)
7fed6a2 rm breakpoint (bmosaicml, Nov 2, 2023)
48bf32b remove execution prediction; make coding optional (bmosaicml, Nov 9, 2023)
f2f59df remove execution prediction; make coding optional (bmosaicml, Nov 9, 2023)
01b3103 merge (bmosaicml, Nov 9, 2023)
20aeafd remove import (bmosaicml, Nov 9, 2023)
46b7d72 remove import (bmosaicml, Nov 9, 2023)
93fa556 restore files (bmosaicml, Nov 10, 2023)
f17f8c4 restore (bmosaicml, Nov 10, 2023)
a9606b4 restore (bmosaicml, Nov 10, 2023)
cd08024 Merge branch 'main' into execution_prediction (bmosaicml, Nov 10, 2023)
f0c0b9d update readm; rename gauntlet yamls (bmosaicml, Nov 12, 2023)
4572056 edit yamls (bmosaicml, Nov 12, 2023)
297bfa2 Merge branch 'main' into execution_prediction (bmosaicml, Nov 12, 2023)
e6b696f fix yamllint (bmosaicml, Nov 13, 2023)
a1f8ffb Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Nov 13, 2023)
6014dd4 restore mpt eval (bmosaicml, Nov 13, 2023)
90ef388 Merge branch 'main' into execution_prediction (bmosaicml, Nov 17, 2023)
1357cb8 Merge branch 'main' into execution_prediction (bmosaicml, Nov 20, 2023)
65d9604 finish (bmosaicml, Nov 20, 2023)
dc525be Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Nov 20, 2023)
d5f3139 follow up (bmosaicml, Nov 20, 2023)
00a0426 fix (bmosaicml, Nov 20, 2023)
5e2101d Merge branch 'main' into execution_prediction (dakinggg, Nov 21, 2023)
4bc2594 precommit (bmosaicml, Nov 21, 2023)
6eb14a5 Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Nov 21, 2023)
c1e2fc9 Merge branch 'main' into execution_prediction (bmosaicml, Nov 21, 2023)
970a4df Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Nov 21, 2023)
0d650c1 Merge branch 'main' into execution_prediction (dakinggg, Nov 21, 2023)
afbd992 precommit (bmosaicml, Nov 21, 2023)
fd890fe Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Nov 21, 2023)
scripts/eval/README.md (2 changes: 1 addition & 1 deletion)

@@ -1,6 +1,6 @@
 # In-context learning (ICL) evaluation

-This folder contains the MosaicML LLM evaluation suite. It is a [blazingly fast](https://www.mosaicml.com/blog/llm-evaluation-for-icl), multi-GPU-enabled ICL evaluation suite with native [FSDP](https://pytorch.org/docs/stable/fsdp.html) compatibility with any model on the HuggingFace hub and any PyTorch model that implements the [`ComposerModel` interface](https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.ComposerModel.html#composermodel). We also include collection of ICL datasets we refer to as our [Model Gauntlet](https://github.com/mosaicml/llm-foundry/blob/scripts/eval/local_data/eval_gauntlet.md) organized into 6 broad categories of competency that we expect good foundation models to have.
+This folder contains the MosaicML LLM evaluation suite. It is a [blazingly fast](https://www.mosaicml.com/blog/llm-evaluation-for-icl), multi-GPU-enabled ICL evaluation suite with native [FSDP](https://pytorch.org/docs/stable/fsdp.html) compatibility with any model on the HuggingFace hub and any PyTorch model that implements the [`ComposerModel` interface](https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.ComposerModel.html#composermodel). We also include collection of ICL datasets we refer to as our [Eval Gauntlet](https://github.com/mosaicml/llm-foundry/blob/scripts/eval/local_data/eval_gauntlet.md) organized into 6 broad categories of competency that we expect good foundation models to have.

 You can evaluate a model by preparing an evaluation YAML following the format of the examples in the [`scripts/eval/yamls` directory](https://github.com/mosaicml/llm-foundry/tree/main/scripts/eval/yamls).
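For readers following the README's pointer about preparing an evaluation YAML: the task-list format used by the YAMLs touched in this PR looks like the sketch below. It mirrors the copa entry added later in this diff and illustrates only the `icl_tasks` portion; a complete eval config (model, tokenizer, and so on) is not shown in this PR.

# Minimal sketch of the icl_tasks format used in scripts/eval/yamls/.
# Mirrors the copa entry added in this PR; not a complete eval config.
icl_tasks:
-
  label: copa
  dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl  # JSONL, one example per line
  num_fewshot: [0]                                               # zero-shot
  icl_task_type: multiple_choice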
scripts/eval/local_data/eval_gauntlet.md (per the README link above; the Eval Gauntlet README)

@@ -1,4 +1,4 @@
-# Mosaic Model Gauntlet v0 - Evaluation Suite
+# Mosaic Eval Gauntlet v0.1.0 - Evaluation Suite


 <!-- SETUPTOOLS_LONG_DESCRIPTION_HIDE_BEGIN -->
@@ -7,9 +7,9 @@
   <img alt="LLM Foundry" src="../../../assets/radar_blog.png" width="60%">
 </picture>
 <br>
-MPT-7B vs MPT-30B compared on the 6 categories of Model Gauntlet.
+MPT-7B vs MPT-30B compared on the 6 categories of Eval Gauntlet v0.
 </p>
-The Mosaic Model Gauntlet is MosaicML’s new technique for evaluating the quality of pretrained foundation models. The Model Gauntlet encompasses 35 different benchmarks collected from a variety of sources, and organized into 6 broad categories of competency that we expect good foundation models to have. We compiled the categories after an extensive review of existing LLM publications, and open source evaluation harnesses such as EleutherAI Eval Harness and Stanford CRFM’s HELM.
+The Mosaic Eval Gauntlet is MosaicML’s new technique for evaluating the quality of pretrained foundation models. The Eval Gauntlet encompasses 35 different benchmarks collected from a variety of sources, and organized into 6 broad categories of competency that we expect good foundation models to have. We compiled the categories after an extensive review of existing LLM publications, and open source evaluation harnesses such as EleutherAI Eval Harness and Stanford CRFM’s HELM.

 <br>
 While deciding which benchmarks to include, we had a few criteria in mind. We wanted benchmarks to require a broad range of skills that were useful for practical applications, we wanted them to come from a diverse range of sources, we wanted them to capture skills that have been traditionally emphasized by the research community as well as those that have been underexplored, and we wanted them to be evaluated via simple, unambiguous metrics such as exact match and multiple choice accuracy. The philosophy behind compiling aggregate scores as opposed to the more common approach of reporting individual metrics, is two-fold.
@@ -24,7 +24,7 @@ At evaluation time, we run all the benchmarks, average the subscores within each

 For example, if benchmark A has a random baseline accuracy of 25%, and the model achieved 30%, we would report this as (0.3 - 0.25)/(1-0.25) = 0.0667. This can be thought of as the accuracy above chance rescaled so the max is 1. For benchmarks in which the random guessing baseline accuracy is ~0 we report the accuracy as is. Note that with this rescaling, a model could technically score below 0 on a category as a whole, but we haven’t found this to occur with any of the models we’ve tested.

-This is version v0, in the coming weeks we will update the mixture to include more benchmarks.
+This is version v0.1.0 of the Eval Gauntlet.

 ### Reading Comprehension

@@ -349,7 +349,7 @@ The Safety category consists of benchmarks designed to assess model's toxicity,
 - Random baseline accuracy: 50%

 ### Programming
-Programming tasks evaluate the model's ability to understand code, write functionally correct code given a specification, simulate code, and document code. Right now we just have HumanEval but later versions will include more.
+Programming tasks evaluate the model's ability to understand code, write functionally correct code given a specification, simulate code, and document code. Right now we just have HumanEval but later versions will include more. By default the programming tasks are disabled in `scripts/eval/yamls/tasks.yaml` due to their long duration.

 51. HumanEval Python code generation
 - Description: HumanEval Python consists of 164 python programming challenges, in which the model is presented with the method signature and docstring comment for a python program and is expected to complete the program. We then test the resultant code’s functional correctness on a number of test input/output pairs.
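The rescaling described in the hunk above maps chance-level accuracy to 0 and perfect accuracy to 1. Restating the quoted example compactly in LaTeX (no new behavior, just the same arithmetic):

$$\mathrm{score} = \frac{\mathrm{acc} - \mathrm{baseline}}{1 - \mathrm{baseline}}, \qquad \text{e.g.}\quad \frac{0.30 - 0.25}{1 - 0.25} = \frac{0.05}{0.75} \approx 0.0667.$$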
scripts/eval/yamls/copa.yaml (new file, 5 additions & 0 deletions)

@@ -0,0 +1,5 @@
+-
+  label: copa
+  dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl
+  num_fewshot: [0]
+  icl_task_type: multiple_choice
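A standalone task file like this can then be referenced from an eval config rather than inlined. The path-valued form below is an assumption about how `icl_tasks` is resolved, not something this diff shows:

# Hypothetical eval-config fragment: reference the new task file by path.
# Whether icl_tasks accepts a path string here is an assumption.
icl_tasks: eval/yamls/copa.yaml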
scripts/eval/yamls/tasks.yaml (54 changes: 27 additions & 27 deletions)

@@ -1,69 +1,69 @@
 icl_tasks:
 -
   label: jeopardy
-  dataset_uri: eval/local_data/world_knowledge/jeopardy_all.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/world_knowledge/jeopardy_all.jsonl
   num_fewshot: [10]
   icl_task_type: language_modeling
   continuation_delimiter: "\nAnswer: " # this separates questions from answers
   has_categories: true
 -
   label: bigbench_qa_wikidata
-  dataset_uri: eval/local_data/world_knowledge/bigbench_qa_wikidata.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/world_knowledge/bigbench_qa_wikidata.jsonl
   num_fewshot: [10]
   icl_task_type: language_modeling
 -
   label: arc_easy
-  dataset_uri: eval/local_data/world_knowledge/arc_easy.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/world_knowledge/arc_easy.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
   continuation_delimiter: "\nAnswer: " # this separates questions from answers
 -
   label: arc_challenge
-  dataset_uri: eval/local_data/world_knowledge/arc_challenge.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/world_knowledge/arc_challenge.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
   continuation_delimiter: "\nAnswer: " # this separates questions from answers
 -
   label: mmlu
-  dataset_uri: eval/local_data/world_knowledge/mmlu.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/world_knowledge/mmlu.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
   continuation_delimiter: "\nAnswer: " # this separates questions from answers
   has_categories: true
 -
   label: bigbench_misconceptions
-  dataset_uri: eval/local_data/world_knowledge/bigbench_misconceptions.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/world_knowledge/bigbench_misconceptions.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
 -
   label: copa
-  dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl
   num_fewshot: [0]
   icl_task_type: multiple_choice
 -
   label: piqa
-  dataset_uri: eval/local_data/commonsense_reasoning/piqa.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/commonsense_reasoning/piqa.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
   continuation_delimiter: "\nAnswer: " # this separates questions from answers
 -
   label: openbook_qa
-  dataset_uri: eval/local_data/commonsense_reasoning/openbook_qa.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/commonsense_reasoning/openbook_qa.jsonl
   num_fewshot: [0]
   icl_task_type: multiple_choice
 -
   label: bigbench_novel_concepts
-  dataset_uri: eval/local_data/commonsense_reasoning/bigbench_novel_concepts.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/commonsense_reasoning/bigbench_novel_concepts.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
 -
   label: bigbench_strange_stories
-  dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strange_stories.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strange_stories.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
 -
   label: bigbench_strategy_qa
-  dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strategy_qa.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strategy_qa.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
 -
@@ -73,17 +73,17 @@ icl_tasks:
   icl_task_type: language_modeling
 -
   label: hellaswag
-  dataset_uri: eval/local_data/language_understanding/hellaswag.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/language_understanding/hellaswag.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
 -
   label: winograd
-  dataset_uri: eval/local_data/language_understanding/winograd_wsc.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/language_understanding/winograd_wsc.jsonl
   num_fewshot: [0]
   icl_task_type: schema
 -
   label: winogrande
-  dataset_uri: eval/local_data/language_understanding/winogrande.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/language_understanding/winogrande.jsonl
   num_fewshot: [0]
   icl_task_type: schema
 -
@@ -154,84 +154,84 @@ icl_tasks:
   continuation_delimiter: "\nAnswer: " # this separates questions from answers
 -
   label: pubmed_qa_labeled
-  dataset_uri: eval/local_data/reading_comprehension/pubmed_qa_labeled.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/reading_comprehension/pubmed_qa_labeled.jsonl
   num_fewshot: [10]
   icl_task_type: language_modeling
 -
   label: squad
-  dataset_uri: eval/local_data/reading_comprehension/squad.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/reading_comprehension/squad.jsonl
   num_fewshot: [10]
   icl_task_type: language_modeling
 -
   label: bigbench_understanding_fables
-  dataset_uri: eval/local_data/reading_comprehension/bigbench_understanding_fables.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/reading_comprehension/bigbench_understanding_fables.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
 -
   label: boolq
-  dataset_uri: eval/local_data/reading_comprehension/boolq.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/reading_comprehension/boolq.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
   continuation_delimiter: "\nAnswer: " # this separates questions from answers
 # -
 #   label: human_eval
-#   dataset_uri: eval/local_data/programming/human_eval.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/human_eval.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
 #   batch_size: 1
 #   icl_task_type: code_evaluation
 # -
 #   label: human_eval_cpp
-#   dataset_uri: eval/local_data/programming/processed_human_eval_cpp.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/processed_human_eval_cpp.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
 #   batch_size: 1
 #   icl_task_type: code_evaluation
 # -
 #   label: human_eval_js
-#   dataset_uri: eval/local_data/programming/processed_human_eval_js.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/processed_human_eval_js.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
 #   batch_size: 1
 #   icl_task_type: code_evaluation
 # -
 #   label: human_eval_return_simple
-#   dataset_uri: eval/local_data/programming/human_eval_return_simple.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/human_eval_return_simple.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
 #   batch_size: 1
 #   icl_task_type: code_evaluation
 # -
 #   label: human_eval_return_complex
-#   dataset_uri: eval/local_data/programming/human_eval_return_complex.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/human_eval_return_complex.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
 #   batch_size: 1
 #   icl_task_type: code_evaluation
 # -
 #   label: human_eval_25
-#   dataset_uri: eval/local_data/programming/human_eval-0.25.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/human_eval-0.25.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
 #   batch_size: 1
 #   icl_task_type: code_evaluation
 # -
 #   label: human_eval_50
-#   dataset_uri: eval/local_data/programming/human_eval-0.5.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/human_eval-0.5.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
 #   batch_size: 1
 #   icl_task_type: code_evaluation
 # -
 #   label: human_eval_75
-#   dataset_uri: eval/local_data/programming/human_eval-0.75.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/human_eval-0.75.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
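As the Eval Gauntlet README change above notes, the programming tasks ship commented out because of their long runtime. Re-enabling HumanEval amounts to uncommenting its block in scripts/eval/yamls/tasks.yaml; the restored entry would read as follows, with every field copied from the commented lines above:

-
  label: human_eval
  dataset_uri: eval/local_data/programming/human_eval.jsonl
  num_fewshot: [0]                # no in-context examples for code generation
  pass_at_k: 1                    # score pass@1
  num_beams: 20
  batch_size: 1
  icl_task_type: code_evaluation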