
Migrate ICL classes to foundry #936

Merged: 100 commits, Apr 12, 2024

Conversation

@bmosaicml (Contributor) commented on Feb 2, 2024

DEPRECATING COMPOSER CLASSES: mosaicml/composer#3125

This PR migrates all of the ICL(Dataset|Metric) classes into foundry (including the superclasses, since composer no longer depends on them), along with all of the relevant tests. It also renames QATask to InContextLearningGenerationTaskWithAnswers, to capture the fact that it can and will be used for arbitrary generation tasks (such as summarization) and can even be used with LLM-as-judge.
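
For downstream users, the change mostly amounts to updating imports. A minimal before/after sketch follows; the exact foundry module path is an assumption based on this PR's direction, not something stated above:

```python
# Sketch of the import migration; the llmfoundry module path below is an
# assumption, not confirmed by this PR description.

# Before: the class lived in composer under its old QATask name.
# from composer.datasets.in_context_learning_evaluation import (
#     InContextLearningQATaskDataset,
# )

# After: the class lives in llm-foundry under the new generation-task name.
from llmfoundry.eval.datasets.in_context_learning_evaluation import (
    InContextLearningGenerationTaskWithAnswersDataset,
)
```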

Relatedly, we need to remove or deprecate the equivalent classes in composer to avoid confusion and to prevent people from adding new functionality to them in the future.
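
On the composer side, one common way to deprecate without immediately breaking users is a thin shim that warns when the old class is constructed. The sketch below is hypothetical; the actual deprecation is tracked in mosaicml/composer#3125 and may differ:

```python
# Hypothetical deprecation shim for the legacy composer class; the real
# deprecation lands in mosaicml/composer#3125 and may look different.
import warnings


class InContextLearningQATaskDataset:
    """Deprecated: migrated to llm-foundry as
    InContextLearningGenerationTaskWithAnswersDataset."""

    def __init__(self, *args, **kwargs):
        del args, kwargs  # legacy signature accepted; this stub builds nothing
        warnings.warn(
            'InContextLearningQATaskDataset has moved to llm-foundry as '
            'InContextLearningGenerationTaskWithAnswersDataset and will be '
            'removed from composer in a future release.',
            DeprecationWarning,
            stacklevel=2,
        )
```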

Experimental runs (post-migration):

mpt 7b: mpt-eval-zDGaOU
Llama 2 7b: llama2-eval-66Rw1B

| model_name               |   core_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |
|:-------------------------|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|
| mosaicml/mpt-7b |       0.343081 |          0.421662 |                0.256372 |                 0.634086 |                   0.155426 |                0.247861 |
| meta-llama/Llama-2-7b-hf |       0.417207 |          0.510394 |                0.355677 |                 0.655805 |                   0.227892 |                0.336269 |

| Category                 | Benchmark                    | Subtask                             |   Accuracy | Few-shot examples | Model                    |
|:-------------------------|:-----------------------------|:------------------------------------|-----------:|:------------------|:-------------------------|
| symbolic_problem_solving | gsm8k                        |                                     |  0.0871873 | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | copa                         |                                     |  0.8       | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | commonsense_qa               |                                     |  0.225225  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | piqa                         |                                     |  0.799238  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | bigbench_strange_stories     |                                     |  0.568965  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | bigbench_strategy_qa         |                                     |  0.561817  | 0-shot            | mosaicml/mpt-7b |
| language_understanding   | lambada_openai               |                                     |  0.702892  | 0-shot            | mosaicml/mpt-7b |
| language_understanding   | hellaswag                    |                                     |  0.761601  | 0-shot            | mosaicml/mpt-7b |
| reading_comprehension    | coqa                         |                                     |  0.453213  | 0-shot            | mosaicml/mpt-7b |
| reading_comprehension    | boolq                        |                                     |  0.747401  | 0-shot            | mosaicml/mpt-7b |
| world_knowledge          | triviaqa_sm_sub              |                                     |  0.493667  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | jeopardy                     | Average                             |  0.459835  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | american_history                    |  0.513317  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | literature                          |  0.557143  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | science                             |  0.386555  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | word_origins                        |  0.265753  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | world_history                       |  0.576407  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | bigbench_qa_wikidata         |                                     |  0.655824  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_easy                     |                                     |  0.718855  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_challenge                |                                     |  0.440273  | 3-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | siqa                         |                                     |  0.54913   | 3-shot            | mosaicml/mpt-7b |
| language_understanding   | winograd                     |                                     |  0.85348   | 3-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_operators           |                                     |  0.333333  | 3-shot            | mosaicml/mpt-7b |
| reading_comprehension    | squad                        |                                     |  0.553264  | 3-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | svamp                        |                                     |  0.32      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          | mmlu                         | Average                             |  0.281358  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | abstract_algebra                    |  0.26      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | anatomy                             |  0.303704  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | astronomy                           |  0.309211  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | business_ethics                     |  0.38      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | clinical_knowledge                  |  0.286792  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_biology                     |  0.291667  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_chemistry                   |  0.21      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_computer_science            |  0.25      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_mathematics                 |  0.31      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_medicine                    |  0.225434  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_physics                     |  0.215686  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | computer_security                   |  0.35      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | conceptual_physics                  |  0.289362  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | econometrics                        |  0.245614  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | electrical_engineering              |  0.324138  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | elementary_mathematics              |  0.272487  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | formal_logic                        |  0.222222  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | global_facts                        |  0.32      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_biology                 |  0.3       | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_chemistry               |  0.187192  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_computer_science        |  0.34      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_european_history        |  0.321212  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_geography               |  0.313131  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_government_and_politics |  0.264249  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_macroeconomics          |  0.266667  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_mathematics             |  0.211111  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_microeconomics          |  0.247899  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_physics                 |  0.291391  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_psychology              |  0.251376  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_statistics              |  0.208333  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_us_history              |  0.181373  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_world_history           |  0.253165  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | human_aging                         |  0.403587  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | human_sexuality                     |  0.259542  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | international_law                   |  0.347107  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | jurisprudence                       |  0.324074  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | logical_fallacies                   |  0.251534  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | machine_learning                    |  0.321429  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | management                          |  0.242718  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | marketing                           |  0.299145  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | medical_genetics                    |  0.22      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | miscellaneous                       |  0.301405  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | moral_disputes                      |  0.32659   | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | moral_scenarios                     |  0.259218  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | nutrition                           |  0.30719   | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | philosophy                          |  0.315113  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | prehistory                          |  0.302469  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_accounting             |  0.248227  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_law                    |  0.269231  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_medicine               |  0.198529  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_psychology             |  0.271242  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | public_relations                    |  0.381818  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | security_studies                    |  0.236735  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | sociology                           |  0.268657  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | us_foreign_policy                   |  0.36      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | virology                            |  0.349398  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | world_religions                     |  0.269006  | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_dyck_languages      |                                     |  0.304     | 5-shot            | mosaicml/mpt-7b |
| language_understanding   | winogrande                   |                                     |  0.722178  | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | agi_eval_lsat_ar             |                                     |  0.23913   | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | simple_arithmetic_nospaces   |                                     |  0.082     | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | simple_arithmetic_withspaces |                                     |  0.089     | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_lsat_rc             |                                     |  0.235075  | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_lsat_lr             |                                     |  0.247059  | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_sat_en              |                                     |  0.257282  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_challenge                |                                     |  0.4343    | 25-shot           | mosaicml/mpt-7b |
| commonsense_reasoning    | openbook_qa                  |                                     |  0.452     | 10-shot           | mosaicml/mpt-7b |
| language_understanding   | hellaswag                    |                                     |  0.765385  | 10-shot           | mosaicml/mpt-7b |
|                          | bigbench_cs_algorithms       |                                     |  0.480303  | 10-shot           | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_elementary_math_qa  |                                     |  0.281787  | 1-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | gsm8k                        |                                     |   0.148597 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | copa                         |                                     |   0.8      | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | commonsense_qa               |                                     |   0.383292 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | piqa                         |                                     |   0.786181 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | bigbench_strange_stories     |                                     |   0.614943 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | bigbench_strategy_qa         |                                     |   0.585408 | 0-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | lambada_openai               |                                     |   0.736658 | 0-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | hellaswag                    |                                     |   0.74995  | 0-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | coqa                         |                                     |   0.4705   | 0-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | boolq                        |                                     |   0.792966 | 0-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | triviaqa_sm_sub              |                                     |   0.582333 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | jeopardy                     | Average                             |   0.508028 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | american_history                    |   0.564165 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | literature                          |   0.661224 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | science                             |   0.388655 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | word_origins                        |   0.30411  | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | world_history                       |   0.621984 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | bigbench_qa_wikidata         |                                     |   0.693125 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | arc_easy                     |                                     |   0.757155 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | arc_challenge                |                                     |   0.494881 | 3-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | siqa                         |                                     |   0.730809 | 3-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | winograd                     |                                     |   0.879121 | 3-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | bigbench_operators           |                                     |   0.42381  | 3-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | squad                        |                                     |   0.532545 | 3-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | svamp                        |                                     |   0.423333 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | mmlu                         | Average                             |   0.457122 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | abstract_algebra                    |   0.31     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | anatomy                             |   0.422222 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | astronomy                           |   0.460526 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | business_ethics                     |   0.48     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | clinical_knowledge                  |   0.418868 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_biology                     |   0.416667 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_chemistry                   |   0.28     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_computer_science            |   0.29     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_mathematics                 |   0.34     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_medicine                    |   0.421965 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_physics                     |   0.264706 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | computer_security                   |   0.56     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | conceptual_physics                  |   0.434043 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | econometrics                        |   0.307018 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | electrical_engineering              |   0.427586 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | elementary_mathematics              |   0.285714 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | formal_logic                        |   0.325397 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | global_facts                        |   0.41     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_biology                 |   0.512903 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_chemistry               |   0.349754 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_computer_science        |   0.45     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_european_history        |   0.606061 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_geography               |   0.520202 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_government_and_politics |   0.668394 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_macroeconomics          |   0.407692 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_mathematics             |   0.27037  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_microeconomics          |   0.403361 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_physics                 |   0.258278 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_psychology              |   0.592661 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_statistics              |   0.222222 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_us_history              |   0.578431 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_world_history           |   0.561181 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | human_aging                         |   0.565022 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | human_sexuality                     |   0.564885 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | international_law                   |   0.636364 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | jurisprudence                       |   0.546296 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | logical_fallacies                   |   0.521472 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | machine_learning                    |   0.339286 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | management                          |   0.514563 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | marketing                           |   0.67094  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | medical_genetics                    |   0.52     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | miscellaneous                       |   0.630907 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | moral_disputes                      |   0.523121 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | moral_scenarios                     |   0.250279 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | nutrition                           |   0.486928 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | philosophy                          |   0.553055 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | prehistory                          |   0.506173 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_accounting             |   0.368794 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_law                    |   0.34485  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_medicine               |   0.422794 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_psychology             |   0.45098  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | public_relations                    |   0.490909 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | security_studies                    |   0.420408 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | sociology                           |   0.666667 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | us_foreign_policy                   |   0.68     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | virology                            |   0.475904 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | world_religions                     |   0.649123 | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | bigbench_dyck_languages      |                                     |   0.291    | 5-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | winogrande                   |                                     |   0.73086  | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | agi_eval_lsat_ar             |                                     |   0.252174 | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | simple_arithmetic_nospaces   |                                     |   0.245    | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | simple_arithmetic_withspaces |                                     |   0.256    | 5-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | agi_eval_lsat_rc             |                                     |   0.373134 | 5-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | agi_eval_lsat_lr             |                                     |   0.329412 | 5-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | agi_eval_sat_en              |                                     |   0.368932 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | arc_challenge                |                                     |   0.514505 | 25-shot           | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | openbook_qa                  |                                     |   0.458    | 10-shot           | meta-llama/Llama-2-7b-hf |
| language_understanding   | hellaswag                    |                                     |   0.773053 | 10-shot           | meta-llama/Llama-2-7b-hf |
|                          | bigbench_cs_algorithms       |                                     |   0.44697  | 10-shot           | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | bigbench_elementary_math_qa  |                                     |   0.274371 | 1-shot            | meta-llama/Llama-2-7b-hf |

Baseline runs (pre-migration):

mpt 7b: eval-gauntlet-pre-migration-mpt-N3lIuF
Llama 2 7b: eval-gauntlet-pre-migration-llama-imFgAZ

| model_name               |   core_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |
|:-------------------------|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|
| meta-llama/Llama-2-7b-hf |       0.417124 |          0.510394 |                0.355677 |                 0.655805 |                   0.227475 |                0.336269 |
| mosaicml/mpt-7b |       0.343081 |          0.421662 |                0.256372 |                 0.634086 |                   0.155426 |                0.247861 |

| Category                 | Benchmark                    | Subtask                             |   Accuracy | Few-shot examples | Model                    |
|:-------------------------|:-----------------------------|:------------------------------------|-----------:|:------------------|:-------------------------|
| symbolic_problem_solving | gsm8k                        |                                     |  0.0871873 | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | copa                         |                                     |  0.8       | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | commonsense_qa               |                                     |  0.225225  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | piqa                         |                                     |  0.799238  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | bigbench_strange_stories     |                                     |  0.568965  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | bigbench_strategy_qa         |                                     |  0.561817  | 0-shot            | mosaicml/mpt-7b |
| language_understanding   | lambada_openai               |                                     |  0.702892  | 0-shot            | mosaicml/mpt-7b |
| language_understanding   | hellaswag                    |                                     |  0.761601  | 0-shot            | mosaicml/mpt-7b |
| reading_comprehension    | coqa                         |                                     |  0.453213  | 0-shot            | mosaicml/mpt-7b |
| reading_comprehension    | boolq                        |                                     |  0.747401  | 0-shot            | mosaicml/mpt-7b |
| world_knowledge          | triviaqa_sm_sub              |                                     |  0.493667  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | jeopardy                     | Average                             |  0.459835  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | american_history                    |  0.513317  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | literature                          |  0.557143  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | science                             |  0.386555  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | word_origins                        |  0.265753  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | world_history                       |  0.576407  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | bigbench_qa_wikidata         |                                     |  0.655824  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_easy                     |                                     |  0.718855  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_challenge                |                                     |  0.440273  | 3-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | siqa                         |                                     |  0.54913   | 3-shot            | mosaicml/mpt-7b |
| language_understanding   | winograd                     |                                     |  0.85348   | 3-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_operators           |                                     |  0.333333  | 3-shot            | mosaicml/mpt-7b |
| reading_comprehension    | squad                        |                                     |  0.553264  | 3-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | svamp                        |                                     |  0.32      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          | mmlu                         | Average                             |  0.281358  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | abstract_algebra                    |  0.26      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | anatomy                             |  0.303704  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | astronomy                           |  0.309211  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | business_ethics                     |  0.38      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | clinical_knowledge                  |  0.286792  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_biology                     |  0.291667  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_chemistry                   |  0.21      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_computer_science            |  0.25      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_mathematics                 |  0.31      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_medicine                    |  0.225434  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_physics                     |  0.215686  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | computer_security                   |  0.35      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | conceptual_physics                  |  0.289362  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | econometrics                        |  0.245614  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | electrical_engineering              |  0.324138  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | elementary_mathematics              |  0.272487  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | formal_logic                        |  0.222222  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | global_facts                        |  0.32      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_biology                 |  0.3       | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_chemistry               |  0.187192  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_computer_science        |  0.34      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_european_history        |  0.321212  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_geography               |  0.313131  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_government_and_politics |  0.264249  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_macroeconomics          |  0.266667  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_mathematics             |  0.211111  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_microeconomics          |  0.247899  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_physics                 |  0.291391  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_psychology              |  0.251376  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_statistics              |  0.208333  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_us_history              |  0.181373  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_world_history           |  0.253165  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | human_aging                         |  0.403587  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | human_sexuality                     |  0.259542  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | international_law                   |  0.347107  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | jurisprudence                       |  0.324074  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | logical_fallacies                   |  0.251534  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | machine_learning                    |  0.321429  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | management                          |  0.242718  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | marketing                           |  0.299145  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | medical_genetics                    |  0.22      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | miscellaneous                       |  0.301405  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | moral_disputes                      |  0.32659   | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | moral_scenarios                     |  0.259218  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | nutrition                           |  0.30719   | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | philosophy                          |  0.315113  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | prehistory                          |  0.302469  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_accounting             |  0.248227  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_law                    |  0.269231  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_medicine               |  0.198529  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_psychology             |  0.271242  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | public_relations                    |  0.381818  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | security_studies                    |  0.236735  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | sociology                           |  0.268657  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | us_foreign_policy                   |  0.36      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | virology                            |  0.349398  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | world_religions                     |  0.269006  | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_dyck_languages      |                                     |  0.304     | 5-shot            | mosaicml/mpt-7b |
| language_understanding   | winogrande                   |                                     |  0.722178  | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | agi_eval_lsat_ar             |                                     |  0.23913   | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | simple_arithmetic_nospaces   |                                     |  0.082     | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | simple_arithmetic_withspaces |                                     |  0.089     | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_lsat_rc             |                                     |  0.235075  | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_lsat_lr             |                                     |  0.247059  | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_sat_en              |                                     |  0.257282  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_challenge                |                                     |  0.4343    | 25-shot           | mosaicml/mpt-7b |
| commonsense_reasoning    | openbook_qa                  |                                     |  0.452     | 10-shot           | mosaicml/mpt-7b |
| language_understanding   | hellaswag                    |                                     |  0.765385  | 10-shot           | mosaicml/mpt-7b |
|                          | bigbench_cs_algorithms       |                                     |  0.480303  | 10-shot           | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_elementary_math_qa  |                                     |  0.281787  | 1-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | gsm8k                        |                                     |   0.148597 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | copa                         |                                     |   0.8      | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | commonsense_qa               |                                     |   0.383292 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | piqa                         |                                     |   0.786181 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | bigbench_strange_stories     |                                     |   0.614943 | 0-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | bigbench_strategy_qa         |                                     |   0.585408 | 0-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | lambada_openai               |                                     |   0.736658 | 0-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | hellaswag                    |                                     |   0.74995  | 0-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | coqa                         |                                     |   0.4705   | 0-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | boolq                        |                                     |   0.792966 | 0-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | triviaqa_sm_sub              |                                     |   0.582333 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | jeopardy                     | Average                             |   0.508028 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | american_history                    |   0.564165 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | literature                          |   0.661224 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | science                             |   0.388655 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | word_origins                        |   0.30411  | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | world_history                       |   0.621984 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | bigbench_qa_wikidata         |                                     |   0.693125 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | arc_easy                     |                                     |   0.757155 | 3-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | arc_challenge                |                                     |   0.494881 | 3-shot            | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | siqa                         |                                     |   0.730809 | 3-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | winograd                     |                                     |   0.879121 | 3-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | bigbench_operators           |                                     |   0.42381  | 3-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | squad                        |                                     |   0.532545 | 3-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | svamp                        |                                     |   0.42     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | mmlu                         | Average                             |   0.457122 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | abstract_algebra                    |   0.31     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | anatomy                             |   0.422222 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | astronomy                           |   0.460526 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | business_ethics                     |   0.48     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | clinical_knowledge                  |   0.418868 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_biology                     |   0.416667 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_chemistry                   |   0.28     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_computer_science            |   0.29     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_mathematics                 |   0.34     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_medicine                    |   0.421965 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | college_physics                     |   0.264706 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | computer_security                   |   0.56     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | conceptual_physics                  |   0.434043 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | econometrics                        |   0.307018 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | electrical_engineering              |   0.427586 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | elementary_mathematics              |   0.285714 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | formal_logic                        |   0.325397 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | global_facts                        |   0.41     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_biology                 |   0.512903 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_chemistry               |   0.349754 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_computer_science        |   0.45     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_european_history        |   0.606061 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_geography               |   0.520202 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_government_and_politics |   0.668394 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_macroeconomics          |   0.407692 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_mathematics             |   0.27037  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_microeconomics          |   0.403361 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_physics                 |   0.258278 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_psychology              |   0.592661 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_statistics              |   0.222222 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_us_history              |   0.578431 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | high_school_world_history           |   0.561181 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | human_aging                         |   0.565022 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | human_sexuality                     |   0.564885 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | international_law                   |   0.636364 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | jurisprudence                       |   0.546296 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | logical_fallacies                   |   0.521472 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | machine_learning                    |   0.339286 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | management                          |   0.514563 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | marketing                           |   0.67094  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | medical_genetics                    |   0.52     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | miscellaneous                       |   0.630907 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | moral_disputes                      |   0.523121 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | moral_scenarios                     |   0.250279 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | nutrition                           |   0.486928 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | philosophy                          |   0.553055 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | prehistory                          |   0.506173 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_accounting             |   0.368794 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_law                    |   0.34485  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_medicine               |   0.422794 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | professional_psychology             |   0.45098  | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | public_relations                    |   0.490909 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | security_studies                    |   0.420408 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | sociology                           |   0.666667 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | us_foreign_policy                   |   0.68     | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | virology                            |   0.475904 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          |                              | world_religions                     |   0.649123 | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | bigbench_dyck_languages      |                                     |   0.291    | 5-shot            | meta-llama/Llama-2-7b-hf |
| language_understanding   | winogrande                   |                                     |   0.73086  | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | agi_eval_lsat_ar             |                                     |   0.252174 | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | simple_arithmetic_nospaces   |                                     |   0.245    | 5-shot            | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | simple_arithmetic_withspaces |                                     |   0.256    | 5-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | agi_eval_lsat_rc             |                                     |   0.373134 | 5-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | agi_eval_lsat_lr             |                                     |   0.329412 | 5-shot            | meta-llama/Llama-2-7b-hf |
| reading_comprehension    | agi_eval_sat_en              |                                     |   0.368932 | 5-shot            | meta-llama/Llama-2-7b-hf |
| world_knowledge          | arc_challenge                |                                     |   0.514505 | 25-shot           | meta-llama/Llama-2-7b-hf |
| commonsense_reasoning    | openbook_qa                  |                                     |   0.458    | 10-shot           | meta-llama/Llama-2-7b-hf |
| language_understanding   | hellaswag                    |                                     |   0.773053 | 10-shot           | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | bigbench_cs_algorithms       |                                     |   0.44697  | 10-shot           | meta-llama/Llama-2-7b-hf |
| symbolic_problem_solving | bigbench_elementary_math_qa  |                                     |   0.274371 | 1-shot            | meta-llama/Llama-2-7b-hf |
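
For context on the "Number few shot" column: an n-shot evaluation prefixes each test example with n solved examples from the same task before asking the model to continue. A minimal sketch of that prompt assembly, assuming a simple Q/A-delimited format (the actual ICL dataset classes also handle tokenization, continuation delimiters, and batching):

```python
def build_few_shot_prompt(
    fewshot_pool: list[dict],
    test_question: str,
    num_fewshot: int,
) -> str:
    """Assemble an n-shot prompt: n solved examples, then the unsolved test question."""
    shots = fewshot_pool[:num_fewshot]
    parts = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in shots]
    parts.append(f"Q: {test_question}\nA:")
    return "\n\n".join(parts)

# Toy 2-shot example (questions and answers are made up for illustration):
pool = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "3 * 3 = ?", "answer": "9"},
]
print(build_few_shot_prompt(pool, "7 - 5 = ?", num_fewshot=2))
```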

CODE
Pre-migration: llama2-code-pre-migration-D3fXGe, mpt7b-code-pre-migration-x8nPTd

| Category   | Benchmark                 | Subtask   |   Accuracy | Number few shot   | Model                    |
|:-----------|:--------------------------|:----------|-----------:|:------------------|:-------------------------|
|            | human_eval                |           |  0.0853659 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_cpp            |           |  0.0372671 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_js             |           |  0.0487805 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_simple  |           |  0.675676  | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_complex |           |  0.220472  | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval                |           |  0.0426829 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_cpp            |           |  0.0372671 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_js             |           |  0.0121951 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_simple  |           |  0.675676  | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_complex |           |  0.251969  | 0-shot            | mosaicml/mpt-7b |

Post-migration: llama2-code-post-migration-aSqFno, mpt7b-code-post-migration-3N0tKy

| Category   | Benchmark                 | Subtask   |   Accuracy | Number few shot   | Model                    |
|:-----------|:--------------------------|:----------|-----------:|:------------------|:-------------------------|
|            | human_eval                |           |  0.0853659 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_cpp            |           |  0.0372671 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_js             |           |  0.0487805 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_simple  |           |  0.675676  | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_complex |           |  0.220472  | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval                |           |  0.0426829 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_cpp            |           |  0.0372671 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_js             |           |  0.0121951 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_simple  |           |  0.675676  | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_complex |           |  0.251969  | 0-shot            | mosaicml/mpt-7b |

@bmosaicml bmosaicml mentioned this pull request Feb 21, 2024
Contributor

@eitanturok eitanturok left a comment

Looks absolutely fire, best PR I've seen in my whole life :)

Will let Max do the final vetting.

@maxisawesome
Contributor

It appears that a single svamp example was different between pre and post, so I reran some runs on svamp only. They produced the same results before/after the migration, so I am confident about our results there:
llama2-svamp-post-migration-0tvV1U
llama2-svamp-pre-migration-HZAfmD
svamp: 0.346667 both times
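
For reference, the pre/post check above amounts to diffing per-benchmark accuracies between the two runs; a minimal sketch, assuming results have been collected into plain dicts (the dict contents below are just the svamp numbers quoted above):

```python
def diff_results(pre: dict[str, float], post: dict[str, float], tol: float = 1e-6) -> list[str]:
    """Return the benchmarks whose accuracy changed by more than `tol` between runs."""
    mismatches = []
    for name in sorted(set(pre) | set(post)):
        a, b = pre.get(name), post.get(name)
        if a is None or b is None or abs(a - b) > tol:
            mismatches.append(f"{name}: pre={a} post={b}")
    return mismatches

pre = {"svamp": 0.346667}   # llama2-svamp-pre-migration-HZAfmD
post = {"svamp": 0.346667}  # llama2-svamp-post-migration-0tvV1U
print(diff_results(pre, post) or "all benchmarks match")
```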

@maxisawesome
Contributor

Llama2 human_eval pre-migration llama2-code-pre-migration-D3fXGe:

Printing complete results for all models
| Category   | Benchmark                 | Subtask   |   Accuracy | Number few shot   | Model                    |
|:-----------|:--------------------------|:----------|-----------:|:------------------|:-------------------------|
|            | human_eval                |           |  0.0853659 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_cpp            |           |  0.0372671 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_js             |           |  0.0487805 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_simple  |           |  0.675676  | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_complex |           |  0.220472  | 0-shot            | meta-llama/Llama-2-7b-hf |

Llama2 human_eval post-migration llama2-code-post-migration-aSqFno:

Printing complete results for all models
| Category   | Benchmark                 | Subtask   |   Accuracy | Number few shot   | Model                    |
|:-----------|:--------------------------|:----------|-----------:|:------------------|:-------------------------|
|            | human_eval                |           |  0.0853659 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_cpp            |           |  0.0372671 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_js             |           |  0.0487805 | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_simple  |           |  0.675676  | 0-shot            | meta-llama/Llama-2-7b-hf |
|            | human_eval_return_complex |           |  0.220472  | 0-shot            | meta-llama/Llama-2-7b-hf |

Poorly named mpt pre-migration run: llama2-code-pre-migration-gAc90c

Printing complete results for all models
| Category   | Benchmark                 | Subtask   |   Accuracy | Number few shot   | Model           |
|:-----------|:--------------------------|:----------|-----------:|:------------------|:----------------|
|            | human_eval                |           |  0.097561  | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_cpp            |           |  0.0434783 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_js             |           |  0.0426829 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_simple  |           |  0.810811  | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_complex |           |  0.244094  | 0-shot            | mosaicml/mpt-7b |

mpt post-migration run: mpt7b-code-post-migration-oU1rq4

Printing complete results for all models
| Category   | Benchmark                 | Subtask   |   Accuracy | Number few shot   | Model           |
|:-----------|:--------------------------|:----------|-----------:|:------------------|:----------------|
|            | human_eval                |           |  0.097561  | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_cpp            |           |  0.0434783 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_js             |           |  0.0426829 | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_simple  |           |  0.810811  | 0-shot            | mosaicml/mpt-7b |
|            | human_eval_return_complex |           |  0.244094  | 0-shot            | mosaicml/mpt-7b |

@maxisawesome
Contributor

With these results I approve the PR!

@maxisawesome maxisawesome self-requested a review April 12, 2024 17:20
Contributor

@maxisawesome maxisawesome left a comment

Eval is the same before and after, and I approve.

Collaborator

@dakinggg dakinggg left a comment

Did not review the code super closely, relying on Max's review for that. Results look good to me.

@maxisawesome maxisawesome merged commit 3729ba3 into main Apr 12, 2024
9 checks passed
@maxisawesome maxisawesome deleted the migrate_subclasses_to_foundry branch April 12, 2024 21:00
KuuCi pushed a commit that referenced this pull request Apr 18, 2024
* start

* still need to migrate fixtures

* wip onboarding tests

* still workin'

* still wip

* maybe done; test out on mcli now

* mcli

* remove calibration error

* migration

* migration

* full migration

* precommit

* fix

* fix pytests

* refactor QA

* update

* restore

* add

* fix

* wip

* update readme

* final pyright

* done

* pass prelimiter into ALL the ICL task datasets

* allow QA task name still for backward compatibility

* fix

* fix test

* add generation length

* remove max_new_tokens

* fix cpu tests

* try and fix lm eval test

* temp disable lm task eval test

* fix test?

* fix test

* finish

* fix

* Update scripts/eval/README.md

Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>

* fix comments

* fix bug with seq len

* restore mcli

* merge

* fix builder

* add deprecation warning

* add deprecation warning

* merge

* merge

* add logging necessities to nlp.py

* add attention_mask test update

* fix generation_length in tests

* fix bug

* restore yamls

* fix typos

* add deprecation warning for code

* pyright wip

* fix pyright

* fix pyright error again

* fix pyright

* fix pyright

* update version

---------

Co-authored-by: Eitan Turok <150733043+eitanturok@users.noreply.github.com>
Co-authored-by: Max Marion <mmarion538@gmail.com>
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
Co-authored-by: Max Marion <max.marion@databricks.com>
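
The backward-compatibility commits above ("allow QA task name still for backward compatibility", "add deprecation warning") amount to a shim along these lines; a minimal sketch with illustrative names, not the exact foundry code:

```python
import warnings

class InContextLearningGenerationDataset:
    """Stand-in for the migrated generation-task dataset class."""

    def __init__(self, *args, **kwargs):
        self.args, self.kwargs = args, kwargs

def build_icl_dataset(icl_task_type: str, *args, **kwargs):
    # Accept the legacy QA task name, but emit a deprecation warning.
    if icl_task_type == "question_answering":
        warnings.warn(
            "ICL task type 'question_answering' is deprecated; "
            "use 'generation_task_with_answers' instead.",
            DeprecationWarning,
        )
        icl_task_type = "generation_task_with_answers"
    if icl_task_type == "generation_task_with_answers":
        return InContextLearningGenerationDataset(*args, **kwargs)
    raise ValueError(f"Unknown ICL task type: {icl_task_type}")
```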