
Model gauntlet #308

Merged: 39 commits merged into main from model_gauntlet on Jun 29, 2023
Conversation

bmosaicml (Contributor) commented Jun 9, 2023

Created the model gauntlet.

This PR makes a number of significant changes: it checks in 38 datasets, adds a callback that computes model gauntlet scores from a large number of benchmarks, and documents the model gauntlet datasets in eval/local_data/README.md.
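As a rough, hypothetical sketch of what a callback that "computes model gauntlet scores" could do, the snippet below groups per-benchmark accuracies into the categories visible in the results tables and averages them. The dictionary layout, function name, and the lack of any baseline rescaling are assumptions made for illustration only; this is not the PR's actual callback or YAML schema.

```python
# Illustrative sketch only: category -> benchmark grouping mirrors the results
# tables in this PR, but the structure and names here are assumptions, not the
# actual implementation added by this PR.
from statistics import mean

GAUNTLET_CATEGORIES = {
    "world_knowledge": [
        "jeopardy", "bigbench_qa_wikidata", "arc_easy",
        "arc_challenge", "mmlu", "bigbench_misconceptions",
    ],
    "commonsense_reasoning": [
        "piqa", "bigbench_novel_concepts", "bigbench_strange_stories",
        "bigbench_strategy_qa", "copa", "openbook_qa",
    ],
    "language_understanding": [
        "hellaswag", "bigbench_conlang_translation",
        "bigbench_language_identification", "bigbench_conceptual_combinations",
        "lambada_openai", "winograd", "winogrande",
    ],
    # remaining categories (symbolic_problem_solving, reading_comprehension,
    # programming) omitted for brevity
}

def category_scores(accuracies: dict[str, float]) -> dict[str, float]:
    """Average each category's benchmark accuracies (no baseline rescaling in this sketch)."""
    return {
        cat: mean(accuracies[b] for b in benchmarks)
        for cat, benchmarks in GAUNTLET_CATEGORIES.items()
    }
```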

Eval runs successfully and produces correct results.

Gauntlet results for all models:

| model_name               |   average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |   programming |
|:-------------------------|----------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|--------------:|
| mosaicml/mpt-7b-instruct |  0.303923 |          0.400286 |                0.415097 |                 0.422248 |                   0.171216 |                0.414691 |             0 |
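For this run, the reported average matches the unweighted mean of the six category scores. A minimal check (the averaging rule is inferred from these numbers, not taken from the callback's code):

```python
from statistics import mean

# Category scores copied from the table above for mosaicml/mpt-7b-instruct.
category_scores = {
    "world_knowledge": 0.400286,
    "commonsense_reasoning": 0.415097,
    "language_understanding": 0.422248,
    "symbolic_problem_solving": 0.171216,
    "reading_comprehension": 0.414691,
    "programming": 0.0,
}

print(round(mean(category_scores.values()), 6))  # 0.303923, matching the reported average
```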

Complete results for all models:

| Category                 | Benchmark                        | Subtask                             |   Accuracy |   Number few shot | Model                    |
|:-------------------------|:---------------------------------|:------------------------------------|-----------:|------------------:|:-------------------------|
| world_knowledge          | jeopardy                         | Average                             |  0.458112  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | american_history                    |  0.51816   |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | literature                          |  0.540816  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | science                             |  0.34874   |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | word_origins                        |  0.287671  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | world_history                       |  0.595174  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          | bigbench_qa_wikidata             |                                     |  0.694503  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          | arc_easy                         |                                     |  0.748737  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          | arc_challenge                    |                                     |  0.47099   |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          | mmlu                             | Average                             |  0.312989  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | abstract_algebra                    |  0.31      |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | anatomy                             |  0.311111  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | astronomy                           |  0.315789  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | business_ethics                     |  0.26      |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | clinical_knowledge                  |  0.316981  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | college_biology                     |  0.256944  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | college_chemistry                   |  0.33      |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | college_computer_science            |  0.29      |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | college_mathematics                 |  0.29      |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | college_medicine                    |  0.271676  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | college_physics                     |  0.264706  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | computer_security                   |  0.37      |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | conceptual_physics                  |  0.33617   |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | econometrics                        |  0.192982  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | electrical_engineering              |  0.324138  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | elementary_mathematics              |  0.259259  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | formal_logic                        |  0.301587  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | global_facts                        |  0.35      |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_biology                 |  0.33871   |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_chemistry               |  0.270936  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_computer_science        |  0.29      |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_european_history        |  0.30303   |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_geography               |  0.388889  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_government_and_politics |  0.362694  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_macroeconomics          |  0.325641  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_mathematics             |  0.288889  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_microeconomics          |  0.331933  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_physics                 |  0.311258  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_psychology              |  0.308257  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_statistics              |  0.388889  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_us_history              |  0.27451   |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_world_history           |  0.261603  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | human_aging                         |  0.372197  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | human_sexuality                     |  0.374046  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | international_law                   |  0.31405   |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | jurisprudence                       |  0.342593  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | logical_fallacies                   |  0.226994  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | machine_learning                    |  0.241071  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | management                          |  0.339806  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | marketing                           |  0.320513  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | medical_genetics                    |  0.34      |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | miscellaneous                       |  0.386973  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | moral_disputes                      |  0.323699  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | moral_scenarios                     |  0.251397  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | nutrition                           |  0.366013  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | philosophy                          |  0.37299   |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | prehistory                          |  0.33642   |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | professional_accounting             |  0.27305   |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | professional_law                    |  0.273794  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | professional_medicine               |  0.220588  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | professional_psychology             |  0.287582  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | public_relations                    |  0.418182  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | security_studies                    |  0.334694  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | sociology                           |  0.308458  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | us_foreign_policy                   |  0.37      |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | virology                            |  0.385542  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | world_religions                     |  0.263158  |                10 | mosaicml/mpt-7b-instruct |
| world_knowledge          | bigbench_misconceptions          |                                     |  0.60274   |                10 | mosaicml/mpt-7b-instruct |
| commonsense_reasoning    | piqa                             |                                     |  0.806311  |                10 | mosaicml/mpt-7b-instruct |
| commonsense_reasoning    | bigbench_novel_concepts          |                                     |  0.53125   |                10 | mosaicml/mpt-7b-instruct |
| commonsense_reasoning    | bigbench_strange_stories         |                                     |  0.701149  |                10 | mosaicml/mpt-7b-instruct |
| commonsense_reasoning    | bigbench_strategy_qa             |                                     |  0.59633   |                10 | mosaicml/mpt-7b-instruct |
| language_understanding   | hellaswag                        |                                     |  0.769767  |                10 | mosaicml/mpt-7b-instruct |
| language_understanding   | bigbench_conlang_translation     |                                     |  0.0426829 |                10 | mosaicml/mpt-7b-instruct |
| language_understanding   | bigbench_language_identification |                                     |  0.2568    |                10 | mosaicml/mpt-7b-instruct |
| language_understanding   | bigbench_conceptual_combinations |                                     |  0.320388  |                10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_elementary_math_qa      |                                     |  0.270466  |                10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_dyck_languages          |                                     |  0.314     |                10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_cs_algorithms           |                                     |  0.496212  |                10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_logical_deduction       |                                     |  0.262667  |                10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_operators               |                                     |  0.352381  |                10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_repeat_copy_logic       |                                     |  0.3125    |                10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | simple_arithmetic_nospaces       |                                     |  0.078     |                10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | simple_arithmetic_withspaces     |                                     |  0.086     |                10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | math_qa                          |                                     |  0.257459  |                10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | logi_qa                          |                                     |  0.264209  |                10 | mosaicml/mpt-7b-instruct |
| reading_comprehension    | pubmed_qa_labeled                |                                     |  0.59      |                10 | mosaicml/mpt-7b-instruct |
| reading_comprehension    | squad                            |                                     |  0.586944  |                10 | mosaicml/mpt-7b-instruct |
| reading_comprehension    | bigbench_understanding_fables    |                                     |  0.195767  |                10 | mosaicml/mpt-7b-instruct |
| reading_comprehension    | boolq                            |                                     |  0.777064  |                10 | mosaicml/mpt-7b-instruct |
| commonsense_reasoning    | copa                             |                                     |  0.83      |                 0 | mosaicml/mpt-7b-instruct |
| commonsense_reasoning    | openbook_qa                      |                                     |  0.436     |                 0 | mosaicml/mpt-7b-instruct |
| language_understanding   | lambada_openai                   |                                     |  0.69086   |                 0 | mosaicml/mpt-7b-instruct |
| language_understanding   | winograd                         |                                     |  0.846154  |                 0 | mosaicml/mpt-7b-instruct |
| language_understanding   | winogrande                       |                                     |  0.67719   |                 0 | mosaicml/mpt-7b-instruct |

@bmosaicml force-pushed the model_gauntlet branch 5 times, most recently from 7d0eb16 to f464b26 on June 10, 2023 at 14:08
@bmosaicml force-pushed the model_gauntlet branch 4 times, most recently from f0c7c54 to da4fb0c on June 20, 2023 at 02:18
Files with review threads:
- mcli/mcli-hf-eval.yaml (outdated, resolved)
- scripts/eval/yamls/model_gauntlet.yaml (resolved)
- mcli/mcli-hf-eval.yaml (outdated, resolved)
@vchiley merged commit ffcc568 into main on Jun 29, 2023
10 checks passed
@dakinggg deleted the model_gauntlet branch on October 11, 2023 at 21:32