Gauntlet v0.1.0 yaml fixes #748

Merged: 116 commits, merged Nov 21, 2023. Changes shown from 111 commits.

Commits (116)
a4c83d3 update yaml (bmosaicml, Aug 11, 2023)
fbf43b0 adding datasets (bmosaicml, Aug 18, 2023)
7d9c367 adding datasets (bmosaicml, Aug 18, 2023)
5054667 Merge branch 'model_gauntlet_v0.1' of github.com:mosaicml/llm-foundry… (bmosaicml, Aug 24, 2023)
b32224a added agi eval (bmosaicml, Aug 31, 2023)
46b27a3 test CoT eval (bmosaicml, Sep 5, 2023)
4d82180 fix broken eval yaml (bmosaicml, Sep 5, 2023)
b1bbff1 fix broken eval yaml (bmosaicml, Sep 5, 2023)
7554b01 Merge branch 'model_gauntlet_v0.1' of github.com:mosaicml/llm-foundry… (bmosaicml, Sep 7, 2023)
1845ffa merge main (bmosaicml, Sep 11, 2023)
1508a9f debugging (bmosaicml, Sep 11, 2023)
b10c16c debugging (bmosaicml, Sep 11, 2023)
85f081e commit (bmosaicml, Sep 28, 2023)
de28be4 commit (bmosaicml, Sep 28, 2023)
193b4e0 commit (bmosaicml, Sep 29, 2023)
e896c09 commit (bmosaicml, Sep 29, 2023)
a16aea1 commit (bmosaicml, Sep 29, 2023)
6f5fc2c restore mcli (bmosaicml, Oct 2, 2023)
7968ca4 adding simple tasks (mcarbin, Sep 22, 2023)
0f8b160 Merge branch 'main' into human_eval_simple (bmosaicml, Oct 3, 2023)
a84dda0 add simple human_eval (bmosaicml, Oct 3, 2023)
90a2dca fix yaml (bmosaicml, Oct 3, 2023)
4c54e3d Merge branch 'main' into mike/human-eval-simple (bmosaicml, Oct 3, 2023)
bcda9ef Merge branch 'main' into mike/human-eval-simple (bmosaicml, Oct 4, 2023)
48060e1 fix yaml (bmosaicml, Oct 4, 2023)
4eee5c1 Merge branch 'mike/human-eval-simple' of github.com:mcarbin/llm-found… (bmosaicml, Oct 4, 2023)
9534d73 remove breakpoint (bmosaicml, Oct 5, 2023)
51d1b72 remove breakpoint (bmosaicml, Oct 5, 2023)
e11bb34 change bsz (bmosaicml, Oct 6, 2023)
fb750f3 Merge branch 'main' into human_eval_simple (bmosaicml, Oct 9, 2023)
b50da7d merge main (bmosaicml, Oct 9, 2023)
4b29f92 Merge branch 'main' into mike/human-eval-simple (bmosaicml, Oct 9, 2023)
07c2c07 Merge branch 'main' into change_gauntlet_avging (bmosaicml, Oct 9, 2023)
841eac0 eval gauntlet cb (bmosaicml, Oct 9, 2023)
94100fb Merge branch 'main' into change_gauntlet_avging (bmosaicml, Oct 9, 2023)
211def8 Merge branch 'main' into mike/human-eval-simple (bmosaicml, Oct 9, 2023)
5cf9b89 Merge branch 'main' into change_gauntlet_avging (dakinggg, Oct 10, 2023)
9159dc8 Merge branch 'main' into change_gauntlet_avging (dakinggg, Oct 10, 2023)
043f473 Merge branch 'main' into mike/human-eval-simple (dakinggg, Oct 10, 2023)
d1744fe Merge branch 'main' into change_gauntlet_avging (dakinggg, Oct 10, 2023)
cdc2065 add udpated readme (bmosaicml, Oct 10, 2023)
eb09668 Merge branch 'main' into mike/human-eval-simple (bmosaicml, Oct 10, 2023)
cdbd3d1 Merge branch 'mike/human-eval-simple' of github.com:mcarbin/llm-found… (bmosaicml, Oct 10, 2023)
adeb790 Merge branch 'main' into change_gauntlet_avging (bmosaicml, Oct 10, 2023)
26107e8 merge (bmosaicml, Oct 10, 2023)
2fd5c16 Merge branch 'main' into mike/human-eval-simple (bmosaicml, Oct 10, 2023)
c8a21c6 merge2 (bmosaicml, Oct 10, 2023)
71cd420 fix precommit (bmosaicml, Oct 10, 2023)
ff68b53 Merge branch 'mike/human-eval-simple' of github.com:mcarbin/llm-found… (bmosaicml, Oct 10, 2023)
bd8a9f8 Merge branch 'main' into change_gauntlet_avging (bmosaicml, Oct 10, 2023)
47ebdfc add pii (bmosaicml, Oct 11, 2023)
2d17293 restor line (bmosaicml, Oct 11, 2023)
8b7abf1 restor line (bmosaicml, Oct 11, 2023)
ed3f52a finish non gen safety (bmosaicml, Oct 11, 2023)
f96bbce add execution predicrtion (bmosaicml, Oct 11, 2023)
6ff9133 add execution prediction (bmosaicml, Oct 11, 2023)
bb7a847 add execution prediction (bmosaicml, Oct 11, 2023)
6ee0f48 Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Oct 11, 2023)
f332091 change mosaicml reqs (bmosaicml, Oct 11, 2023)
bb89c97 change mosaicml reqs (bmosaicml, Oct 11, 2023)
966b1bb fix error (bmosaicml, Oct 11, 2023)
977b48f Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Oct 11, 2023)
e15246e comment (bmosaicml, Oct 11, 2023)
e40463b test smaller beams (bmosaicml, Oct 12, 2023)
4a4d021 tesT (bmosaicml, Oct 12, 2023)
c5e30ec tesT (bmosaicml, Oct 12, 2023)
19c8b2b tesT (bmosaicml, Oct 13, 2023)
441b004 Merge branch 'change_gauntlet_avging' into execution_prediction (bmosaicml, Oct 13, 2023)
3025517 add coding task (bmosaicml, Oct 13, 2023)
5e444e3 tesT (bmosaicml, Oct 13, 2023)
b55bdda finish eval (bmosaicml, Oct 16, 2023)
2d77a0e finish data (bmosaicml, Oct 19, 2023)
c025ac4 fix (bmosaicml, Oct 19, 2023)
2702e15 fix (bmosaicml, Oct 19, 2023)
d393603 Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Oct 24, 2023)
0350950 remove strategyqa cot (bmosaicml, Oct 24, 2023)
27754e8 merge (bmosaicml, Oct 24, 2023)
b00a806 remove (bmosaicml, Oct 25, 2023)
7b68b55 remove (bmosaicml, Oct 25, 2023)
57e35cd Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Oct 26, 2023)
814a8c9 foo (bmosaicml, Oct 26, 2023)
b142b57 edit (bmosaicml, Oct 27, 2023)
a56b30e Merge branch 'main' into execution_prediction (bmosaicml, Nov 2, 2023)
368d232 Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Nov 2, 2023)
d86ece1 fix (bmosaicml, Nov 2, 2023)
0477773 rm breakpoint (bmosaicml, Nov 2, 2023)
7fed6a2 rm breakpoint (bmosaicml, Nov 2, 2023)
48bf32b remove execution prediction; make coding optional (bmosaicml, Nov 9, 2023)
f2f59df remove execution prediction; make coding optional (bmosaicml, Nov 9, 2023)
01b3103 merge (bmosaicml, Nov 9, 2023)
20aeafd remove import (bmosaicml, Nov 9, 2023)
46b7d72 remove import (bmosaicml, Nov 9, 2023)
93fa556 restore files (bmosaicml, Nov 10, 2023)
f17f8c4 restore (bmosaicml, Nov 10, 2023)
a9606b4 restore (bmosaicml, Nov 10, 2023)
cd08024 Merge branch 'main' into execution_prediction (bmosaicml, Nov 10, 2023)
f0c0b9d update readm; rename gauntlet yamls (bmosaicml, Nov 12, 2023)
4572056 edit yamls (bmosaicml, Nov 12, 2023)
297bfa2 Merge branch 'main' into execution_prediction (bmosaicml, Nov 12, 2023)
e6b696f fix yamllint (bmosaicml, Nov 13, 2023)
a1f8ffb Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Nov 13, 2023)
6014dd4 restore mpt eval (bmosaicml, Nov 13, 2023)
90ef388 Merge branch 'main' into execution_prediction (bmosaicml, Nov 17, 2023)
1357cb8 Merge branch 'main' into execution_prediction (bmosaicml, Nov 20, 2023)
65d9604 finish (bmosaicml, Nov 20, 2023)
dc525be Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Nov 20, 2023)
d5f3139 follow up (bmosaicml, Nov 20, 2023)
00a0426 fix (bmosaicml, Nov 20, 2023)
5e2101d Merge branch 'main' into execution_prediction (dakinggg, Nov 21, 2023)
4bc2594 precommit (bmosaicml, Nov 21, 2023)
6eb14a5 Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Nov 21, 2023)
c1e2fc9 Merge branch 'main' into execution_prediction (bmosaicml, Nov 21, 2023)
970a4df Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Nov 21, 2023)
0d650c1 Merge branch 'main' into execution_prediction (dakinggg, Nov 21, 2023)
afbd992 precommit (bmosaicml, Nov 21, 2023)
fd890fe Merge branch 'execution_prediction' of github.com:mosaicml/llm-foundr… (bmosaicml, Nov 21, 2023)
scripts/eval/README.md (2 changes: 1 addition & 1 deletion)

@@ -1,6 +1,6 @@
 # In-context learning (ICL) evaluation

-This folder contains the MosaicML LLM evaluation suite. It is a [blazingly fast](https://www.mosaicml.com/blog/llm-evaluation-for-icl), multi-GPU-enabled ICL evaluation suite with native [FSDP](https://pytorch.org/docs/stable/fsdp.html) compatibility with any model on the HuggingFace hub and any PyTorch model that implements the [`ComposerModel` interface](https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.ComposerModel.html#composermodel). We also include collection of ICL datasets we refer to as our [Model Gauntlet](https://github.com/mosaicml/llm-foundry/blob/scripts/eval/local_data/eval_gauntlet.md) organized into 6 broad categories of competency that we expect good foundation models to have.
+This folder contains the MosaicML LLM evaluation suite. It is a [blazingly fast](https://www.mosaicml.com/blog/llm-evaluation-for-icl), multi-GPU-enabled ICL evaluation suite with native [FSDP](https://pytorch.org/docs/stable/fsdp.html) compatibility with any model on the HuggingFace hub and any PyTorch model that implements the [`ComposerModel` interface](https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.ComposerModel.html#composermodel). We also include collection of ICL datasets we refer to as our [Eval Gauntlet](https://github.com/mosaicml/llm-foundry/blob/scripts/eval/local_data/eval_gauntlet.md) organized into 6 broad categories of competency that we expect good foundation models to have.

 You can evaluate a model by preparing an evaluation YAML following the format of the examples in the [`scripts/eval/yamls` directory](https://github.com/mosaicml/llm-foundry/tree/main/scripts/eval/yamls).
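For readers following the README's pointer about preparing an evaluation YAML: the task-list format used by the YAMLs touched in this PR looks like the sketch below. It mirrors the copa entry added later in this diff and illustrates only the `icl_tasks` portion; a complete eval config (model, tokenizer, and so on) is not shown in this PR.

# Minimal sketch of the icl_tasks format used in scripts/eval/yamls/.
# Mirrors the copa entry added in this PR; not a complete eval config.
icl_tasks:
-
  label: copa
  dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl  # JSONL, one example per line
  num_fewshot: [0]                                               # zero-shot
  icl_task_type: multiple_choice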
scripts/eval/local_data/eval_gauntlet.md (per the README link above; the Eval Gauntlet README)

@@ -1,4 +1,4 @@
-# Mosaic Model Gauntlet v0 - Evaluation Suite
+# Mosaic Eval Gauntlet v0.1.0 - Evaluation Suite


 <!-- SETUPTOOLS_LONG_DESCRIPTION_HIDE_BEGIN -->
@@ -7,9 +7,9 @@
   <img alt="LLM Foundry" src="../../../assets/radar_blog.png" width="60%">
 </picture>
 <br>
-MPT-7B vs MPT-30B compared on the 6 categories of Model Gauntlet.
+MPT-7B vs MPT-30B compared on the 6 categories of Eval Gauntlet v0.
 </p>
-The Mosaic Model Gauntlet is MosaicML’s new technique for evaluating the quality of pretrained foundation models. The Model Gauntlet encompasses 35 different benchmarks collected from a variety of sources, and organized into 6 broad categories of competency that we expect good foundation models to have. We compiled the categories after an extensive review of existing LLM publications, and open source evaluation harnesses such as EleutherAI Eval Harness and Stanford CRFM’s HELM.
+The Mosaic Eval Gauntlet is MosaicML’s new technique for evaluating the quality of pretrained foundation models. The Eval Gauntlet encompasses 35 different benchmarks collected from a variety of sources, and organized into 6 broad categories of competency that we expect good foundation models to have. We compiled the categories after an extensive review of existing LLM publications, and open source evaluation harnesses such as EleutherAI Eval Harness and Stanford CRFM’s HELM.

 <br>
 While deciding which benchmarks to include, we had a few criteria in mind. We wanted benchmarks to require a broad range of skills that were useful for practical applications, we wanted them to come from a diverse range of sources, we wanted them to capture skills that have been traditionally emphasized by the research community as well as those that have been underexplored, and we wanted them to be evaluated via simple, unambiguous metrics such as exact match and multiple choice accuracy. The philosophy behind compiling aggregate scores as opposed to the more common approach of reporting individual metrics, is two-fold.
@@ -24,7 +24,7 @@ At evaluation time, we run all the benchmarks, average the subscores within each

 For example, if benchmark A has a random baseline accuracy of 25%, and the model achieved 30%, we would report this as (0.3 - 0.25)/(1-0.25) = 0.0667. This can be thought of as the accuracy above chance rescaled so the max is 1. For benchmarks in which the random guessing baseline accuracy is ~0 we report the accuracy as is. Note that with this rescaling, a model could technically score below 0 on a category as a whole, but we haven’t found this to occur with any of the models we’ve tested.

-This is version v0, in the coming weeks we will update the mixture to include more benchmarks.
+This is version v0.1.0 of the Eval Gauntlet.

 ### Reading Comprehension

@@ -349,7 +349,7 @@ The Safety category consists of benchmarks designed to assess model's toxicity,
 - Random baseline accuracy: 50%

 ### Programming
-Programming tasks evaluate the model's ability to understand code, write functionally correct code given a specification, simulate code, and document code. Right now we just have HumanEval but later versions will include more.
+Programming tasks evaluate the model's ability to understand code, write functionally correct code given a specification, simulate code, and document code. Right now we just have HumanEval but later versions will include more. By default the programming tasks are disabled in `scripts/eval/yamls/tasks.yaml` due to their long duration.

 51. HumanEval Python code generation
 - Description: HumanEval Python consists of 164 python programming challenges, in which the model is presented with the method signature and docstring comment for a python program and is expected to complete the program. We then test the resultant code’s functional correctness on a number of test input/output pairs.
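The rescaling described in the hunk above maps chance-level accuracy to 0 and perfect accuracy to 1. Restating the quoted example compactly in LaTeX (no new behavior, just the same arithmetic):

$$\mathrm{score} = \frac{\mathrm{acc} - \mathrm{baseline}}{1 - \mathrm{baseline}}, \qquad \text{e.g.}\quad \frac{0.30 - 0.25}{1 - 0.25} = \frac{0.05}{0.75} \approx 0.0667.$$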
scripts/eval/yamls/copa.yaml (new file, 5 additions & 0 deletions)

@@ -0,0 +1,5 @@
+-
+  label: copa
+  dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl
+  num_fewshot: [0]
+  icl_task_type: multiple_choice
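A standalone task file like this can then be referenced from an eval config rather than inlined. The path-valued form below is an assumption about how `icl_tasks` is resolved, not something this diff shows:

# Hypothetical eval-config fragment: reference the new task file by path.
# Whether icl_tasks accepts a path string here is an assumption.
icl_tasks: eval/yamls/copa.yaml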
scripts/eval/yamls/tasks.yaml (54 changes: 27 additions & 27 deletions)

@@ -1,69 +1,69 @@
 icl_tasks:
 -
   label: jeopardy
-  dataset_uri: eval/local_data/world_knowledge/jeopardy_all.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/world_knowledge/jeopardy_all.jsonl
   num_fewshot: [10]
   icl_task_type: language_modeling
   continuation_delimiter: "\nAnswer: " # this separates questions from answers
   has_categories: true
 -
   label: bigbench_qa_wikidata
-  dataset_uri: eval/local_data/world_knowledge/bigbench_qa_wikidata.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/world_knowledge/bigbench_qa_wikidata.jsonl
   num_fewshot: [10]
   icl_task_type: language_modeling
 -
   label: arc_easy
-  dataset_uri: eval/local_data/world_knowledge/arc_easy.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/world_knowledge/arc_easy.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
   continuation_delimiter: "\nAnswer: " # this separates questions from answers
 -
   label: arc_challenge
-  dataset_uri: eval/local_data/world_knowledge/arc_challenge.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/world_knowledge/arc_challenge.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
   continuation_delimiter: "\nAnswer: " # this separates questions from answers
 -
   label: mmlu
-  dataset_uri: eval/local_data/world_knowledge/mmlu.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/world_knowledge/mmlu.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
   continuation_delimiter: "\nAnswer: " # this separates questions from answers
   has_categories: true
 -
   label: bigbench_misconceptions
-  dataset_uri: eval/local_data/world_knowledge/bigbench_misconceptions.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/world_knowledge/bigbench_misconceptions.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
 -
   label: copa
-  dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl
   num_fewshot: [0]
   icl_task_type: multiple_choice
 -
   label: piqa
-  dataset_uri: eval/local_data/commonsense_reasoning/piqa.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/commonsense_reasoning/piqa.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
   continuation_delimiter: "\nAnswer: " # this separates questions from answers
 -
   label: openbook_qa
-  dataset_uri: eval/local_data/commonsense_reasoning/openbook_qa.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/commonsense_reasoning/openbook_qa.jsonl
   num_fewshot: [0]
   icl_task_type: multiple_choice
 -
   label: bigbench_novel_concepts
-  dataset_uri: eval/local_data/commonsense_reasoning/bigbench_novel_concepts.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/commonsense_reasoning/bigbench_novel_concepts.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
 -
   label: bigbench_strange_stories
-  dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strange_stories.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strange_stories.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
 -
   label: bigbench_strategy_qa
-  dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strategy_qa.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/commonsense_reasoning/bigbench_strategy_qa.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
 -
@@ -73,17 +73,17 @@ icl_tasks:
   icl_task_type: language_modeling
 -
   label: hellaswag
-  dataset_uri: eval/local_data/language_understanding/hellaswag.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/language_understanding/hellaswag.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
 -
   label: winograd
-  dataset_uri: eval/local_data/language_understanding/winograd_wsc.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/language_understanding/winograd_wsc.jsonl
   num_fewshot: [0]
   icl_task_type: schema
 -
   label: winogrande
-  dataset_uri: eval/local_data/language_understanding/winogrande.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/language_understanding/winogrande.jsonl
   num_fewshot: [0]
   icl_task_type: schema
 -
@@ -154,84 +154,84 @@ icl_tasks:
   continuation_delimiter: "\nAnswer: " # this separates questions from answers
 -
   label: pubmed_qa_labeled
-  dataset_uri: eval/local_data/reading_comprehension/pubmed_qa_labeled.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/reading_comprehension/pubmed_qa_labeled.jsonl
   num_fewshot: [10]
   icl_task_type: language_modeling
 -
   label: squad
-  dataset_uri: eval/local_data/reading_comprehension/squad.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/reading_comprehension/squad.jsonl
   num_fewshot: [10]
   icl_task_type: language_modeling
 -
   label: bigbench_understanding_fables
-  dataset_uri: eval/local_data/reading_comprehension/bigbench_understanding_fables.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/reading_comprehension/bigbench_understanding_fables.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
 -
   label: boolq
-  dataset_uri: eval/local_data/reading_comprehension/boolq.jsonl # ADD YOUR OWN DATASET URI
+  dataset_uri: eval/local_data/reading_comprehension/boolq.jsonl
   num_fewshot: [10]
   icl_task_type: multiple_choice
   continuation_delimiter: "\nAnswer: " # this separates questions from answers
 # -
 #   label: human_eval
-#   dataset_uri: eval/local_data/programming/human_eval.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/human_eval.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
 #   batch_size: 1
 #   icl_task_type: code_evaluation
 # -
 #   label: human_eval_cpp
-#   dataset_uri: eval/local_data/programming/processed_human_eval_cpp.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/processed_human_eval_cpp.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
 #   batch_size: 1
 #   icl_task_type: code_evaluation
 # -
 #   label: human_eval_js
-#   dataset_uri: eval/local_data/programming/processed_human_eval_js.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/processed_human_eval_js.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
 #   batch_size: 1
 #   icl_task_type: code_evaluation
 # -
 #   label: human_eval_return_simple
-#   dataset_uri: eval/local_data/programming/human_eval_return_simple.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/human_eval_return_simple.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
 #   batch_size: 1
 #   icl_task_type: code_evaluation
 # -
 #   label: human_eval_return_complex
-#   dataset_uri: eval/local_data/programming/human_eval_return_complex.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/human_eval_return_complex.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
 #   batch_size: 1
 #   icl_task_type: code_evaluation
 # -
 #   label: human_eval_25
-#   dataset_uri: eval/local_data/programming/human_eval-0.25.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/human_eval-0.25.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
 #   batch_size: 1
 #   icl_task_type: code_evaluation
 # -
 #   label: human_eval_50
-#   dataset_uri: eval/local_data/programming/human_eval-0.5.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/human_eval-0.5.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
 #   batch_size: 1
 #   icl_task_type: code_evaluation
 # -
 #   label: human_eval_75
-#   dataset_uri: eval/local_data/programming/human_eval-0.75.jsonl # ADD YOUR OWN DATASET URI
+#   dataset_uri: eval/local_data/programming/human_eval-0.75.jsonl
 #   num_fewshot: [0]
 #   pass_at_k: 1
 #   num_beams: 20
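As the Eval Gauntlet README change above notes, the programming tasks ship commented out because of their long runtime. Re-enabling HumanEval amounts to uncommenting its block in scripts/eval/yamls/tasks.yaml; the restored entry would read as follows, with every field copied from the commented lines above:

-
  label: human_eval
  dataset_uri: eval/local_data/programming/human_eval.jsonl
  num_fewshot: [0]                # no in-context examples for code generation
  pass_at_k: 1                    # score pass@1
  num_beams: 20
  batch_size: 1
  icl_task_type: code_evaluation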