[DEV] Analyzing length-controlled metrics. (#231)
* [MODELS] add all the verbose/concise models

* [DATA] add all the gameable results

* remove all the gameable models from leaderboard

* Add length correction notebook.

* Pass tests
YannDubs authored Feb 11, 2024
1 parent 45ec7c0 commit 34e01fa
Showing 55 changed files with 827,047 additions and 3,257 deletions.
8 changes: 4 additions & 4 deletions docs/data_AlpacaEval/alpaca_eval_gpt4_leaderboard.csv
@@ -1,16 +1,16 @@
name,win_rate,avg_length,link,samples,filter
-GPT-4 Turbo,97.69900497512438,2049,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt4_turbo/model_outputs.json,minimal
+GPT-4 Turbo,97.69900497512438,2049,,,minimal
XwinLM 70b V0.3,97.636815920398,2113,https://github.com/Xwin-LM/Xwin-LM,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/xwinlm-70b-v0.3/model_outputs.json,community
Mistral Medium,96.83229813664596,1500,https://mistral.ai/news/la-plateforme/,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/mistral-medium/model_outputs.json,minimal
XwinLM 70b V0.1,95.56803995,1775,https://github.com/Xwin-LM/Xwin-LM,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/xwinlm-70b-v0.1/model_outputs.json,community
PairRM 0.4B+Tulu 2+DPO 70B (best-of-16),95.39800995024876,1607,https://huggingface.co/llm-blender/PairRM,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/pairrm-tulu-2-70b/model_outputs.json,community
GPT-4,95.27950311,1365,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt4/model_outputs.json,minimal
Tulu 2+DPO 70B,95.03105590062113,1418,https://huggingface.co/allenai/tulu-2-dpo-70b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/tulu-2-dpo-70b/model_outputs.json,community
-GPT-4 0314,94.78260869565216,1371,,,verified
+GPT-4 0314,94.78260869565216,1371,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt4_0314/model_outputs.json,verified
Mixtral 8x7B v0.1,94.78260869565216,1465,https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Mixtral-8x7B-Instruct-v0.1/model_outputs.json,minimal
Yi 34B Chat,94.08468244084682,2123,https://huggingface.co/01-ai/Yi-34B-Chat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Yi-34B-Chat/model_outputs.json,verified
GPT-4 0613,93.78109452736318,1140,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt4_0613/model_outputs.json,verified
-GPT 3.5 Turbo 0613,93.41614906832298,1328,,,verified
+GPT 3.5 Turbo 0613,93.41614906832298,1328,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt-3.5-turbo-16k-0613/model_outputs.json,verified
PairRM 0.4B+Zephyr 7B Beta (best-of-16),93.40796019900498,1487,https://huggingface.co/llm-blender/PairRM,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/pairrm-zephyr-7b-beta/model_outputs.json,community
Mistral 7B v0.2,92.77708592777088,1676,https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/Mistral-7B-Instruct-v0.2/model_outputs.json,minimal
LLaMA2 Chat 70B,92.66169154,1790,https://ai.meta.com/llama/,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/llama-2-70b-chat-hf/model_outputs.json,minimal
@@ -39,7 +39,7 @@ OpenChat V2-W 13B,87.12686567,1566,https://github.com/imoneoi/openchat,https://g
Claude 2.1,87.0807453416149,1096,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/claude-2.1/model_outputs.json,minimal
OpenBuddy-LLaMA-65B-v8,86.53366584,1162,https://huggingface.co/OpenBuddy/openbuddy-llama-65b-v8-bf16,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openbuddy-llama-65b-v8/model_outputs.json,community
WizardLM 13B V1.1,86.31840796,1525,https://huggingface.co/WizardLM/WizardLM-13B-V1.1,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/wizardlm-13b-v1.1/model_outputs.json,community
-GPT 3.5 Turbo 1106,86.25621890547264,796,,,verified
+GPT 3.5 Turbo 1106,86.25621890547264,796,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/gpt-3.5-turbo-1106/model_outputs.json,verified
Zephyr 7B Alpha,85.7587064676617,1302,https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/zephyr-7b-alpha/model_outputs.json,community
OpenChat V2 13B,84.9689441,1564,https://github.com/imoneoi/openchat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openchat-v2-13b/model_outputs.json,community
Tulu 2+DPO 7B,84.22360248447205,1663,https://huggingface.co/allenai/tulu-2-dpo-7b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/tulu-2-dpo-7b/model_outputs.json,community

2 changes: 1 addition & 1 deletion docs/index.html
@@ -317,7 +317,7 @@ <h2>AlpacaEval limitations</h2>
if (row['name'] || row['win_rate'] || row['avg_length']) {
let filter = row['filter'];

-if ((communityRadio.checked) ||
+if ((communityRadio.checked && (filter === 'verified' || filter === 'minimal' || filter === 'community')) ||
(verifiedRadio.checked && (filter === 'verified' || filter === 'minimal'))) {

const tr = document.createElement('tr');
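
The updated condition in docs/index.html can be read as a small allow-list check: in community mode a row is shown only if its `filter` is one of `verified`, `minimal`, or `community`, so a row with any other `filter` value is hidden in both modes. The sketch below restates that logic as a standalone function. The name `isRowVisible` and the `mode` argument are hypothetical stand-ins for the radio-button state and are not part of the commit.

```javascript
// Hypothetical helper (not in the commit): restates the updated condition
// from docs/index.html as a pure function.
// `mode` is an assumed stand-in for which radio button is checked:
// 'community' for communityRadio, 'verified' for verifiedRadio.
function isRowVisible(filter, mode) {
  // Filter values accepted in each mode, taken from the diff above.
  const communityFilters = ['verified', 'minimal', 'community'];
  const verifiedFilters = ['verified', 'minimal'];

  if (mode === 'community') {
    return communityFilters.includes(filter);
  }
  if (mode === 'verified') {
    return verifiedFilters.includes(filter);
  }
  // Neither radio button checked: nothing is shown.
  return false;
}
```

Before this change the community branch accepted every row regardless of its `filter` value; with the allow-list, only the three named values pass.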