
AgentTuning 7B evaluated on HH, results not as expected from the paper #39

Open
Dhaizei opened this issue Nov 6, 2023 · 13 comments

Comments

@Dhaizei

Dhaizei commented Nov 6, 2023

I tried https://huggingface.co/THUDM/agentlm-7b, but my result is far below the 84% reported for alfworld-std. Is it the wrong model?

@Dhaizei
Author

Dhaizei commented Nov 6, 2023

{
  "total": 50,
  "validation": {
    "running": 0.0,
    "completed": 0.1,
    "agent context limit": 0.0,
    "agent validation failed": 0.0,
    "agent invalid action": 0.62,
    "task limit reached": 0.28,
    "unknown": 0.0,
    "task error": 0.0,
    "average_history_length": 62.22,
    "max_history_length": 91,
    "min_history_length": 20
  },
  "custom": {
    "overall": {
      "total": 50,
      "pass": 5,
      "wrong": 45,
      "success_rate": 0.1
    }
  }
}
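For reference, a minimal sketch of how a result file like the one above can be summarized, assuming it is saved as result.json (the filename is an assumption):

import json

# Summarize an AgentBench-style result file; point the path at the actual output.
with open("result.json") as f:
    result = json.load(f)

overall = result["custom"]["overall"]
# success_rate is pass / total, e.g. 5 / 50 = 0.1 in the run above.
print("success rate:", overall["pass"] / overall["total"])

# Outcome breakdown (fractions of all runs); skip the history-length stats.
for outcome, value in result["validation"].items():
    if not outcome.endswith("history_length"):
        print(f"{outcome}: {value:.2f}")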

@lr-tsinghua11
Contributor

Your output suggests a mismatch in the evaluation setup. Please ensure that you're using the evaluation code from ./AgentBench.old as mentioned in the README, not the latest THUDM/AgentBench repo. Could you kindly provide your trajectories for a thorough review?

@Dhaizei
Author

Dhaizei commented Nov 13, 2023

Yes, I was using the latest version. Where should I send the trajectories?

@Dhaizei
Author

Dhaizei commented Nov 13, 2023

But I can reach 0.84 with GPT-4:

{
  "total": 50,
  "validation": {
    "running": 0.0,
    "completed": 0.84,
    "agent context limit": 0.0,
    "agent validation failed": 0.0,
    "agent invalid action": 0.04,
    "task limit reached": 0.12,
    "unknown": 0.0,
    "task error": 0.0,
    "average_history_length": 50.56,
    "max_history_length": 91,
    "min_history_length": 21
  },
  "custom": {
    "overall": {
      "total": 50,
      "pass": 42,
      "wrong": 8,
      "success_rate": 0.84
    }
  }
}

@Dhaizei
Author

Dhaizei commented Nov 13, 2023

Here are my trajectories on HH for a thorough review.
Link: https://pan.baidu.com/s/1Np291cysxDQDozzr4RiJDQ?pwd=1ijk
Extraction code: 1ijk

@lr-tsinghua11
Contributor

As mentioned in https://github.com/THUDM/AgentTuning#held-in-tasks:

The 6 held-in tasks are selected from AgentBench. However, since AgentBench is still under active development, the results from the latest branch might not fully reproduce the results reported in the paper. The evaluation code of this project is located in ./AgentBench.old.

Please use the AgentBench.old directory for agent task evaluation.

@Dhaizei
Author

Dhaizei commented Nov 17, 2023

But the score is far below the paper's result on the latest AgentBench test, which is a bit unexpected. Could you confirm that the uploaded model is okay?

@Dhaizei
Author

Dhaizei commented Nov 17, 2023

How many epochs did you train for?

@Btlmd
Member

Btlmd commented Nov 19, 2023

How many epochs did you train for?

The models are trained for 2k steps, batch size 64, sequence length 4096 with packing.
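For readers unfamiliar with the term, "packing" here means concatenating tokenized examples into fixed-length training sequences. A minimal illustrative sketch (not the authors' training code):

def pack(tokenized_examples, max_len=4096):
    """Greedily pack lists of token ids into fixed-length sequences."""
    buffer, packed = [], []
    for ids in tokenized_examples:
        buffer.extend(ids)
        while len(buffer) >= max_len:
            packed.append(buffer[:max_len])
            buffer = buffer[max_len:]
    return packed  # any trailing partial sequence is dropped in this sketch

# Toy example with max_len=4 instead of 4096:
print(pack([[1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12]], max_len=4))
# -> [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]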

@Dhaizei
Author

Dhaizei commented Nov 20, 2023

I used FastChat to fine-tune Llama 2, but the results were not ideal. Can the paper's results be achieved by fine-tuning with FastChat? Although the batch size I set is small (only 2), the improvement in task completion after fine-tuning is not significant. Do you have any suggestions?
In addition, ChatGLM3-6B can reach 64% on HH tasks, which also demonstrates the effectiveness of AgentTuning.
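One possible gap in the FastChat setup above is the effective batch size: with a per-device batch of 2, gradient accumulation is needed to approach the batch size of 64 that the authors report. A quick sanity check (the GPU count below is an assumption for illustration):

# Effective batch size = per-device batch * number of GPUs * accumulation steps.
per_device_batch = 2   # the setting mentioned above
num_gpus = 4           # assumption for illustration
target_batch = 64      # the authors' reported batch size

grad_accum = target_batch // (per_device_batch * num_gpus)
print("gradient_accumulation_steps =", grad_accum)  # -> 8
assert per_device_batch * num_gpus * grad_accum == target_batch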

@Dhaizei
Author

Dhaizei commented Nov 20, 2023

In addition, one of the AgentInstruct samples is invalid:
{
  "conversations": [
    {
      "from": "human",
      "loss": false,
      "value": "'''\n Menu\n\nModel S Model 3 Model X Model Y\nEmail Address\nZip Code\nContact\n'''\n\nBased on the HTML webpage above, try to complete the following task:\nTask: Schedule a demo drive for Model Y for Roy Adams with phone number 123-999-0000, email address RA@gmail.com and zip code 90001 in the United States.\nPrevious actions:\n[link] Demo Drive -> CLICK\n[button] Model Y -> CLICK\n[textbox] Last Name -> TYPE: Adams\n[textbox] First Name -> TYPE: Roy\n[textbox] Phone Number -> TYPE: 123-999-0000\nWhat should be the next action? Please select from the following choices (If the correct action is not in the page above, please select A. 'None of the above'):\n\nA. None of the above\nB. Menu\nC. Model Y\nD.\nE.\nF. Contact"
    },
    {
      "from": "gpt",
      "loss": true,
      "value": ""
    }
  ],
  "id": "mind2web_60"
}
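A minimal sketch for screening the dataset for samples like this, i.e. conversations with an empty assistant turn (the filename and field layout are taken from the example above; adjust to the actual release format):

import json

with open("agentinstruct.json") as f:  # filename is an assumption
    samples = json.load(f)

# Collect ids of samples whose "gpt" turn is empty, like mind2web_60 above.
bad = [
    sample["id"]
    for sample in samples
    if any(
        turn["from"] == "gpt" and not turn["value"].strip()
        for turn in sample["conversations"]
    )
]
print(len(bad), "invalid samples:", bad)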

@Dhaizei
Author

Dhaizei commented Nov 21, 2023

Since I achieved poor results after fine-tuning with FastChat, I intend to improve the model further by increasing the dataset size. My plan is to expand the dataset with the training split of the ALFWorld dataset and then evaluate again.
Can this approach be effective? Could you provide some advice?
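If it helps, a rough sketch of how an ALFWorld trajectory could be converted into the conversation format shown earlier in this thread (the trajectory structure here is hypothetical, and AgentTuning's actual data pipeline may differ):

def trajectory_to_sample(task_id, instruction, steps):
    """steps: list of (observation, agent_action) pairs; structure assumed."""
    conversations = []
    for i, (observation, action) in enumerate(steps):
        # Prepend the task instruction to the first observation.
        human = observation if i else f"{instruction}\n{observation}"
        conversations.append({"from": "human", "loss": False, "value": human})
        conversations.append({"from": "gpt", "loss": True, "value": action})
    return {"conversations": conversations, "id": task_id}

sample = trajectory_to_sample(
    "alfworld_train_0",  # hypothetical id
    "You are in a kitchen. Your task is to put a clean mug on the shelf.",
    [("You see a mug on the table.", "take mug from table")],
)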

@Dhaizei
Author

Dhaizei commented Nov 21, 2023

Is ALFWorld's prompt file "alfworld_multiturn_new.json" better than "alfworld_multiturn_react.json"?
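Not an answer, but a quick way to inspect how the two prompt files differ before choosing (the paths are assumptions; point them at your checkout):

import json

with open("alfworld_multiturn_new.json") as f:
    new_prompt = json.load(f)
with open("alfworld_multiturn_react.json") as f:
    react_prompt = json.load(f)

# Compare top-level structure; nothing beyond valid JSON is assumed here.
print("new:", type(new_prompt).__name__, "react:", type(react_prompt).__name__)
if isinstance(new_prompt, dict) and isinstance(react_prompt, dict):
    print("keys only in new:", set(new_prompt) - set(react_prompt))
    print("keys only in react:", set(react_prompt) - set(new_prompt))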
