
Question about mantis-eval matching criteria #7

Closed
azshue opened this issue Jun 4, 2024 · 2 comments

Comments

@azshue

azshue commented Jun 4, 2024

Hi,

Thank you for open-sourcing this great work. I appreciate the team's efforts in putting this together.

I have a question about the evaluation criteria in Mantis-Eval, specifically for "short-answer" questions. It looks like the correctness of a "short-answer" response is judged by an exact match between the model's output and the reference answer, without further parsing (see the edit below). But the prompt template for this question type also instructs the model to output both an analysis and a final answer.

In this case, I noticed that a model would give the correct answer (for example, "Yes") followed by some reasoning, but such an answer wouldn't be counted as correct because of how the exact match works.

Could you help me understand why it's written like this? Does it make sense to improve the matching rule? Thanks.

Edit:
I just saw that the model's output is parsed to keep only the text after "Final Answer: ". That makes much more sense. However, I noticed that sometimes a model answers correctly but with more than one word.
Do you think it makes sense to loosen the matching criteria? Alternatively, it might also make sense to clarify the instruction in the prompt template, for example by adding a sentence like "Answer the question in a single word or phrase."

@wenhuchen
Contributor

Thanks for the suggestion. This makes a lot of sense. I think the Boolean and numerical ones can be matched more flexibly. It should boost the final score a bit.
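A looser matcher along these lines could look like the sketch below. This is not the actual Mantis-Eval parsing script; the `Final Answer:` marker follows the prompt template discussed above, and the boolean/numerical rules are one possible interpretation of "matched more flexibly":

```python
import re


def parse_final_answer(output: str) -> str:
    """Keep only the text after the last 'Final Answer:' marker, if present."""
    marker = "Final Answer:"
    idx = output.rfind(marker)
    answer = output[idx + len(marker):] if idx != -1 else output
    return answer.strip()


def flexible_match(prediction: str, reference: str) -> bool:
    """Looser exact match for boolean and numerical short answers (a sketch)."""
    pred = parse_final_answer(prediction).lower().strip(" .")
    ref = reference.lower().strip(" .")
    if pred == ref:
        return True
    # Boolean: accept "yes"/"no" as the first word of a longer answer,
    # e.g. "Yes, because both images show cats."
    if ref in ("yes", "no"):
        return bool(pred) and pred.split()[0].rstrip(",.") == ref
    # Numerical: compare the first number found in each string,
    # so "42" matches "42.0".
    nums_pred = re.findall(r"-?\d+\.?\d*", pred)
    nums_ref = re.findall(r"-?\d+\.?\d*", ref)
    if nums_pred and nums_ref:
        return float(nums_pred[0]) == float(nums_ref[0])
    return False
```

A strict equality check would reject the first two cases below; the loosened rules accept them while still rejecting genuinely wrong answers.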

@jdf-prog
Collaborator

@azshue Thanks for raising the concern. Yes, our exact-matching rules still have room for improvement. We encourage you to modify the parsing script to make it better and more reasonable.

Besides, it's worth noting that short-answer questions make up only a small portion of Mantis-Eval (7.8%, per the Hugging Face dataset viewer statistics), so the performance boost will be somewhat limited. However, better exact-matching rules are still welcome.

@jdf-prog jdf-prog closed this as completed Aug 5, 2024