
Question about mantis-eval matching criteria #7

Closed
azshue opened this issue Jun 4, 2024 · 2 comments

Comments

@azshue

azshue commented Jun 4, 2024

Hi,

Thank you for open-sourcing this great work. I appreciate the team's efforts in putting this together.

I have a question about the evaluation criteria in Mantis-Eval, specifically for "short-answer" questions. It looks like the correctness of a "short-answer" response is judged by an exact match between the model's output and the reference answer, without further parsing (see the edit below). But the prompt template for this question type also instructs the model to output both an analysis and a final answer.

In this case, I noticed that a model would give the correct answer (for example, "Yes") followed by some reasoning, but such an answer wouldn't be counted as correct because of how the exact match works.

Could you help me understand why it's written like this? Does it make sense to improve the matching rule? Thanks.

Edit:
I just saw that the model's output is parsed to keep only the text after "Final Answer: ". That makes much more sense. However, I noticed that sometimes a model answers correctly but with more than one word.
Do you think it makes sense to loosen the matching criteria? Alternatively, it might also make sense to clarify the instruction in the prompt template, for example by adding a sentence like "Answer the question in a single word or phrase."

@wenhuchen
Contributor

Thanks for the suggestion. This makes a lot of sense. I think the Boolean and numerical ones can be matched more flexibly. It should boost the final score a bit.
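A looser matcher along these lines could look like the sketch below. This is not the actual Mantis-Eval parsing script; the `Final Answer:` marker follows the prompt template discussed above, and the boolean/numerical rules are one possible interpretation of "matched more flexibly":

```python
import re


def parse_final_answer(output: str) -> str:
    """Keep only the text after the last 'Final Answer:' marker, if present."""
    marker = "Final Answer:"
    idx = output.rfind(marker)
    answer = output[idx + len(marker):] if idx != -1 else output
    return answer.strip()


def flexible_match(prediction: str, reference: str) -> bool:
    """Looser exact match for boolean and numerical short answers (a sketch)."""
    pred = parse_final_answer(prediction).lower().strip(" .")
    ref = reference.lower().strip(" .")
    if pred == ref:
        return True
    # Boolean: accept "yes"/"no" as the first word of a longer answer,
    # e.g. "Yes, because both images show cats."
    if ref in ("yes", "no"):
        return bool(pred) and pred.split()[0].rstrip(",.") == ref
    # Numerical: compare the first number found in each string,
    # so "42" matches "42.0".
    nums_pred = re.findall(r"-?\d+\.?\d*", pred)
    nums_ref = re.findall(r"-?\d+\.?\d*", ref)
    if nums_pred and nums_ref:
        return float(nums_pred[0]) == float(nums_ref[0])
    return False
```

A strict equality check would reject the first two cases below; the loosened rules accept them while still rejecting genuinely wrong answers.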

@jdf-prog
Collaborator

@azshue Thanks for raising the concern. Yes, our exact-matching rules still have room for improvement. We encourage you to modify the parsing script to make it better and more reasonable.

Besides, it's worth noting that short-answer questions make up only a small portion of Mantis-Eval (7.8%, per the Hugging Face dataset viewer statistics), so the performance boost will be somewhat limited. However, better exact-matching rules are still welcome.

@jdf-prog jdf-prog closed this as completed Aug 5, 2024