[Feature] Add reasonbench dataset #577
Conversation
Please provide the prompt for the generation setting (for chat models).
Also, please use pre-commit to lint the code.
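For reference, the usual pre-commit workflow (assuming the `.pre-commit-config.yaml` that OpenCompass ships at the repo root) is:

```bash
pip install pre-commit
pre-commit install          # run the hooks automatically on each commit
pre-commit run --all-files  # lint the whole tree once
```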
In the new commits, I provide prompts and configs to support generative inference, and merge datasets that are in the same category. After ensuring that OpenCompass is installed correctly and the datasets are prepared, we can evaluate the performance of the LLaMA-7b model on the ReasonBench datasets in a generative way using the following command:
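A plausible form of that command, assuming the generative configs are collected under the name `reasonbench_gen` (inferred by analogy with `reasonbench_ppl` in the Use cases below, not confirmed in the thread), is:

```bash
python run.py --models hf_llama_7b --datasets reasonbench_gen
```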
LGTM
* [Feature] Add reasonbench dataset
* add configs for supporting generative inference & merge datasets in the same category
* modify config filename to prompt version
* fix codes to meet pre-commit requirements
* lint the code to meet pre-commit requirements
* Align Load_data Sourcecode Briefly
* fix bugs
* reduce code redundancy
Motivation
Add a benchmark specifically designed to evaluate the reasoning ability of LLMs.
Modification
Add preprocessing methods and configs for several datasets, aimed at evaluating the abilities of inductive reasoning, deductive reasoning, abductive reasoning, causal reasoning, symbolic reasoning, and commonsense reasoning. A sketch of what such a preprocessing method looks like follows.
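For context, an OpenCompass dataset is typically a `BaseDataset` subclass registered with `LOAD_DATASET`, whose static `load` method returns a Hugging Face `Dataset`. Below is a minimal sketch of such a preprocessing method, assuming JSONL inputs; the class name and the `prompt`/`answer` field names are illustrative assumptions, not the PR's actual schema:

```python
import json

from datasets import Dataset

from opencompass.datasets.base import BaseDataset
from opencompass.registry import LOAD_DATASET


@LOAD_DATASET.register_module()
class ReasonBenchDatasetSketch(BaseDataset):
    """Illustrative loader: reads a JSONL file where each line holds one
    question and its gold label. Field names are assumptions."""

    @staticmethod
    def load(path: str) -> Dataset:
        rows = []
        with open(path, 'r', encoding='utf-8') as f:
            for line in f:
                item = json.loads(line)
                rows.append({
                    'prompt': item['prompt'],  # assumed field name
                    'answer': item['answer'],  # assumed field name
                })
        return Dataset.from_list(rows)
```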
BC-breaking (Optional)
N/A.
Use cases (Optional)
After ensuring that OpenCompass is installed correctly according to the previous steps and the datasets are prepared, you can evaluate the performance of the LLaMA-7b model on the ReasonBench datasets using the following command:
python run.py --models hf_llama_7b --datasets reasonbench_ppl
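Here the `_ppl` suffix selects the perplexity-based configs; the generative configs added in this PR would be selected with the corresponding `_gen` name (assumed to be `reasonbench_gen`, as noted above).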