Self taught evaluators project release #7

Merged: 56 commits, Sep 26, 2024
Changes from 34 commits
Commits
8627a36
add prompt templates and readme
xianxl Sep 23, 2024
1bf73bd
Update README.md
xianxl Sep 23, 2024
837e3f9
update readme
xianxl Sep 23, 2024
bf0d48e
Add files via upload
xianxl Sep 23, 2024
0adbfe1
Delete projects/self_taught_evaluator/figures/CoPE.png
xianxl Sep 23, 2024
eb66ca7
Update README.md
xianxl Sep 23, 2024
f86ca84
Add files via upload
xianxl Sep 23, 2024
dfcd8f4
Update README.md
xianxl Sep 23, 2024
ce14a98
Update README.md
xianxl Sep 24, 2024
a77cad3
more scripts
xianxl Sep 24, 2024
63235bc
Add files via upload
xianxl Sep 24, 2024
a2bdec4
Update README.md
xianxl Sep 24, 2024
13bc0d1
Delete projects/self_taught_evaluator/figures/self_taught_sft.png
xianxl Sep 24, 2024
4ee7e7e
Delete projects/self_taught_evaluator/figures/self_taught_sft.pdf
xianxl Sep 24, 2024
4310311
evals
Sep 25, 2024
947e800
use public model dirs
Sep 25, 2024
8158fe0
small fix
Sep 25, 2024
38de806
headers added
Sep 25, 2024
dbffc0c
rename dir
Sep 25, 2024
c442201
typo
Sep 25, 2024
3d18c8a
add training data prepration scripts
xianxl Sep 25, 2024
b5000eb
nits
Sep 25, 2024
a5eb8da
training config for dpo added
Sep 26, 2024
2a24bcf
nit
xianxl Sep 26, 2024
018b999
Update README.md
xianxl Sep 26, 2024
1b81556
Update README.md
xianxl Sep 26, 2024
0c53db4
add headers
xianxl Sep 26, 2024
d4ada58
headers
xianxl Sep 26, 2024
76d08ea
black
xianxl Sep 26, 2024
45eca68
minor fix
xianxl Sep 26, 2024
d90dabf
sft config added
Sep 26, 2024
da3d90a
Merge branch 'main' into self_taught
Sep 26, 2024
f240550
training data link in the readme
Sep 26, 2024
c3d5ac3
small fix
Sep 26, 2024
507df50
removing large files
Sep 26, 2024
220dd0e
Update projects/self_taught_evaluator/README.md
uralik Sep 26, 2024
491c055
Update projects/self_taught_evaluator/README.md
uralik Sep 26, 2024
4353c3d
Update projects/self_taught_evaluator/README.md
uralik Sep 26, 2024
c9b646c
Update projects/self_taught_evaluator/README.md
uralik Sep 26, 2024
3babf06
Update projects/self_taught_evaluator/src/utils.py
uralik Sep 26, 2024
d77db4f
Update projects/self_taught_evaluator/src/utils.py
uralik Sep 26, 2024
3666f35
Update projects/self_taught_evaluator/src/utils.py
uralik Sep 26, 2024
e36a20c
Update ram/data_utils.py
uralik Sep 26, 2024
51fb34f
Update projects/self_taught_evaluator/README.md
uralik Sep 26, 2024
797d20d
Update projects/self_taught_evaluator/README.md
uralik Sep 26, 2024
09a2b85
Update projects/self_taught_evaluator/README.md
uralik Sep 26, 2024
6d0b403
Update projects/self_taught_evaluator/run_rewardbench.sh
uralik Sep 26, 2024
9da2c6b
Update projects/self_taught_evaluator/run_inference_wvllm.sh
uralik Sep 26, 2024
13af802
Update projects/self_taught_evaluator/src/prepare_dpo_data.py
uralik Sep 26, 2024
17038db
Update projects/self_taught_evaluator/src/prepare_dpo_data.py
uralik Sep 26, 2024
3786ce0
Update projects/self_taught_evaluator/src/prepare_dpo_data.py
uralik Sep 26, 2024
4c0e090
Update projects/self_taught_evaluator/src/utils.py
uralik Sep 26, 2024
9a9f332
link update
Sep 26, 2024
5d59ad6
sorting
Sep 26, 2024
4165264
adding test scripts to preprocess hf dataset, adding fairseq2 asset c…
Sep 26, 2024
441d653
mentioning data processing script
Sep 26, 2024
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -7,7 +7,7 @@ repos:
- id: no-commit-to-branch
args: ['--branch', 'main']
- id: check-added-large-files
args: ['--maxkb=2000']
args: ['--maxkb=20000']
- id: check-merge-conflict
- id: detect-aws-credentials
args: ['--allow-missing-credentials']
22 changes: 11 additions & 11 deletions projects/README.md
@@ -8,7 +8,7 @@ Here we list projects undertaken in the RAM framework that are shared publicly,
- **Backtracking Improves Generation Safety** [[paper]](https://arxiv.org/abs/2409.14586).
_Trains LLMs to generate a RESET token if the partial-generation is bad._

- **Self-Taught Evaluators** [[paper]](https://arxiv.org/abs/2408.02666).
- **Self-Taught Evaluators** [[project]](./self_taught_evaluator).
_Improving LLM-as-a-Judge using iteratively generated synthetic data only (no human annotation)._

- **Source2Synth** [[paper]](https://arxiv.org/abs/2409.08239).
@@ -28,23 +28,23 @@ Here we list projects undertaken in the RAM framework that are shared publicly,

- **ToolVerifier** [[paper]](https://arxiv.org/abs/2402.14158).
_Generalization to New Tools via Self-Verification._

- **Chain-of-Verification Reduces Hallucination** [[paper]](https://arxiv.org/abs/2309.11495).
_Reduces hallucination by LLM self-identifying and verifying generated facts._

- **Branch-Solve-Merge** [[paper]](https://arxiv.org/abs/2310.15123).
_Reasoning method to improve LLM Evaluation and Generation._

- **Ask, Refine, Trust** [[paper]](https://arxiv.org/abs/2311.07961).
_Technique that uses critical questions to determine if an LLM generation needs refinement._


## Alignment

- **Meta-Rewarding LLMs** [[paper]](https://arxiv.org/abs/2407.19594)
_LLMs that can judge their own judgments to self-improve both acting & evaluating actions._

- **Iterative Reasoning Preference Optimization** [[paper]](https://arxiv.org/abs/2404.19733)
_Shows how to improve reasoning tasks with iterative DPO._

Expand All @@ -53,15 +53,15 @@ Here we list projects undertaken in the RAM framework that are shared publicly,

- **Self-Rewarding LLMs** [[paper]](https://arxiv.org/abs/2401.10020)
_Shows LLMs can judge themselves to self-improve without human feedback._

- **Iterative DPO & Cringe Loss** [[paper]](https://arxiv.org/abs/2312.16682)
_Shows iterative learning improves alignment._

- **Instruction Back-and-Forth Translation** [[paper]](https://arxiv.org/abs/2408.04614)
_Improves Instruction Backtranslation by rewriting the web document._

- **Instruction Backtranslation** [[paper]](https://arxiv.org/abs/2308.06259)
_Self-Alignment method by predicting instructions for web documents._

- **Leveraging Implicit Feedback** [[paper]](https://arxiv.org/abs/2307.14117)
_Method to learn from human feedback in dialogue deployment data to improve LLM._
@@ -74,7 +74,7 @@ Here we list projects undertaken in the RAM framework that are shared publicly,

- **Branch-Train-MiX** [[paper]](https://arxiv.org/abs/2403.07816)
_Novel MoE architecture that is very efficient during training._

- **Reverse Training** [[paper]](https://arxiv.org/abs/2403.13799)
_Method for pretraining that helps the reversal curse & improves performance._

85 changes: 85 additions & 0 deletions projects/self_taught_evaluator/README.md
@@ -0,0 +1,85 @@
# Self-Taught Evaluators

<p align="center"><img width="90%" src="figures/self_taught_dpo.png" /></p>

Instructions and materials presented here correspond to the [Self-taught evaluators](https://arxiv.org/abs/2408.02666) research project.

# Self-taught evaluator model release

**2024-09-26**

We release the self-taught evaluator model in the Hugging Face model repo: https://huggingface.co/facebook/Self-taught-evaluator-llama3.1-70B. This model is trained iteratively with supervised fine-tuning (SFT) and direct preference optimization (DPO).

## Inference and Evaluation

We provide example scripts that use the self-taught evaluator as a judge to choose the better response from a pair, along with a set of scripts to reproduce the RewardBench evaluation scores for this model. Please refer to [src/requirements.txt](./src/requirements.txt) for the required dependencies.

> [!IMPORTANT]
> This model was trained to judge a pair of responses using the specific prompt format from the RewardBench benchmark. Make sure to adopt the same prompt format when you run the model on your data.

#### Example: running the model on a given set of user inputs and pairs of assistant outputs.

1. Prepare your inputs in the same format as the ones in [example_inputs.jsonl](./src/example_inputs.jsonl).

2. Run the script [run_inference_wvllm.sh](./run_inference_wvllm.sh). The generated outputs and parsed judgements will be saved in `example_outputs.jsonl`. A minimal sketch of this flow is shown below.
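
For orientation, here is a minimal sketch of what this inference flow does, assuming vLLM as the backend (as `run_inference_wvllm.sh` suggests). The JSONL field names, the judge prompt template, and the verdict-parsing regex are illustrative assumptions; the exact RewardBench-style prompt construction and parsing live in `src/utils.py` and the provided scripts, which you should prefer in practice.

```python
# Minimal sketch: field names, prompt template, and verdict parsing are assumptions;
# the real format is produced by src/utils.py and the provided scripts.
import json
import re

from vllm import LLM, SamplingParams

MODEL = "facebook/Self-taught-evaluator-llama3.1-70B"

# Hypothetical pairwise-judge template; the released model expects the
# RewardBench-style prompt built by the repo's own utilities.
JUDGE_TEMPLATE = (
    "[User Question]\n{instruction}\n\n"
    "[Assistant A's Answer]\n{response_a}\n\n"
    "[Assistant B's Answer]\n{response_b}\n"
)

def load_prompts(path: str) -> list[str]:
    with open(path) as f:
        return [JUDGE_TEMPLATE.format(**json.loads(line)) for line in f]

if __name__ == "__main__":
    prompts = load_prompts("src/example_inputs.jsonl")
    llm = LLM(model=MODEL, tensor_parallel_size=8)  # a 70B model needs several GPUs; adjust to your setup
    outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=1024))
    with open("example_outputs.jsonl", "w") as fout:
        for out in outputs:
            text = out.outputs[0].text
            # Assumed verdict format: the judgement ends with "[[A]]" or "[[B]]".
            match = re.search(r"\[\[(A|B)\]\]", text)
            fout.write(json.dumps({"judgement": text,
                                   "verdict": match.group(1) if match else None}) + "\n")
```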

### Reproducing the RewardBench evaluation score

1. Run `bash src/run_rewardbench.sh`.

2. Expected output:

```text
Chat Chat Hard Safety Reasoning
0.969 0.851 0.896 0.884

Final score: 90.014
```

## Synthetic Preference Data

The pre-processed training data for preference fine-tuning can be downloaded here: https://huggingface.co/datasets/facebook/Self-taught-evaluator-DPO-data

Below you can find instructions on how to replicate our data generation process.

### Generate worse response
1. Given pairs of (instruction, baseline response), prepare prompts using the template specified in `data/prompts/worse_response.prompt`.
2. Run generation on the prompts from step 1 to generate a "worse response" to the instruction (a prompt-preparation sketch follows this list).
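
As a sketch of the prompt-preparation step, assuming the template exposes `{instruction}` and `{baseline_response}` placeholders (check `data/prompts/worse_response.prompt` for the actual field names) and a hypothetical input file of (instruction, baseline response) pairs:

```python
# Sketch only: placeholder names and the input/output file names are assumptions.
import json

with open("data/prompts/worse_response.prompt") as f:
    template = f.read()

with open("instruction_baseline_pairs.jsonl") as f:      # hypothetical input file
    pairs = [json.loads(line) for line in f]

with open("worse_response_prompts.jsonl", "w") as fout:  # fed to the generation step
    for p in pairs:
        prompt = template.format(instruction=p["instruction"],
                                 baseline_response=p["baseline_response"])
        fout.write(json.dumps({"prompt": prompt}) + "\n")
```
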
### Generate judgement
1. Given tuples of (instruction, baseline response, worse response), we generate judgements using the prompt template specified in `data/prompts/eval_plan.prompt`. To avoid position bias, we generate evaluation plans for both orderings of the response positions. Specifically, for the `0_1` order we prepare the prompt using (instruction, baseline response, worse response), and for the `1_0` order we prepare the prompt using (instruction, worse response, baseline response).
2. Run generation on both `0_1` and `1_0` ordered prompts from step 1 to derive evaluation plans for pairwise preference.
3. Then we apply rejection sampling: we collect multiple evaluation-plan samples and only retain examples where the judgement prefers the baseline response over the worse response. To ensure label balance, we retain the same number of `A is better` and `B is better` examples (a sketch of this filtering follows this list).
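
The filtering in step 3 could look roughly like the sketch below. The field names, the `A`/`B` verdict labels, and the assumption that the baseline response appears as Assistant A in the `0_1` order (and as Assistant B in the `1_0` order) are illustrative; `src/prepare_dpo_data.py` implements the actual selection.

```python
# Sketch of rejection sampling plus label balancing; field names are assumptions,
# see src/prepare_dpo_data.py for the real implementation.
import json
import random

def prefers_baseline(example: dict) -> bool:
    # In the 0_1 order the baseline is Assistant A, so a correct judgement picks A;
    # in the 1_0 order the baseline is Assistant B, so a correct judgement picks B.
    expected = "A" if example["order"] == "0_1" else "B"
    return example["verdict"] == expected

with open("judgement_samples.jsonl") as f:           # hypothetical: several samples per pair
    samples = [json.loads(line) for line in f]

kept = [s for s in samples if prefers_baseline(s)]   # rejection sampling

a_better = [s for s in kept if s["verdict"] == "A"]
b_better = [s for s in kept if s["verdict"] == "B"]
n = min(len(a_better), len(b_better))                # enforce label balance
random.seed(0)
balanced = random.sample(a_better, n) + random.sample(b_better, n)

with open("balanced_judgements.jsonl", "w") as fout:
    for s in balanced:
        fout.write(json.dumps(s) + "\n")
```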

### Generation hyper-parameters

The experiments in the paper used vLLM for generation with temperature=0.7, top_p=0.9, and max_tokens=4096.
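
With vLLM, these settings correspond to the following `SamplingParams`; the snippet is a sketch of how they would be passed to a generator, not a pinned recipe.

```python
from vllm import SamplingParams

# Generation hyper-parameters stated above.
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=4096)
# e.g. outputs = llm.generate(prompts, sampling_params) with an LLM(...) instance
```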

### Prepare training data
After generating judgement samples (e.g. with vLLM), run `python src/prepare_sft_data.py` and `python src/prepare_dpo_data.py` to prepare the training data.

## Model training details

The model was trained using the preference optimization recipe from the open-source [fairseq2 library](https://github.com/facebookresearch/fairseq2). Training was executed on a SLURM-based cluster with a multi-node A100 setup: 3 nodes for the first-iteration SFT model and 8 nodes for the second-iteration DPO model that was released. Model selection is done via early stopping based on the pairwise judgement accuracy computed over the HelpSteer2 validation set.

**SFT training config and example run command**

Config: [sft_training.yaml](./training_configs/sft_training.yaml)

Run command (within SLURM allocation): `srun fairseq2 lm instruction_finetune ${SAVE_DIR} --config-file ./training_configs/sft_training.yaml`

**DPO training config and example run command**

Config: [dpo_training.yaml](./training_configs/dpo_training.yaml)

Run command (within SLURM allocation): `srun fairseq2 lm preference_finetune ${SAVE_DIR} --config-file ./training_configs/dpo_training.yaml`

## Citation
If you use data, models, or code from this work, please cite it with the following BibTeX entry:
```
@article{wang2024self,
title={Self-taught evaluators},
author={Wang, Tianlu and Kulikov, Ilia and Golovneva, Olga and Yu, Ping and Yuan, Weizhe and Dwivedi-Yu, Jane and Pang, Richard Yuanzhe and Fazel-Zarandi, Maryam and Weston, Jason and Li, Xian},
journal={arXiv preprint arXiv:2408.02666},
year={2024}
}
```