Commit 7ba9dfd: Update README.md (#8)
jaseweston authored Sep 26, 2024 (1 parent: 65b6833)

Showing 1 changed file with 10 additions and 10 deletions: projects/self_taught_evaluator/README.md

<p align="center"><img width="90%" src="figures/self_taught_dpo.png" /></p>

Instructions and materials presented here correspond to the [Self-Taught Evaluators](https://arxiv.org/abs/2408.02666) research paper.

# Self-Taught Evaluator model release

**2024-09-26**

We release the Self-Taught Evaluator model via the Hugging Face model repo: https://huggingface.co/facebook/Self-taught-evaluator-llama3.1-70B. This model is trained iteratively with supervised fine-tuning (SFT) and direct preference optimization (DPO).

## Inference and Evaluation

We provide example scripts that use the Self-Taught Evaluator as a judge to choose the better response from a pair, along with a set of scripts to reproduce the RewardBench evaluation scores for this model. Please refer to [src/requirements.txt](./src/requirements.txt) for the required packages.

> [!IMPORTANT]
> This model was trained to judge a pair of responses using the specific prompt format from the RewardBench benchmark. Make sure to adopt the same prompt format when you run the model on your data.

Note: download the example evaluation data here: https://dl.fbaipublicfiles.com/self_taught_evaluator/example_inputs.jsonl

1. Prepare your inputs similar to the ones found in [example_inputs.jsonl](https://dl.fbaipublicfiles.com/self_taught_evaluator/example_inputs.jsonl)

2. Run `bash run_inference_wvllm.sh`. The generated outputs and parsed judgements will be saved in `example_outputs.jsonl`.
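
The output schema is whatever `run_inference_wvllm.sh` writes; as a minimal sketch of how one might inspect the results, assuming each record in `example_outputs.jsonl` carries a `parsed_judgement` field (an illustrative name, not a documented one), one could tally the verdicts like this:

```python
import json
from collections import Counter

# Tally the parsed judgements produced by the inference script.
# NOTE: "parsed_judgement" is an assumed field name; check the records
# actually written by run_inference_wvllm.sh before relying on it.
counts = Counter()
with open("example_outputs.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        counts[record.get("parsed_judgement", "unparsed")] += 1

print(counts)  # e.g. Counter({'A': 57, 'B': 41, 'unparsed': 2})
```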

### Reproducing the RewardBench evaluation score

Note: download the evaluation data here: https://dl.fbaipublicfiles.com/self_taught_evaluator/rewardbench_inputs.jsonl

1. Run `bash src/run_rewardbench.sh`.
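
Conceptually, the per-category RewardBench score is the fraction of pairs where the judge's verdict matches the response the benchmark marks as preferred. The sketch below is a rough sanity check rather than the official scoring code, and it assumes hypothetical `category`, `parsed_judgement`, and `gold_preference` fields in a hypothetical `rewardbench_outputs.jsonl`:

```python
import json
from collections import defaultdict

# Per-category agreement between parsed judgements and gold labels.
# NOTE: the file name and field names are assumptions for illustration;
# the outputs emitted by src/run_rewardbench.sh may be organized differently.
correct, total = defaultdict(int), defaultdict(int)
with open("rewardbench_outputs.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        cat = record.get("category", "all")
        total[cat] += 1
        correct[cat] += int(record.get("parsed_judgement") == record.get("gold_preference"))

for cat in sorted(total):
    print(f"{cat}: {correct[cat] / total[cat]:.3f}")
```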


### Prepare training data

After generating samples of judgements (e.g., using vLLM), run `python src/prepare_sft_data.py` and `python src/prepare_dpo_data.py` to prepare the training data.
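
As a rough illustration of the DPO side of this step (not the actual logic of `src/prepare_dpo_data.py`), synthetic preference pairs can be formed by keeping, for each example, one sampled judgement whose verdict agrees with the known better response as the chosen completion and one that disagrees as the rejected completion. The field names in this sketch are assumptions:

```python
import json

def build_dpo_pairs(samples_path: str, out_path: str) -> None:
    """Sketch: turn sampled judgements into (chosen, rejected) training pairs.

    Assumes each input record has "prompt", "label" (the known better
    response, e.g. "A"), and "judgements", a list of dicts with "text"
    and a parsed "verdict". These names are illustrative only.
    """
    with open(samples_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            ex = json.loads(line)
            agree = [j for j in ex["judgements"] if j["verdict"] == ex["label"]]
            disagree = [j for j in ex["judgements"] if j["verdict"] != ex["label"]]
            if agree and disagree:  # keep only examples with both kinds of judgement
                fout.write(json.dumps({
                    "prompt": ex["prompt"],
                    "chosen": agree[0]["text"],
                    "rejected": disagree[0]["text"],
                }) + "\n")
```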

## Model training details

Models were trained with the preference optimization recipe from the open-source [fairseq2 library](https://github.com/facebookresearch/fairseq2). Training was executed on a SLURM-based cluster with a multi-node A100 setup: 3 nodes for the first-iteration SFT model and 8 nodes for the second-iteration DPO model that was released. Model selection was done via early stopping based on pairwise judgement accuracy computed over the HelpSteer2 validation set.
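
The early-stopping criterion mentioned above is just pairwise judgement accuracy: the fraction of validation preference pairs where the evaluator's verdict matches the human preference label. A minimal sketch of that metric, assuming verdicts have already been parsed into "A"/"B" strings:

```python
from typing import Sequence

def pairwise_judgement_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of pairs where the model's verdict matches the gold label."""
    if not gold:
        return 0.0
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Checkpoint selection keeps the checkpoint with the highest validation accuracy;
# extracting verdicts from raw model output is omitted here.
print(pairwise_judgement_accuracy(["A", "B", "A", "A"], ["A", "B", "B", "A"]))  # 0.75
```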

**SFT training config and example run command**

**DPO training config and example run command**

Config: [dpo_training.yaml](./training_configs/dpo_training.yaml)

Run command (within SLURM allocation): `srun fairseq2 lm preference_finetune ${SAVE_DIR} --config-file ./training_configs/dpo_training.yaml`

## Citation
If you use the data, model, or code from this work, please cite it with the following BibTeX entry:
```
@article{wang2024self,
title={Self-Taught Evaluators},
author={Wang, Tianlu and Kulikov, Ilia and Golovneva, Olga and Yu, Ping and Yuan, Weizhe and Dwivedi-Yu, Jane and Pang, Richard Yuanzhe and Fazel-Zarandi, Maryam and Weston, Jason and Li, Xian},
journal={arXiv preprint arXiv:2408.02666},
year={2024}
}
```
