# Self-Taught Evaluators

<p align="center"><img width="110%" src="figures/self_taught_sft.pdf" /></p>
<p align="center"><img width="90%" src="figures/self_taught_sft.png" /></p>

## Inference and Evaluation
Coming soon.

## Synthetic Preference Data
### Generate worse response
1. Given pairs of (instruction, baseline response), prepare prompts using the template specified in `data/prompts/worse_response.prompt`.
2. Run generation on the prompts from step 1 to generate a "worse response" to the instruction (see the sketch below).
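
A minimal sketch of these two steps, assuming the template exposes `{instruction}` and `{baseline_response}` placeholders and that generation is run with vLLM; the placeholder names, model name, and example data are illustrative assumptions, not taken from this repo.

```python
# Sketch: fill the worse-response template and sample completions with vLLM.
from vllm import LLM, SamplingParams

with open("data/prompts/worse_response.prompt") as f:
    template = f.read()

# Hypothetical (instruction, baseline response) pairs; replace with your data.
pairs = [
    {
        "instruction": "Explain overfitting in one paragraph.",
        "baseline_response": "Overfitting happens when a model memorizes ...",
    },
]

# Assumes the template uses {instruction} and {baseline_response} placeholders.
prompts = [template.format(**pair) for pair in pairs]

llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct")  # placeholder model name
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1024)

outputs = llm.generate(prompts, params)
worse_responses = [out.outputs[0].text.strip() for out in outputs]
```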
### Generate judgement
1. Given tuples of (instruction, baseline response, worse response), prepare prompts using the template specified in `data/prompts/eval_plan.prompt`.
2. Run generation on the prompts from step 1 to derive evaluation plans for pairwise preference. We then apply rejection sampling: collect multiple samples of the evaluation plan and retain only examples where the judgement prefers the baseline response over the worse response (see the sketch below).
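
A sketch of the rejection-sampling step, assuming each sampled evaluation plan ends with a verdict marker such as `[[A]]` or `[[B]]` and that response A is the baseline response; the marker format, sample count, and helper names are assumptions for illustration, not taken from this repo.

```python
# Sketch: rejection sampling over evaluation plans.
import re
from vllm import LLM, SamplingParams

N_PLANS = 8  # number of evaluation plans sampled per example (assumed value)

def verdict(plan: str) -> str | None:
    """Return the last [[A]]/[[B]] verdict found in a sampled evaluation plan."""
    found = re.findall(r"\[\[([AB])\]\]", plan)
    return found[-1] if found else None

def rejection_sample(llm: LLM, eval_prompt: str) -> list[str]:
    """Keep only plans whose judgement prefers the baseline (response A)."""
    params = SamplingParams(n=N_PLANS, temperature=0.7, top_p=0.9, max_tokens=2048)
    plans = llm.generate([eval_prompt], params)[0].outputs
    return [p.text for p in plans if verdict(p.text) == "A"]
```

In this sketch, examples for which no sampled plan prefers the baseline response would be discarded from the training data.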

The experiments in the paper used sampling with temperature=0.7 and top_p=0.9.

## Model Training