
unable to reproduce repllama performance #129

Closed
amy-hyunji opened this issue Jun 5, 2024 · 19 comments

Comments

@amy-hyunji

Hello, thanks for sharing this great work!

I tried training repllama myself with the repllama branch but failed to reproduce the reported numbers.
Could you check whether any of my hyperparameters are wrong? I've added the training script below.
I am currently running on 8 A100s, and the result I got is NDCG@10: 0.3959, NDCG@100: 0.4515.
When I download the released model from Hugging Face I get the numbers in the paper, so I assume the issue is with training rather than evaluation.

Thanks :)

deepspeed --master_port 40000 train.py \
  --deepspeed "ds_config.json" \
  --output_dir "model_repllama_lora_train.7b.re" \
  --model_name_or_path "meta-llama/Llama-2-7b-hf" \
  --save_steps 500 \
  --dataset_name "Tevatron/msmarco-passage" \
  --bf16 \
  --per_device_train_batch_size 16 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --train_n_passages 16 \
  --learning_rate 1e-4 \
  --q_max_len 32 \
  --p_max_len 196 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --dataset_proc_num 32 \
  --negatives_x_device \
  --warmup_steps 100

@MXueguang
Contributor

For reproducing repllama training, I'd suggest using the main branch with the latest code:

deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir retriever-mistral \
  --model_name_or_path mistralai/Mistral-7B-v0.1 \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 50 \
  --dataset_name Tevatron/msmarco-passage-aug \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --per_device_train_batch_size 8 \
  --gradient_checkpointing \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 156 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --gradient_accumulation_steps 4

You can either use Mistral as the initialization for slightly higher effectiveness, or use Llama-2 to reproduce the paper's results.

Note that Tevatron/msmarco-passage-aug is the training data that was used to train repllama.
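
For intuition on what --pooling eos, --normalize, --temperature, and --train_group_size control, below is a minimal sketch of the temperature-scaled in-batch contrastive loss that dense retrievers like repllama train with. This is an illustrative simplification, not Tevatron's exact code; the function name and tensor layout are assumptions.

import torch
import torch.nn.functional as F

def contrastive_loss(q_reps, p_reps, temperature=0.01, group_size=16):
    # q_reps: (num_queries, dim) query embeddings (e.g. EOS-token pooled)
    # p_reps: (num_queries * group_size, dim) passage embeddings; each query's
    #         group holds 1 positive followed by group_size - 1 hard negatives
    q_reps = F.normalize(q_reps, dim=-1)      # --normalize: cosine similarity
    p_reps = F.normalize(p_reps, dim=-1)
    scores = q_reps @ p_reps.T / temperature  # --temperature 0.01 sharpens the softmax
    # the positive for query i sits at column i * group_size; every other
    # column acts as an in-batch negative
    targets = torch.arange(q_reps.size(0), device=scores.device) * group_size
    return F.cross_entropy(scores, targets)

With a temperature as small as 0.01 the logits are scaled up substantially, which may be part of why the batch composition (discussed later in this thread) matters for training stability.
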

@amy-hyunji
Author

Hi,

Thank you for the reply!
I tried training with the script myself but still had trouble reproducing the results.
I am currently training on 8 A100 80GB GPUs, so I changed per_device_train_batch_size and gradient_accumulation_steps for faster training. Would this cause a problem? I kept # of GPUs * per_device_train_batch_size * gradient_accumulation_steps equal to the reported total of 128.
I've added the script I used below!

deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir retriever-mistral \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 50 \
  --dataset_name Tevatron/msmarco-passage-aug \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --per_device_train_batch_size 16 \
  --gradient_checkpointing \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 156 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --gradient_accumulation_steps 1

@MXueguang
Contributor

I would suggest keeping the accumulation steps the same as before.
If I remember correctly, @ArvinZhuang previously found that fine-tuning with a 128 batch size directly can give unstable loss and may need further tuning of the temperature.
@ArvinZhuang, correct me if I am wrong.

@ArvinZhuang
Contributor

Hi @MXueguang @amy-hyunji, yes, I tried a training config similar to @amy-hyunji's, with gradient_accumulation_steps set to 1, and it did not work well... Setting gradient_accumulation_steps back to 4 reproduces the results.

Note that gradient_accumulation_steps affects the number of in-batch negatives per training example.
For example, with per_device_train_batch_size 8 and gradient_accumulation_steps 4 on 4 GPUs, the total batch size is 128 and the number of negatives per example is 4 * 8 * 16 = 512.
With per_device_train_batch_size 32 and gradient_accumulation_steps 1 on 4 GPUs, the batch size is also 128, but the number of negatives becomes 4 * 32 * 16 = 2048 (@MXueguang correct me if I'm wrong).

This is odd to me as well, since the experience from the literature is that more negatives is better...
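
To make the counting above concrete, here is a small worked sketch (assuming, as in the discussion above, that in-batch negatives are shared only within each micro-batch, not across accumulation steps):

def in_batch_stats(num_gpus, per_device_batch_size, grad_accum_steps, train_group_size=16):
    # queries contributing to one optimizer step
    effective_batch_size = num_gpus * per_device_batch_size * grad_accum_steps
    # passages each query is scored against within one micro-batch
    # (its own group plus every other query's group; other queries' positives
    # count as negatives here, matching the rough counting above)
    passages_per_query = num_gpus * per_device_batch_size * train_group_size
    return effective_batch_size, passages_per_query

print(in_batch_stats(4, 8, 4))   # (128, 512)  -> suggested config, reproduces the results
print(in_batch_stats(4, 32, 1))  # (128, 2048) -> same total batch size, many more negatives
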

@riyajatar37003

riyajatar37003 commented Jun 18, 2024

Traceback (most recent call last):
  File "/tmp/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3342, in save_model
    self._save(output_dir, state_dict=state_dict)
  File "/tmp/.local/lib/python3.10/site-packages/tevatron/retriever/trainer.py", line 31, in _save
    raise ValueError(f"Unsupported model class {self.model}")
ValueError: Unsupported model class DenseModel(

Why am I getting this error while saving a checkpoint?

@ArvinZhuang
Contributor

ArvinZhuang commented Jun 18, 2024

Hi @riyajatar37003, is your DenseModel a subclass of EncoderModel (i.e. class DenseModel(EncoderModel):)? The save logic here only supports Tevatron's EncoderModel; if DenseModel is your own implementation, you may need to add it to the supported model list.
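
For illustration, the trainer's _save only saves models that inherit from Tevatron's EncoderModel, which is why the ValueError above fires for a custom class. The sketch below shows the general idea under that assumption: MyDenseModel is a hypothetical custom model, the import path is assumed from the main-branch layout, and the guard is a paraphrase of the check in tevatron/retriever/trainer.py, not its exact code.

from tevatron.retriever.modeling import EncoderModel

class MyDenseModel(EncoderModel):
    # Inheriting from EncoderModel means the trainer's isinstance check passes
    # and the model exposes the save() behaviour the trainer relies on.
    pass

# Paraphrased guard inside the trainer's _save:
#     if not isinstance(self.model, (EncoderModel,)):
#         raise ValueError(f"Unsupported model class {self.model}")
#     self.model.save(output_dir)
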

@riyajatar37003

What is meant by train_group_size, and why is it important?

@orionw

orionw commented Jun 25, 2024

@MXueguang I also had a similar experience failing to reproduce with that script. Using your suggested config above I get 72.95 nDCG@10 on DL19 and 70.6 nDCG@10 on DL20 (compared to the paper's 74.3 and 72.1). I used the gradient step parameters (batch size of 8, gradient accumulation of 4) suggested by @ArvinZhuang.

Are you able to reproduce it with that code, or did the recent updates to Tevatron change some parameter that lowers performance? If I wanted to reproduce exactly, do you have the command that works with the November codebase?

@ArvinZhuang
Contributor

Hi @orionw, which base LLM were you using? Llama-2?

@orionw

orionw commented Jun 25, 2024

Yes, llama2

Full config (if helpful):

#!/bin/bash
deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir retriever-llama2 \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 200 \
  --dataset_name Tevatron/msmarco-passage-aug \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --per_device_train_batch_size 8 \
  --gradient_checkpointing \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 196 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --warmup_steps 100 \
  --gradient_accumulation_steps 4

@MXueguang
Contributor

Do you have MS MARCO dev set results for the checkpoint you got? Do they match?

@orionw

orionw commented Jun 25, 2024

I didn't evaluate on dev; I stopped at these two for compute reasons.

@MXueguang
Contributor

I see. I'll schedule a training run to see how it goes on my end...

@orionw

orionw commented Jun 25, 2024

Thanks for looking into it!

@MXueguang
Contributor

One difference I observed is the lora_r parameter: it was set to 32 in the original experiment, while the current default is 8. I am checking whether this affects the TREC DL results.
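
For reference, a rank-32 LoRA setup corresponds roughly to the peft configuration sketched below. This is an illustration of the parameter being discussed, not Tevatron's exact setup: the lora_alpha and lora_dropout values are assumptions, and whether the training driver exposes the rank as a flag such as --lora_r should be confirmed against the current CLI.

from peft import LoraConfig

# Illustrative LoRA config with rank 32 instead of the default 8.
# lora_alpha and lora_dropout here are assumed values, not the repo's defaults.
lora_config = LoraConfig(
    r=32,  # the lora_r being discussed above
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "down_proj", "up_proj", "gate_proj"],
    task_type="FEATURE_EXTRACTION",
)
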

@orionw

orionw commented Jun 27, 2024

Hmm, could be @MXueguang! Let me know if that fixes it!

I was also curious about potential discrepancies in the number of GPUs/batch size. I don't know the exact command you used, but your paper says it was trained with 16 V100 GPUs. Perhaps how the batch is distributed across devices makes a difference (like @ArvinZhuang was saying).

It could also be related to the cross-device negatives/group size (did that logic change in the recent version?). Unfortunately, I don't have access to a node with 16 GPUs to test it on.

@MXueguang
Contributor

Yeah @orionw, I am running it, but I still need a day or so to get a number due to limited compute...
In the original 16 V100 setup, --per_device_train_batch_size was set to 2, so the effective batch size should be equivalent here, and so should the cross-device negatives.
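
A quick check of that equivalence under the counting used earlier in this thread (gradient_accumulation_steps 4 is assumed for the original setup, since that is what a 128 total batch implies):

# original: 16 V100s, per-device batch 2, assumed grad accumulation 4, group size 16
print(16 * 2 * 4, 16 * 2 * 16)  # 128 total batch size, 512 in-batch passages per query
# suggested: 4 GPUs, per-device batch 8, grad accumulation 4, group size 16
print(4 * 8 * 4, 4 * 8 * 16)    # 128 total batch size, 512 in-batch passages per query
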

@MXueguang
Contributor

Hi @orionw, my reproduction with lora_r=32, keeping everything else the same, gives:
dev MRR@10: 41.6
DL19 nDCG@10: 74.6
DL20 nDCG@10: 71.4

Dev/DL19 are a bit higher and DL20 is a bit lower than the original experiments.

@orionw

orionw commented Jul 3, 2024

Awesome, thank you so much @MXueguang! Those differences could easily be due to random seeds. Really appreciate you looking into it :)
