
unable to reproduce repllama performance #129

Closed
amy-hyunji opened this issue Jun 5, 2024 · 19 comments

Comments

@amy-hyunji

Hello, thanks for sharing this great work!

I tried training repllama myself with the repllama branch but failed to reproduce the reported numbers.
Could you check whether any of my hyperparameters are wrong? I've added the training script below.
I am currently running on 8 A100s, and the result I got is NDCG@10: 0.3959, NDCG@100: 0.4515.
When I download the released model from Hugging Face I get the numbers in the paper, so I assume the issue is with training rather than evaluation.

Thanks :)

deepspeed --master_port 40000 train.py \
  --deepspeed "ds_config.json" \
  --output_dir "model_repllama_lora_train.7b.re" \
  --model_name_or_path "meta-llama/Llama-2-7b-hf" \
  --save_steps 500 \
  --dataset_name "Tevatron/msmarco-passage" \
  --bf16 \
  --per_device_train_batch_size 16 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --train_n_passages 16 \
  --learning_rate 1e-4 \
  --q_max_len 32 \
  --p_max_len 196 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --dataset_proc_num 32 \
  --negatives_x_device \
  --warmup_steps 100

@MXueguang
Contributor

For reproducing repllama training, I'd suggest using the main branch with the latest code:

deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir retriever-mistral \
  --model_name_or_path mistralai/Mistral-7B-v0.1 \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 50 \
  --dataset_name Tevatron/msmarco-passage-aug \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --per_device_train_batch_size 8 \
  --gradient_checkpointing \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 156 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --gradient_accumulation_steps 4

You can either use Mistral as the initialization for slightly higher effectiveness, or use Llama-2 to reproduce the paper's results.

Note that Tevatron/msmarco-passage-aug is the training data that was used to train repllama.
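
For intuition on what --pooling eos, --normalize, --temperature, and --train_group_size control, below is a minimal sketch of the temperature-scaled in-batch contrastive loss that dense retrievers like repllama train with. This is an illustrative simplification, not Tevatron's exact code; the function name and tensor layout are assumptions.

import torch
import torch.nn.functional as F

def contrastive_loss(q_reps, p_reps, temperature=0.01, group_size=16):
    # q_reps: (num_queries, dim) query embeddings (e.g. EOS-token pooled)
    # p_reps: (num_queries * group_size, dim) passage embeddings; each query's
    #         group holds 1 positive followed by group_size - 1 hard negatives
    q_reps = F.normalize(q_reps, dim=-1)      # --normalize: cosine similarity
    p_reps = F.normalize(p_reps, dim=-1)
    scores = q_reps @ p_reps.T / temperature  # --temperature 0.01 sharpens the softmax
    # the positive for query i sits at column i * group_size; every other
    # column acts as an in-batch negative
    targets = torch.arange(q_reps.size(0), device=scores.device) * group_size
    return F.cross_entropy(scores, targets)

With a temperature as small as 0.01 the logits are scaled up substantially, which may be part of why the batch composition (discussed later in this thread) matters for training stability.
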

@amy-hyunji
Author

Hi,

Thank you for the reply!
I tried training with the script myself but still had trouble reproducing the results.
I am currently training on 8 A100 80GB GPUs, so I changed per_device_train_batch_size and gradient_accumulation_steps for faster training. Would this cause a problem? I kept # of GPUs * per_device_train_batch_size * gradient_accumulation_steps equal to the reported total of 128.
I've added the script I used below!

deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir retriever-mistral \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 50 \
  --dataset_name Tevatron/msmarco-passage-aug \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --per_device_train_batch_size 16 \
  --gradient_checkpointing \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 156 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --gradient_accumulation_steps 1

@MXueguang
Contributor

I would suggest keeping the accumulation steps the same as before.
If I remember correctly, @ArvinZhuang previously found that fine-tuning with a 128 batch size directly can give unstable loss and may need further tuning of the temperature.
@ArvinZhuang, correct me if I am wrong.

@ArvinZhuang
Contributor

Hi @MXueguang @amy-hyunji, yes, I tried a training config similar to @amy-hyunji's, with gradient_accumulation_steps set to 1, and it did not work well... Setting gradient_accumulation_steps back to 4 reproduces the results.

Note that gradient_accumulation_steps affects the number of in-batch negatives per training example.
For example, with per_device_train_batch_size 8 and gradient_accumulation_steps 4 on 4 GPUs, the total batch size is 128 and the number of negatives per example is 4 * 8 * 16 = 512.
With per_device_train_batch_size 32 and gradient_accumulation_steps 1 on 4 GPUs, the batch size is also 128, but the number of negatives becomes 4 * 32 * 16 = 2048 (@MXueguang correct me if I'm wrong).

This is odd to me as well, since the experience from the literature is that more negatives is better...
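
To make the counting above concrete, here is a small worked sketch (assuming, as in the discussion above, that in-batch negatives are shared only within each micro-batch, not across accumulation steps):

def in_batch_stats(num_gpus, per_device_batch_size, grad_accum_steps, train_group_size=16):
    # queries contributing to one optimizer step
    effective_batch_size = num_gpus * per_device_batch_size * grad_accum_steps
    # passages each query is scored against within one micro-batch
    # (its own group plus every other query's group; other queries' positives
    # count as negatives here, matching the rough counting above)
    passages_per_query = num_gpus * per_device_batch_size * train_group_size
    return effective_batch_size, passages_per_query

print(in_batch_stats(4, 8, 4))   # (128, 512)  -> suggested config, reproduces the results
print(in_batch_stats(4, 32, 1))  # (128, 2048) -> same total batch size, many more negatives
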

@riyajatar37003

riyajatar37003 commented Jun 18, 2024

Traceback (most recent call last):
  File "/tmp/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3342, in save_model
    self._save(output_dir, state_dict=state_dict)
  File "/tmp/.local/lib/python3.10/site-packages/tevatron/retriever/trainer.py", line 31, in _save
    raise ValueError(f"Unsupported model class {self.model}")
ValueError: Unsupported model class DenseModel(

Why am I getting this error while saving a checkpoint?

@ArvinZhuang
Contributor

ArvinZhuang commented Jun 18, 2024

Hi @riyajatar37003, is your DenseModel a subclass of EncoderModel (i.e. class DenseModel(EncoderModel):)? The save logic here only supports Tevatron's EncoderModel; if DenseModel is your own implementation, you may need to add it to the supported model list.
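
For illustration, the trainer's _save only saves models that inherit from Tevatron's EncoderModel, which is why the ValueError above fires for a custom class. The sketch below shows the general idea under that assumption: MyDenseModel is a hypothetical custom model, the import path is assumed from the main-branch layout, and the guard is a paraphrase of the check in tevatron/retriever/trainer.py, not its exact code.

from tevatron.retriever.modeling import EncoderModel

class MyDenseModel(EncoderModel):
    # Inheriting from EncoderModel means the trainer's isinstance check passes
    # and the model exposes the save() behaviour the trainer relies on.
    pass

# Paraphrased guard inside the trainer's _save:
#     if not isinstance(self.model, (EncoderModel,)):
#         raise ValueError(f"Unsupported model class {self.model}")
#     self.model.save(output_dir)
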

@riyajatar37003

What is meant by train_group_size, and why is it important?

@orionw

orionw commented Jun 25, 2024

@MXueguang I also had a similar experience failing to reproduce with that script. Using your suggested config above I get 72.95 nDCG@10 on DL19 and 70.6 nDCG@10 on DL20 (compared to the paper's 74.3 and 72.1). I used the gradient step parameters (batch size of 8, gradient accumulation of 4) suggested by @ArvinZhuang.

Are you able to reproduce it with that code, or did the recent updates to Tevatron change some parameter that lowers performance? If I wanted to reproduce exactly, do you have the command that works with the November codebase?

@ArvinZhuang
Contributor

Hi @orionw, which base LLM were you using? Llama-2?

@orionw

orionw commented Jun 25, 2024

Yes, llama2

Full config (if helpful):

#!/bin/bash
deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir retriever-llama2 \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 200 \
  --dataset_name Tevatron/msmarco-passage-aug \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --per_device_train_batch_size 8 \
  --gradient_checkpointing \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 196 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --warmup_steps 100 \
  --gradient_accumulation_steps 4

@MXueguang
Contributor

Do you have MS MARCO dev set results for the checkpoint you got? Do they match?

@orionw

orionw commented Jun 25, 2024

I didn't evaluate on dev; I stopped at these two for compute reasons.

@MXueguang
Contributor

I see. I'll schedule a training run to see how it goes on my end...

@orionw

orionw commented Jun 25, 2024

Thanks for looking into it!

@MXueguang
Contributor

One difference I observed is the lora_r parameter: it was set to 32 in the original experiment, while the current default is 8. I am checking whether this affects the TREC DL results.
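
For reference, a rank-32 LoRA setup corresponds roughly to the peft configuration sketched below. This is an illustration of the parameter being discussed, not Tevatron's exact setup: the lora_alpha and lora_dropout values are assumptions, and whether the training driver exposes the rank as a flag such as --lora_r should be confirmed against the current CLI.

from peft import LoraConfig

# Illustrative LoRA config with rank 32 instead of the default 8.
# lora_alpha and lora_dropout here are assumed values, not the repo's defaults.
lora_config = LoraConfig(
    r=32,  # the lora_r being discussed above
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "down_proj", "up_proj", "gate_proj"],
    task_type="FEATURE_EXTRACTION",
)
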

@orionw

orionw commented Jun 27, 2024

Hmm, could be @MXueguang! Let me know if that fixes it!

I was also curious about potential discrepancies in the number of GPUs/batch size. I don't know the exact command you used, but your paper says it was trained with 16 V100 GPUs. Perhaps how the batch is distributed across devices makes a difference (like @ArvinZhuang was saying).

It could also be related to the cross-device negatives/group size (did that logic change in the recent version?). Unfortunately, I don't have access to a node with 16 GPUs to test it on.

@MXueguang
Contributor

Yeah @orionw, I am running it, but I still need a day or so to get a number due to limited compute...
In the original 16 V100 setup, --per_device_train_batch_size was set to 2, so the effective batch size should be equivalent here, and so should the cross-device negatives.
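
A quick check of that equivalence under the counting used earlier in this thread (gradient_accumulation_steps 4 is assumed for the original setup, since that is what a 128 total batch implies):

# original: 16 V100s, per-device batch 2, assumed grad accumulation 4, group size 16
print(16 * 2 * 4, 16 * 2 * 16)  # 128 total batch size, 512 in-batch passages per query
# suggested: 4 GPUs, per-device batch 8, grad accumulation 4, group size 16
print(4 * 8 * 4, 4 * 8 * 16)    # 128 total batch size, 512 in-batch passages per query
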

@MXueguang
Contributor

Hi @orionw, my reproduction with lora_r=32, keeping everything else the same, gives:
dev MRR@10: 41.6
DL19 nDCG@10: 74.6
DL20 nDCG@10: 71.4

Dev/DL19 are a bit higher and DL20 is a bit lower than the original experiments.

@orionw

orionw commented Jul 3, 2024

Awesome, thank you so much @MXueguang! Those differences could easily be due to random seeds. Really appreciate you looking into it :)
