Reproducing StackLLaMA #401

Closed
mnoukhov opened this issue Jun 1, 2023 · 19 comments
@mnoukhov
Contributor

mnoukhov commented Jun 1, 2023

I've reproduced the whole StackLLaMA pipeline using the changes in #398 #399 #400

Here is the corresponding wandb report

A couple notes:

  • As my base llama I used huggyllama/llama-7b
  • My supervised fine-tuning run was better than in the blog post, reaching a lower ppl
  • My reward modelling run was worse than the blog post's (67%): it only reached 63% after one epoch. So I ran it for two epochs and got ~66%, which I felt was sufficient
  • The RL training curves look very similar. I found I could achieve similar performance with a lower KL coefficient (0.02) in less training time (600 epochs vs 1200), but I also kept the run with the original KL coefficient (0.2)

I've also published my adapter weights on the hub
https://huggingface.co/mnoukhov/llama-7b-se-peft
https://huggingface.co/mnoukhov/llama-7b-se-rm-peft
https://huggingface.co/mnoukhov/llama-7b-se-rl-peft

Use the merge_peft script in #398 to merge huggyllama/llama-7b and llama-7b-se-peft into llama-7b-se.
Then merge llama-7b-se with llama-7b-se-rm-peft to make the reward model, and with llama-7b-se-rl-peft to make StackLLaMA.
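
Roughly, each merge boils down to loading the base model, applying the adapter, and folding the LoRA weights in. A sketch below (paraphrasing the merge_peft script in #398; the output paths and dtype are my own assumptions):

```python
# Rough sketch of the adapter merges described above (paraphrasing the
# merge_peft script in #398; output paths and dtype are assumptions).
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# 1) huggyllama/llama-7b + llama-7b-se-peft -> llama-7b-se
base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "mnoukhov/llama-7b-se-peft")
model = model.merge_and_unload()  # fold the LoRA weights into the base model
model.save_pretrained("llama-7b-se")

# 2) llama-7b-se + llama-7b-se-rm-peft -> reward model
#    (same pattern, but load the base as a sequence-classification model with num_labels=1)
# 3) llama-7b-se + llama-7b-se-rl-peft -> StackLLaMA (same pattern as step 1)
```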

@younesbelkada
Contributor

Amazing work @mnoukhov !!
Will review the PRs asap. As a side note, it seems that I can't see the figures in the wandb report :/
(screenshot from 2023-06-02 showing the wandb report panels not loading)
Also, could you confirm which versions of the libraries you used?
Thanks a lot!

@mnoukhov
Contributor Author

mnoukhov commented Jun 4, 2023

Sorry, I moved the runs to the workspace so the graphs should be fixed.

My libraries are

accelerate==0.18.0
evaluate==0.4.0
huggingface-hub==0.13.3
torch==2.0.0
transformers==4.28.1

and the latest version of trl built from source

@dh2shin

dh2shin commented Jun 24, 2023

Hi @mnoukhov , could you explain when/what exactly to merge? I'm following the readme and would really appreciate your help. Specifically, when you say to merge huggyllama/llama-7b and llama-7b-se-peft to create llama-7b-se, is llama-7b-se-peft referring to the model outputted after running Step 1 (with huggyllama/llama-7b)?
And when you say then to merge llama-7b-se with llama-7b-se-rm-peft to create the reward model, does llama-7b-se-rm-peft refer to the model outputted after running Step 2 (with llama-7b-se)?

@mnoukhov
Contributor Author

That's correct.

base huggyllama + llama-7b-se-peft = llama-7b-se
base llama-7b-se + llama-7b-se-rm-peft = llama-7b-se-rm
base llama-7b-se + llama-7b-se-rl-peft = llama-7b-se-rl

@dh2shin

dh2shin commented Jun 26, 2023

Do you mind sharing the arguments / shell script you used for each step? I'm using what's listed in the repo and I'm running into memory issues, which seems odd given PEFT + LoRA.

@mnoukhov
Contributor Author

I use the same hyperparameters as those listed, with the slight change that I'm running on 4 GPUs instead of 8, so I change gradient accumulation steps from 4 to 8. I find that I need ~40GB of GPU memory for the RL fine-tuning step, but usage fluctuates and can get as high as 60GB per GPU. I'm currently working on #436, which should reduce memory requirements enough to allow training on 32GB GPUs.
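
(For intuition, that change just keeps the effective batch size constant; a minimal check below, where per_device_batch_size is a placeholder rather than a value from the scripts.)

```python
# Minimal check that halving the GPU count while doubling gradient accumulation
# keeps the effective batch size the same. per_device_batch_size is a
# placeholder, not a value from the StackLLaMA scripts.
per_device_batch_size = 4  # hypothetical

effective_8_gpus = 8 * per_device_batch_size * 4  # 8 GPUs, grad accumulation 4
effective_4_gpus = 4 * per_device_batch_size * 8  # 4 GPUs, grad accumulation 8

assert effective_8_gpus == effective_4_gpus
```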

@dh2shin

dh2shin commented Jun 27, 2023

I'm seeing similar behavior. About ~40GB of GPU memory is also needed for the second step (training the reward model), right?

@mnoukhov
Contributor Author

You can check the exact memory and compute usage by looking at the runs linked in the wandb report (e.g. my RLHF run shows memory consumption in the "System" charts).

Given that I've essentially repro'd the results and my PRs have been merged, I'm closing this issue and continuing in #471 with a repro using the more compute-efficient multi-adapter paradigm. Feel free to keep commenting about the repro and I'll try to respond.

@dh2shin

dh2shin commented Jun 29, 2023

Hi Michael, continuing the conversation from #401 here. When I try to run the supervised finetuning script out of the box, I get the following warning messages:
```
Training...
Using pad_token, but it is not set yet.
UserWarning: You passed `packing=True` to the SFTTrainer, and you are training your model with `max_steps` strategy. The dataset will be iterated until the `max_steps` are reached.
  warnings.warn(
FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
Token indices sequence length is longer than the specified maximum sequence length for this model (4899 > 2048). Running this sequence through the model will result in indexing errors
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
```

I'm getting a sudden spike in train loss around 800 global steps in, and I'm wondering if these warning messages have anything to do with it. Any ideas?

@mnoukhov
Contributor Author

None of the messages are related. There are other issues about instability in training, and without more info it's hard to diagnose the problem. If you want advice specific to your situation, it would help to know things like reward, KL, etc.

If you just want to try some things out, #462 found that setting a larger mini-batch size and a larger target KL could help.
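
For illustration, these are the kinds of knobs involved, using trl's PPOConfig from around that time; the specific values below are placeholders, not the settings from #462 or from this repro:

```python
# Illustrative sketch only: parameter names follow trl's PPOConfig of that era,
# but the values are placeholders, not the settings used in #462 or this repro.
from trl import PPOConfig

config = PPOConfig(
    model_name="llama-7b-se",       # merged SFT model from the steps above
    learning_rate=1.4e-5,
    batch_size=32,
    mini_batch_size=8,              # larger mini-batch than the default of 1
    gradient_accumulation_steps=4,
    ppo_epochs=4,
    init_kl_coef=0.02,              # lower KL coefficient, as in this repro
    adap_kl_ctrl=True,              # adaptive KL controller
    target=10.0,                    # larger adaptive-KL target
    seed=0,
)
```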

@dh2shin

dh2shin commented Jul 5, 2023

@mnoukhov Hi Michael, when using the reward model to build the sentiment pipeline in rl_training.py, the output rewards/scores vary drastically depending on the batch_size used in sent_kwargs. I'm wondering if you have investigated this issue in more depth.

I'm also wondering whether, in reward_modeling.py, the padding strategy should be True or "max_length". Currently, I get the following warning message:
UserWarning: max_length is ignored when padding=True and there is no truncation strategy. To pad to max length, use padding='max_length'.
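
Based on the warning's own suggestion, I'm considering something like the following (my own guess, not a confirmed fix): pad to a fixed max_length so the scores don't depend on how long the other items in the batch happen to be.

```python
# Guess, not a confirmed fix: follow the warning's suggestion and pad to a
# fixed length so reward scores don't depend on batch composition.
# The keys roughly mirror the sent_kwargs in rl_training.py; batch_size and
# max_length here are placeholder values.
sent_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",
    "batch_size": 16,            # placeholder
    "truncation": True,
    "max_length": 512,           # placeholder; match the training max length
    "padding": "max_length",     # instead of padding=True (pad-to-longest)
}
```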

@wangzhao88

wangzhao88 commented Aug 3, 2023

Hi, I used your RL model (https://huggingface.co/mnoukhov/llama-7b-se-rl-peft) to test the SuperGLUE benchmark using the lm-evaluation-harness. The results are as follows:

Original Model Results:

| Task    | Version | Metric | Value  |   | Stderr |
|---------|---------|--------|--------|---|--------|
| boolq   | 1       | acc    | 0.7642 | ± | 0.0074 |
| cb      | 1       | acc    | 0.5536 | ± | 0.0670 |
|         |         | f1     | 0.4248 |   |        |
| copa    | 0       | acc    | 0.8800 | ± | 0.0500 |
| multirc | 1       | acc    | 0.0084 | ± | 0.0018 |
| record  | 0       | f1     | 0.9119 | ± | 0.0032 |
|         |         | em     | 0.9044 | ± | 0.0032 |
| rte     | 0       | acc    | 0.6282 | ± | 0.0301 |
| wic     | 0       | acc    | 0.4953 | ± | 0.0198 |
| wsc     | 0       | acc    | 0.5673 | ± | 0.0474 |

However, when I trained my own RL model using the following command:

```
accelerate launch --multi_gpu --num_machines 1 --num_processes 8 rl_training.py --model_name=【mnoukhov se model】 --reward_model_name=【mnoukhov rm model】 --adafactor=False --tokenizer_name=【mnoukhov se model】 --save_freq=100 --output_max_length=128 --batch_size=8 --gradient_accumulation_steps=8 --batched_gen=True --ppo_epochs=4 --seed=0 --learning_rate=1.4e-5 --early_stopping=True --output_dir=llama-se-rl-finetune-128-8-8-1.4e-5_adam
```

After one day of training, the result of my own RL model (llama-se-rl-finetune-128-8-8-1.4e-5_adam, trained up to step 1300) is as follows:

Trained Model Results:

| Task    | Version | Metric | Value  |   | Stderr |
|---------|---------|--------|--------|---|--------|
| boolq   | 1       | acc    | 0.3783 | ± | 0.0085 |
| cb      | 1       | acc    | 0.4107 | ± | 0.0663 |
|         |         | f1     | 0.1941 |   |        |
| copa    | 0       | acc    | 0.5500 | ± | 0.0500 |
| multirc | 1       | acc    | 0.0031 | ± | 0.0018 |
| record  | 0       | f1     | 0.1186 | ± | 0.0032 |
|         |         | em     | 0.1151 | ± | 0.0032 |
| rte     | 0       | acc    | 0.5271 | ± | 0.0301 |
| wic     | 0       | acc    | 0.5000 | ± | 0.0198 |
| wsc     | 0       | acc    | 0.6346 | ± | 0.0474 |

It appears that the training of the model did not achieve the desired performance.


@lvwerra
Member

lvwerra commented Aug 3, 2023

Can you share the logs from the RL training? E.g. mean rewards and objective/kl are usually helpful metrics to look at to see if the model learned something.

@wangzhao88

wangzhao88 commented Aug 3, 2023

> Can you share the logs from the RL training? E.g. mean rewards and objective/kl are usually helpful metrics to look at to see if the model learned something.

Hi, here are the logs: https://wandb.ai/630191510/trl/runs/eb02d7zh?workspace=user-630191510

@lvwerra
Member

lvwerra commented Aug 3, 2023

Looks like there was an issue at step ~50: the reward went down significantly. Could you try a different seed or a lower learning rate? Also, we added some stability measures in the latest release, so try updating trl.

@wangzhao88

Hello! Here is the latest log: https://wandb.ai/630191510/trl/runs/kj2kkbq9?workspace=user-630191510

The loss curve looks normal, and the accuracy in SuperGLUE is also normal.

I suspect the key difference between the two branches is in ppo_trainer.py.

@lvwerra
Member

lvwerra commented Aug 7, 2023

That's great! So updating helped?

@wangzhao88

wangzhao88 commented Aug 8, 2023

Can you tell me the differences between branch d78d917 and branch e448bb6 in terms of PPO training?
Their ppo_trainer.py files are very similar.

@zhangfudiyi

Great work!
