Reproducing StackLLaMA #401

Closed
mnoukhov opened this issue Jun 1, 2023 · 19 comments
@mnoukhov
Contributor

mnoukhov commented Jun 1, 2023

I've reproduced the whole StackLLaMA pipeline using the changes in #398 #399 #400

Here is the corresponding wandb report

A couple notes:

  • As my base llama I used huggyllama/llama-7b
  • My supervised fine-tuning run was better than in the blog post, reaching a lower ppl
  • My reward modelling run was worse than the blog post's (67%): it only reached 63% after one epoch. So I ran it for two epochs and got ~66%, which I felt was sufficient
  • The RL training curves look very similar. I found I could achieve similar performance with a lower KL coefficient (0.02) in less training time (600 epochs vs 1200), but I also kept the run with the original KL coefficient (0.2)

I've also published my adapter weights on the hub
https://huggingface.co/mnoukhov/llama-7b-se-peft
https://huggingface.co/mnoukhov/llama-7b-se-rm-peft
https://huggingface.co/mnoukhov/llama-7b-se-rl-peft

Use the merge_peft script in #398 to merge huggyllama/llama-7b and llama-7b-se-peft into llama-7b-se.
Then merge llama-7b-se with llama-7b-se-rm-peft to make the reward model, and with llama-7b-se-rl-peft to make StackLLaMA.
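
Roughly, each merge boils down to loading the base model, applying the adapter, and folding the LoRA weights in. A sketch below (paraphrasing the merge_peft script in #398; the output paths and dtype are my own assumptions):

```python
# Rough sketch of the adapter merges described above (paraphrasing the
# merge_peft script in #398; output paths and dtype are assumptions).
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# 1) huggyllama/llama-7b + llama-7b-se-peft -> llama-7b-se
base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "mnoukhov/llama-7b-se-peft")
model = model.merge_and_unload()  # fold the LoRA weights into the base model
model.save_pretrained("llama-7b-se")

# 2) llama-7b-se + llama-7b-se-rm-peft -> reward model
#    (same pattern, but load the base as a sequence-classification model with num_labels=1)
# 3) llama-7b-se + llama-7b-se-rl-peft -> StackLLaMA (same pattern as step 1)
```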

@younesbelkada
Contributor

Amazing work @mnoukhov !!
Will review the PRs asap. As a side note, it seems that I can't see the figures in the wandb report :/
(screenshot from 2023-06-02 showing the wandb report panels not loading)
Also, could you confirm which versions of the libraries you used?
Thanks a lot!

@mnoukhov
Contributor Author

mnoukhov commented Jun 4, 2023

Sorry, I moved the runs to the workspace so the graphs should be fixed.

My libraries are

accelerate==0.18.0
evaluate==0.4.0
huggingface-hub==0.13.3
torch==2.0.0
transformers==4.28.1

and the latest version of trl built from source

@dh2shin

dh2shin commented Jun 24, 2023

Hi @mnoukhov , could you explain when/what exactly to merge? I'm following the readme and would really appreciate your help. Specifically, when you say to merge huggyllama/llama-7b and llama-7b-se-peft to create llama-7b-se, is llama-7b-se-peft referring to the model outputted after running Step 1 (with huggyllama/llama-7b)?
And when you say then to merge llama-7b-se with llama-7b-se-rm-peft to create the reward model, does llama-7b-se-rm-peft refer to the model outputted after running Step 2 (with llama-7b-se)?

@mnoukhov
Contributor Author

That's correct.

base huggyllama + llama-7b-se-peft = llama-7b-se
base llama-7b-se + llama-7b-se-rm-peft = llama-7b-se-rm
base llama-7b-se + llama-7b-se-rl-peft = llama-7b-se-rl

@dh2shin

dh2shin commented Jun 26, 2023

Do you mind sharing the arguments / shell script you used for each step? I'm using what's listed in the repo and I'm running into memory issues, which seems odd given PEFT + LoRA.

@mnoukhov
Contributor Author

I use the same hyperparameters as those listed, with the slight change that I'm running on 4 GPUs instead of 8, so I change gradient accumulation steps from 4 to 8. I find that I need ~40GB of GPU memory for the RL fine-tuning step, but usage fluctuates and can get as high as 60GB per GPU. I'm currently working on #436, which should reduce memory requirements enough to allow training on 32GB GPUs.
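
(For intuition, that change just keeps the effective batch size constant; a minimal check below, where per_device_batch_size is a placeholder rather than a value from the scripts.)

```python
# Minimal check that halving the GPU count while doubling gradient accumulation
# keeps the effective batch size the same. per_device_batch_size is a
# placeholder, not a value from the StackLLaMA scripts.
per_device_batch_size = 4  # hypothetical

effective_8_gpus = 8 * per_device_batch_size * 4  # 8 GPUs, grad accumulation 4
effective_4_gpus = 4 * per_device_batch_size * 8  # 4 GPUs, grad accumulation 8

assert effective_8_gpus == effective_4_gpus
```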

@dh2shin

dh2shin commented Jun 27, 2023

I'm seeing similar behavior. About ~40GB of GPU memory is also needed for the second step (training the reward model), right?

@mnoukhov
Contributor Author

You can check the exact memory and compute usage by looking at the runs linked in the wandb report (e.g. my RLHF run shows memory consumption in the "System" charts).

Given that I've essentially repro'd the results and my PRs have been merged, I'm closing this issue and continuing in #471 with a repro using the more compute-efficient multi-adapter paradigm. Feel free to keep commenting about the repro and I'll try to respond.

@dh2shin

dh2shin commented Jun 29, 2023

Hi Michael, continuing the conversation from #401 here. When I try to run the supervised finetuning script out of the box, I get the following warning messages:
```
Training...
Using pad_token, but it is not set yet.
UserWarning: You passed `packing=True` to the SFTTrainer, and you are training your model with `max_steps` strategy. The dataset will be iterated until the `max_steps` are reached.
  warnings.warn(
FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
Token indices sequence length is longer than the specified maximum sequence length for this model (4899 > 2048). Running this sequence through the model will result in indexing errors
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
```

I'm getting a sudden spike in train loss around 800 global steps in, and I'm wondering if these warning messages have anything to do with it. Any ideas?

@mnoukhov
Contributor Author

None of the messages are related. There are other issues about instability in training, and without more info it's hard to diagnose the problem. If you want advice specific to your situation, it would help to know things like reward, KL, etc.

If you just want to try some things out, #462 found that setting a larger mini-batch size and a larger target KL could help.
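
For illustration, these are the kinds of knobs involved, using trl's PPOConfig from around that time; the specific values below are placeholders, not the settings from #462 or from this repro:

```python
# Illustrative sketch only: parameter names follow trl's PPOConfig of that era,
# but the values are placeholders, not the settings used in #462 or this repro.
from trl import PPOConfig

config = PPOConfig(
    model_name="llama-7b-se",       # merged SFT model from the steps above
    learning_rate=1.4e-5,
    batch_size=32,
    mini_batch_size=8,              # larger mini-batch than the default of 1
    gradient_accumulation_steps=4,
    ppo_epochs=4,
    init_kl_coef=0.02,              # lower KL coefficient, as in this repro
    adap_kl_ctrl=True,              # adaptive KL controller
    target=10.0,                    # larger adaptive-KL target
    seed=0,
)
```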

@dh2shin

dh2shin commented Jul 5, 2023

@mnoukhov Hi Michael, when using the reward model to build the sentiment pipeline in rl_training.py, the output rewards/scores vary drastically depending on the batch_size used in sent_kwargs. I'm wondering if you have investigated this issue in more depth.

I'm also wondering whether, in reward_modeling.py, the padding strategy should be True or "max_length". Currently, I get the following warning message:
UserWarning: max_length is ignored when padding=True and there is no truncation strategy. To pad to max length, use padding='max_length'.
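
Based on the warning's own suggestion, I'm considering something like the following (my own guess, not a confirmed fix): pad to a fixed max_length so the scores don't depend on how long the other items in the batch happen to be.

```python
# Guess, not a confirmed fix: follow the warning's suggestion and pad to a
# fixed length so reward scores don't depend on batch composition.
# The keys roughly mirror the sent_kwargs in rl_training.py; batch_size and
# max_length here are placeholder values.
sent_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",
    "batch_size": 16,            # placeholder
    "truncation": True,
    "max_length": 512,           # placeholder; match the training max length
    "padding": "max_length",     # instead of padding=True (pad-to-longest)
}
```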

@wangzhao88

wangzhao88 commented Aug 3, 2023

Hi, I used your RL model (https://huggingface.co/mnoukhov/llama-7b-se-rl-peft) to test the SuperGLUE benchmark using the lm-evaluation-harness. The results are as follows:

Original Model Results:

| Task    | Version | Metric | Value  |   | Stderr |
|---------|---------|--------|--------|---|--------|
| boolq   | 1       | acc    | 0.7642 | ± | 0.0074 |
| cb      | 1       | acc    | 0.5536 | ± | 0.0670 |
|         |         | f1     | 0.4248 |   |        |
| copa    | 0       | acc    | 0.8800 | ± | 0.0500 |
| multirc | 1       | acc    | 0.0084 | ± | 0.0018 |
| record  | 0       | f1     | 0.9119 | ± | 0.0032 |
|         |         | em     | 0.9044 | ± | 0.0032 |
| rte     | 0       | acc    | 0.6282 | ± | 0.0301 |
| wic     | 0       | acc    | 0.4953 | ± | 0.0198 |
| wsc     | 0       | acc    | 0.5673 | ± | 0.0474 |

However, when I trained my own RL model using the following command:

```
accelerate launch --multi_gpu --num_machines 1 --num_processes 8 rl_training.py --model_name=【mnoukhov se model】 --reward_model_name=【mnoukhov rm model】 --adafactor=False --tokenizer_name=【mnoukhov se model】 --save_freq=100 --output_max_length=128 --batch_size=8 --gradient_accumulation_steps=8 --batched_gen=True --ppo_epochs=4 --seed=0 --learning_rate=1.4e-5 --early_stopping=True --output_dir=llama-se-rl-finetune-128-8-8-1.4e-5_adam
```

After one day of training, the result of my own RL model (llama-se-rl-finetune-128-8-8-1.4e-5_adam, trained up to step 1300) is as follows:

Trained Model Results:

| Task    | Version | Metric | Value  |   | Stderr |
|---------|---------|--------|--------|---|--------|
| boolq   | 1       | acc    | 0.3783 | ± | 0.0085 |
| cb      | 1       | acc    | 0.4107 | ± | 0.0663 |
|         |         | f1     | 0.1941 |   |        |
| copa    | 0       | acc    | 0.5500 | ± | 0.0500 |
| multirc | 1       | acc    | 0.0031 | ± | 0.0018 |
| record  | 0       | f1     | 0.1186 | ± | 0.0032 |
|         |         | em     | 0.1151 | ± | 0.0032 |
| rte     | 0       | acc    | 0.5271 | ± | 0.0301 |
| wic     | 0       | acc    | 0.5000 | ± | 0.0198 |
| wsc     | 0       | acc    | 0.6346 | ± | 0.0474 |

It appears that the training of the model did not achieve the desired performance.


@lvwerra
Member

lvwerra commented Aug 3, 2023

Can you share the logs from the RL training? E.g. mean rewards and objective/kl are usually helpful metrics to look at to see if the model learned something.

@wangzhao88

wangzhao88 commented Aug 3, 2023

> Can you share the logs from the RL training? E.g. mean rewards and objective/kl are usually helpful metrics to look at to see if the model learned something.

Hi, here are the logs: https://wandb.ai/630191510/trl/runs/eb02d7zh?workspace=user-630191510

@lvwerra
Member

lvwerra commented Aug 3, 2023

Looks like there was an issue at step ~50: the reward went down significantly. Could you try a different seed or a lower learning rate? Also, we added some stability measures in the latest release, so try updating trl.

@wangzhao88

Hello! Here is the latest log: https://wandb.ai/630191510/trl/runs/kj2kkbq9?workspace=user-630191510

The loss curve looks normal, and the accuracy in SuperGLUE is also normal.

I suspect the key difference between the two branches is in ppo_trainer.py.

@lvwerra
Member

lvwerra commented Aug 7, 2023

That's great! So updating helped?

@wangzhao88

wangzhao88 commented Aug 8, 2023

Can you tell me the differences between branch d78d917 and branch e448bb6 in terms of PPO training?
Their ppo_trainer.py files are very similar.

@zhangfudiyi

Great work!
