Bug: Finetuning on multi-GPU (FSDP) does not initialize with the foundation model #652
Comments
I had no answer on that, but it seems to me a quite critical bug 🤔 It does not affect only LoRA: all the finetuning scripts seem to be affected. What makes it worse with LoRA compared to "full" finetuning is that you cannot even recover the model that was trained from scratch, given that only the LoRA weights are saved.
Hi @Jeronymous. The idea in that piece of code is to initialize the model randomly, because the pretrained weights are loaded afterwards: https://github.com/Lightning-AI/lit-gpt/blob/bf60124fa72a56436c7d4fecc093c7fc48e84433/finetune/lora.py#L147-L148 Can you verify that this is happening for you? cc @awaelchli
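For context, here is a rough sketch of the load order that comment describes (names approximated from the script, not a verbatim copy; the exact code is at the linked lines, and the checkpoint path is the one used later in this thread):

```python
import lightning as L
from lit_gpt.lora import GPT, Config
from lit_gpt.utils import load_checkpoint

fabric = L.Fabric(devices=4, strategy="fsdp")
fabric.launch()

config = Config.from_name("falcon-7b")
with fabric.init_module(empty_init=(fabric.world_size > 1)):
    model = GPT(config)  # parameters are NOT meaningful yet if empty_init=True

model = fabric.setup_module(model)

# The pretrained weights are only loaded here, after initialization.
# If this step silently fails to overwrite the (sharded) parameters,
# training starts from a random model -- which is what this issue reports.
load_checkpoint(fabric, model, "checkpoints/tiiuae/falcon-7b/lit_model.pth", strict=False)
```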
Yes, I checked:
We are using:
Maybe what happens (a guess) is that weight tensors are not allocated when `empty_init=True`. Also, I don't understand why the initialization strategy would be different between single-GPU and multi-GPU (the condition that selects `empty_init` depending on the number of devices).
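For illustration, a minimal, self-contained sketch (plain PyTorch, not lit-gpt code) of why empty initialization is only safe if a checkpoint load really overwrites every parameter afterwards:

```python
import torch
import torch.nn as nn

with torch.device("meta"):
    layer = nn.Linear(4, 4)  # no storage allocated yet

# Materialize without initialization: values are whatever was in memory.
layer = layer.to_empty(device="cpu")
print(layer.weight)  # garbage values, not trained weights

# Only after loading a state dict do the parameters become meaningful.
state = {"weight": torch.ones(4, 4), "bias": torch.zeros(4)}
layer.load_state_dict(state)
print(layer.weight)  # now the real (loaded) values
```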
@carmocca @awaelchli Here is evidence that it has a chance of being a bug (not a misuse on our part):
@Jeronymous I tried FSDP for the first time using the default code, because I was running into OOM with one GPU. The code would not proceed beyond setting the seed the first time (it stalled for about 2 hours).
Another thing I saw that was different (idk if this is important): single-GPU runs set the seed to the same number the second time, but the multi-GPU run did not. For now, I reduced my sample max_seq_len and switched to a single GPU. I'm using A10s; I would upgrade if I could, but AWS won't give me a single A100 (only 8 🤣). EDIT: The finetuning just randomly crashes and doesn't really output an error message when it does.
I have a similar issue. If I set `init_weight = True`, the loss is around 8 or 9, which means the model doesn't load the checkpoint successfully. I use torchrun to initiate the program, by the way; I cannot use `python3 xxx.py` because my machine is set up differently.
@DevasiaThomas your problems seem to be memory overflow (which is different from the issue opened here, which concerns model initialization on multi-GPU). I guess you should use a lower `micro_batch_size`. Multi-GPU shouldn't be of particular help, because it does not reduce the memory used by each GPU (the training should just consume samples faster; note that the actual batch size, i.e. the number of samples between two model updates, is then multiplied by the number of GPUs).
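For concreteness, a back-of-the-envelope check using the hyperparameters printed in the log further down this thread (this assumes `batch_size` is counted per device, which the logged `gradient_accumulation_iters = 128 / 4 = 32` suggests):

```python
devices = 4
micro_batch_size = 4               # samples per forward/backward pass per GPU
gradient_accumulation_iters = 32   # micro-batches between optimizer steps

per_device_batch = micro_batch_size * gradient_accumulation_iters  # 128
global_batch = per_device_batch * devices                          # 512

# Adding GPUs scales the number of samples per update (throughput),
# but does not reduce the memory needed on each individual GPU.
print(per_device_batch, global_batch)
```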
Also here: #689 (comment). I never understood why this issue was labeled a "question"...
Sorry @Jeronymous. We'll look into this asap |
@Jeronymous I looked at this again. On lit-gpt main with Lightning 2.1.2 and PyTorch 2.2 nightly, I see the finetuning scripts working fine with FSDP (devices=2). The model gets loaded correctly and the loss quickly converges to < 1.0. I don't see anything obviously wrong here. Then, given your info here #652 (comment), I checked out lit-gpt and lightning at this commit and ran again, with the same observations. I did this with default settings, meaning it uses the stabilityai/stablelm-base-alpha-3b checkpoint. Please share any changes you've made to lit-gpt and the scripts locally, and the checkpoint/model family you are loading.
Of course that's still possible. You can always merge the LoRA weights onto the original checkpoint; that's by design. See generate/lora.py for how that's done. If you feel this is inconvenient, you can just change the line in the script to save the full checkpoint instead of just LoRA, by removing the filter:
@DevasiaThomas FYI: this warning can be ignored, and in the latest version of Lightning it won't appear anymore in this context.
There is also a script to merge LoRA weights with the pre-trained ones. |
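For reference, a hedged sketch of what merging means mathematically (the repository's merge script is the authoritative implementation; the helper below is illustrative only):

```python
import torch

def merge_lora_weight(w_base, lora_a, lora_b, alpha, r):
    """Fold the low-rank update into the base weight: W + (alpha / r) * B @ A."""
    return w_base + (alpha / r) * (lora_b @ lora_a)

# Toy example with hypothetical shapes
w = torch.randn(8, 8)   # base weight: out_features x in_features
a = torch.randn(4, 8)   # lora_A: r x in_features
b = torch.randn(8, 4)   # lora_B: out_features x r
merged = merge_lora_weight(w, a, b, alpha=16, r=4)
```

After merging, the adapted layer behaves like a plain linear layer, so a full checkpoint can be saved and used without any LoRA-specific code.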
Thank you for having a look, @awaelchli. I am sorry that you can't reproduce it. I'm giving it another try with the most recent versions. I haven't modified lit-gpt, and we are finetuning Falcon-7B. Not important, but I think there was a misunderstanding on this:
I was talking about the bug I faced: it trained from a randomly initialized model (instead of the original checkpoint), which I can't recover. So in this setting, I don't have the original model on which to apply the LoRA weights (to recover the full model that was trained from scratch).
So I gave it another try, and I continue having the same issue. I updated to the latest version:
I retried finetuning (or rather continual pretraining) of Falcon-7B, once with `empty_init=True` and once with `empty_init=False`.
The loss ranges are quite different... |
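One way to narrow this down (a suggested debugging aid, not existing lit-gpt code) would be to print a cheap fingerprint of the weights right after the checkpoint load and compare the two settings:

```python
import torch

def weight_fingerprint(model: torch.nn.Module) -> float:
    # Note: under FSDP each rank only sees its own shard, so compare
    # fingerprints rank-by-rank between the two empty_init settings.
    total = 0.0
    for param in model.parameters():
        total += param.detach().float().abs().sum().item()
    return total

# e.g. in finetune/lora.py, right after the checkpoint is loaded:
# fabric.print(f"weight fingerprint: {weight_fingerprint(model):.1f}")
```

A randomly initialized model and a correctly loaded pretrained checkpoint should give clearly different fingerprints.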
Seeing this, I checked again: I re-downloaded the Falcon 7B checkpoint using the download script. Please note that for a 7B pretrained checkpoint, the CE loss should be around ~2.5, and for a randomly initialized model it would definitely be larger than 7.
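As a quick sanity check on those numbers: a randomly initialized LM predicts (near-)uniformly over the vocabulary, so its cross-entropy should sit near the log of the vocabulary size, far above the ~2.5 of a trained 7B checkpoint:

```python
import math

vocab_size = 65024            # falcon-7b padded vocab size, per the log below
print(math.log(vocab_size))   # ~11.08 nats for uniform predictions
```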
Also, please share the printed output of these two lines: |
I also tried to do LoRA fine-tuning: the latest code from the main branch, the latest packages. I used 4 GPUs.
This is what I got when I tried to fine-tune:
python finetune/lora.py --checkpoint_dir checkpoints/tiiuae/falcon-7b --precision bf16-true
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'eval_max_new_tokens': 100, 'log_interval': 1, 'devices': 4, 'learning_rate': 0.0003, 'batch_size': 128, 'micro_batch_size': 4, 'gradient_accumulation_iters': 32, 'max_iters': 10, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
[rank: 0] Seed set to 1337
[rank: 1] Seed set to 1337
[rank: 3] Seed set to 1337
[rank: 2] Seed set to 1337
Loading model 'checkpoints/tiiuae/falcon-7b/lit_model.pth' with {'name': 'falcon-7b', 'hf_config': {'org': 'tiiuae', 'name': 'falcon-7b'}, 'block_size': 2048, 'vocab_size': 65024, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 1, 'shared_attention_norm': True, '_norm_class': 'LayerNorm', 'norm_eps': 1e-05, '_mlp_class': 'GptNeoxMLP', 'gelu_approximate': 'none', 'intermediate_size': 18176, 'rope_condense_ratio': 1, 'rope_base': 10000, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False, 'head_size': 64, 'rope_n_elem': 64}
Number of trainable parameters: 3,506,176
Number of non trainable parameters: 7,217,189,760
[rank: 3] Seed set to 1340
[rank: 0] Seed set to 1337
[rank: 2] Seed set to 1339
[rank: 1] Seed set to 1338
The longest sequence length in the train data is 1079, the model's maximum sequence length is 1079 and context length is 2048
iter 1 step 0: loss 1.7293, iter time: 10214.60ms
iter 2 step 0: loss 2.5372, iter time: 5275.93ms
iter 3 step 0: loss 2.3912, iter time: 5251.75ms
iter 4 step 0: loss 2.3706, iter time: 5457.99ms
iter 5 step 0: loss 2.1239, iter time: 5294.34ms
iter 6 step 0: loss 2.3765, iter time: 5302.96ms
iter 7 step 0: loss 2.0163, iter time: 5307.21ms
iter 8 step 0: loss 1.8228, iter time: 5372.66ms
iter 9 step 0: loss 2.7029, iter time: 5237.35ms
iter 10 step 0: loss 2.1403, iter time: 5389.18ms
Training time: 58.26s
Memory used: 20.53 GB
Saving LoRA weights to 'out/lora/alpaca/lit_model_lora_finetuned.pth'
The loss values were exactly the same in both cases.
When experimenting with the adaptation of Falcon on multi-GPU using finetune/lora.py, we got surprisingly bad results.
After investigation, we realized that we were actually training a randomly initialized model
(while only checkpointing the LoRA weights, so the model trained from scratch was simply lost...).
In other words, the foundation model (Falcon) was not properly loaded.
It seems to be due to the use of
fabric.init_module(empty_init=True)
at this line: https://github.com/Lightning-AI/lit-gpt/blob/bf60124fa72a56436c7d4fecc093c7fc48e84433/finetune/lora.py#L128
If we use
empty_init=False
it trains correctly. I am not sure it's the right fix, though.
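For clarity, the workaround amounts to this change (an approximation of the relevant lines; the exact surrounding code is in the linked script):

```python
# Before -- parameters may be left unallocated, relying entirely on the
# later checkpoint load to fill them in:
#
#   with fabric.init_module(empty_init=True):
#       model = GPT(config)

# After -- materialize and initialize the weights immediately; with this
# change, the pretrained checkpoint was picked up correctly on multi-GPU:
with fabric.init_module(empty_init=False):
    model = GPT(config)
```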