
Problem finetuning llama and baichuan with newer transformers versions #26816

Closed
gxy-gxy opened this issue Oct 15, 2023 · 6 comments

Comments

gxy-gxy commented Oct 15, 2023

When I tried to finetune a llama model on the ShareGPT dataset, I got these loss curves:
[figure: training loss curves; green = transformers 4.33.2, orange = transformers 4.28.1]
The green loss curve was trained with transformers 4.33.2 and the orange one with transformers 4.28.1. Obviously, the green one is abnormal and the orange one is correct. Why does this happen? The only thing I changed was the transformers version. Is this a bug in transformers, or did I do something wrong?

gxy-gxy commented Oct 15, 2023

I also observed this phenomenon when I tried to fine-tune a baichuan model.
Here is the loss curve trained with transformers 4.32:
[figure: baichuan training loss curve, transformers 4.32]

This is the loss curve trained with transformers 4.28:
[figure: baichuan training loss curve, transformers 4.28]

gxy-gxy commented Oct 15, 2023

I finetuned all the models above with the training code from the FastChat repository on A100-80G GPUs.
Here is my command:

torchrun --nproc_per_node=8 --master_port=20001 fastchat/train/train_xformers.py  \
    --model_name_or_path llama-7b \
    --data_path fschat.json \
    --bf16 True \
    --output_dir output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --save_strategy "epoch" \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 4096 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb
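
The only thing I change between runs is the transformers version, roughly like this (a minimal sketch; the pins are just the versions mentioned above, and your environment manager may differ):

    # check which transformers version is currently active
    python -c "import transformers; print(transformers.__version__)"
    # the orange (normal) curve comes from the older release
    pip install "transformers==4.28.1"
    # the green (abnormal) curve comes from the newer release
    pip install "transformers==4.33.2"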

ArthurZucker (Collaborator) commented

Hey 🤗 thanks for opening an issue! We try to keep GitHub issues for bugs and feature requests. A similar issue is being tracked in #26498, where you can find some good tips!

Otherwise, could you ask your question on the forum instead? I'm sure the community will be of help!

Thanks!

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

xhan77 commented Nov 21, 2023

I was using transformers 4.33.2 (along with FSDP as implemented in PyTorch and the accelerate package from HF) and also observed this issue when pretraining llama from scratch: the loss quickly fails when using fsdp+bf16. There is no issue with fsdp+fp32 or ddp+bf16. I upgraded to 4.35.2 and the issue seems to be resolved, though I don't know the exact reason behind it.

Before upgrading transformers, I incorporated many tips from #26498, but they didn't help much in my case.
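
For anyone hitting the same failure mode, a rough sketch of the check/upgrade that worked on my side (the pin below is just the version that happened to fix it for me; torch and accelerate versions matter too since they provide fsdp and mixed precision):

    # see which versions of the relevant packages are installed
    pip list | grep -E "torch|accelerate|transformers"
    # upgrading transformers alone seemed to stabilize fsdp+bf16 in my case
    pip install --upgrade "transformers==4.35.2"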

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
