Training on 8 A100 40GB GPUs? #19
Comments
I found that it can only run on my GPUs when the batch size is 1. Can you tell me what your training resource configuration is? Is it because my computing resources are insufficient?
I normally use 8 A100 80GB GPUs for training. If your GPU memory is limited, you can turn on gradient accumulation and FSDP.
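Since vl_pretrain.sh already passes `--bf16 True`, the flags below assume a HuggingFace Trainer-style setup; this is only a rough sketch of how gradient accumulation and FSDP could be turned on, not the repo's actual configuration.

```python
# Sketch only: enable gradient accumulation and FSDP via
# transformers.TrainingArguments (assumed setup; adapt to vl_pretrain.sh).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=1,    # keep the per-GPU batch small
    gradient_accumulation_steps=16,   # effective batch = 1 x 16 x num_gpus
    fsdp="full_shard auto_wrap",      # shard params/grads/optimizer state across GPUs
    bf16=True,                        # matches --bf16 True in vl_pretrain.sh
)
```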
Thank you for your reply! Can this model be trained with fp16 or bf16? I found `--bf16 True` in your vl_pretrain.sh, but an error occurs when I train with fp16 or bf16. It seems to be because the underlying implementation of ddetr only supports float32; the same issue has been reported in its repository. Have you ever tried training with fp16 or bf16?
Yes, the model is trained with bf16.
Thank you for your response. I did follow that. In my previous attempts, running ddetr in a non-float32 dtype would report an error. In the end, I had to force ddetr to run in fp32 via @force_fp32 and convert some of the input types in roi_align.py so that the model could be trained in bf16.
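For anyone hitting the same error, here is a hypothetical sketch of the roi_align input-type workaround described above: cast the bf16 feature map and RoIs to float32 before the op, then cast the result back. torchvision's roi_align is used for illustration; the repo's roi_align.py and the helper name here are assumptions, not its actual code.

```python
# Illustrative workaround: run RoIAlign in fp32 while the rest of the model is bf16.
import torch
from torchvision.ops import roi_align

def roi_align_bf16_safe(features, rois, output_size, spatial_scale=1.0, sampling_ratio=-1):
    out = roi_align(
        features.float(),   # bf16 -> fp32 for the CUDA kernel
        rois.float(),
        output_size,
        spatial_scale,
        sampling_ratio,
    )
    return out.to(features.dtype)  # cast back so downstream layers stay in bf16
```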
Thanks for your feedback. Do you mean all the model parameters are trained in fp32 even though you set bf16=True, or just the ddetr parameters? bf16 works in my local environment, and turning on bf16 significantly boosts the training speed compared with leaving it off. It is also reasonable to see features converted to fp32 before they are fed into the ddetr transformer, since the underlying CUDA implementation only supports float32.
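A minimal sketch of that fp32 conversion, assuming a torch.cuda.amp-style mixed-precision setup (the function and the `ddetr` module name are illustrative, not the repo's API): disable autocast locally and up-cast the features before the deformable-attention kernels.

```python
# Keep the deformable-DETR forward in fp32 while the rest of the model runs in bf16.
import torch

def run_ddetr_in_fp32(ddetr, features, *args, **kwargs):
    with torch.cuda.amp.autocast(enabled=False):
        # the custom deformable-attention CUDA kernels only accept float32
        return ddetr(features.float(), *args, **kwargs)
```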
Is it feasible to train with 8 A100 40GB GPUs? I encountered GPU out-of-memory errors during the pre-training phase.