multi-gpu training triggers CUDA out of memory error #2456
Comments
Hi! Thanks for your contribution! Great first issue!
Hi, what are the outputs of your validation_step? If there are any large tensors, it's likely they get synced back to the root GPU by #2434. We're working on that. cc @williamFalcon ^^
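For illustration, a minimal sketch (hypothetical module, not code from this thread) of the failure mode described above:

```python
import torch.nn.functional as F
import pytorch_lightning as pl


class ExampleModule(pl.LightningModule):
    # Hypothetical sketch: under DDP, large tensors returned from
    # validation_step may get synced back to the root GPU (#2434),
    # inflating memory on device 0.
    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        # Risky: 'logits' can be large (batch x num_classes, or bigger
        # intermediates); returning it means it may be gathered on GPU 0.
        # return {'val_loss': loss, 'logits': logits}
        # Safer: return only small, already-reduced values.
        return {'val_loss': loss}
```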
Hi - I actually haven't implemented the validation step yet; this occurs just on the training side.
What is your GPU memory consumption on a single GPU (used/available)?
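For anyone checking this, standard PyTorch calls report used vs. available memory on a single device:

```python
import torch

# Memory on GPU 0: bytes held by live tensors, bytes reserved by the
# caching allocator, and the device's total capacity.
used = torch.cuda.memory_allocated(0)
reserved = torch.cuda.memory_reserved(0)
total = torch.cuda.get_device_properties(0).total_memory
print(f"used={used / 1e9:.2f} GB, reserved={reserved / 1e9:.2f} GB, "
      f"total={total / 1e9:.2f} GB")
```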
Fixed it! I was calling .to('cuda') manually; removing that call resolved the OOM.
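For anyone hitting the same thing: hard-coding .to('cuda') can resolve to the default device in every DDP process, piling tensors onto GPU 0. A minimal sketch (hypothetical module, not the reporter's actual code) of the anti-pattern and a fix:

```python
import torch.nn.functional as F
import pytorch_lightning as pl


class ExampleModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        x, y = batch
        # Anti-pattern: '.to("cuda")' means the default CUDA device, so
        # under ddp multiple processes can push tensors onto GPU 0:
        # x = x.to('cuda')
        # Lightning already moves batches to the right device; if a manual
        # move is unavoidable, use the module's own device instead:
        x = x.to(self.device)
        loss = F.cross_entropy(self(x), y)
        return loss
```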
@justusschock does that mean we should add back the all-reduce for val?
No, there were other issues with that as well :D Let's just keep it out for now. |
I have the same issue but couldn't solve it by removing .to('cuda'). When I do this, I get an error:
I have the same issue. When I use 2 nodes everything seems fine. However, when I try to increase the number of nodes, it causes a CUDA out-of-memory error!
Hi -
I am running into issues when going from single to multi-gpu training. Specifically, if I switch the line
```python
pl.Trainer(gpus=1, precision=16, distributed_backend='ddp')
```
to
```python
pl.Trainer(gpus=4, precision=16, distributed_backend='ddp')
```
I get the dreaded CUDA out of memory error. Is there any reason why the parallelism causes the GPU to receive more data?
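For context, a sketch (with hypothetical model and dataset names standing in for the reporter's code) of how DDP is expected to behave: each of the four processes owns one GPU and loads its own batches, so batch_size is per process and per-GPU memory should match the single-GPU run.

```python
from torch.utils.data import DataLoader
import pytorch_lightning as pl

# Hypothetical names: 'MyModel' and 'train_dataset' are placeholders.
# Under ddp, batch_size is per GPU, so the effective global batch is
# batch_size * num_gpus while per-device memory stays roughly the same
# as the single-GPU run.
model = MyModel()
train_loader = DataLoader(train_dataset, batch_size=32, num_workers=4)

trainer = pl.Trainer(gpus=4, precision=16, distributed_backend='ddp')
trainer.fit(model, train_loader)
```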