
multi-gpu training triggers CUDA out of memory error #2456

Closed
griff4692 opened this issue Jul 1, 2020 · 10 comments · Fixed by #2462
Labels: bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)

Comments

@griff4692

Hi -

I am running into issues when going from single to multi-gpu training. Specifically, if I switch the line

pl.Trainer(gpus=1, precision=16, distributed_backend='ddp')

to

pl.Trainer(gpus=4, precision=16, distributed_backend='ddp')

I get the dreaded CUDA out of memory error. Is there any reason why the parallelism causes the GPU to receive more data?

griff4692 added the bug and help wanted labels on Jul 1, 2020
@github-actions
Contributor

github-actions bot commented Jul 1, 2020

Hi! Thanks for your contribution, great first issue!

@justusschock
Member

justusschock commented Jul 2, 2020

Hi, what are the outputs of your validation_step? If there are any large tensors, it's likely they get synced back to the root GPU by #2434. We're working on that.

cc @williamFalcon ^^
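
A minimal sketch (module, shapes, and names are illustrative, not from this issue) of keeping validation_step outputs small so there is nothing large to sync back to the root GPU:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    """Hypothetical module, only here to illustrate small validation outputs."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(32, 10)

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        # Return only small scalars here; returning large tensors (e.g. the
        # full logits for every batch) gets expensive if outputs are gathered
        # back to the root GPU across DDP processes.
        return {'val_loss': loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```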

@griff4692
Author

Hi - I actually haven't implemented the validation step yet. This just occurs on the training side.

@justusschock
Member

What is your GPU consumption on a single GPU (used/available)?

@griff4692
Author

On a single GPU, I am using 5/11 GB. The problem seems to be that when I switch over to multiple GPUs, there is an explosion of processes created on the first GPU. Any ideas what could be causing this?

[Screenshot: GPU usage after switching to multi-GPU, showing multiple processes on the first GPU]

@griff4692
Author

Fixed it! I was calling .to('cuda') on my input tensors in my Dataset's __getitem__ function, which caused all the data to be uploaded to the first GPU. Removing that solved the problem.
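
In case it helps anyone else, a minimal sketch (a made-up dataset, not the actual code from this issue) of what the fixed __getitem__ looks like:

```python
import torch
from torch.utils.data import Dataset


class MyDataset(Dataset):
    """Hypothetical dataset, only to show where the .to('cuda') call was hurting."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        x = torch.as_tensor(self.features[idx], dtype=torch.float32)
        y = torch.as_tensor(self.labels[idx])
        # Return plain CPU tensors. Calling x.to('cuda') here puts every
        # sample on GPU 0 from all DDP worker processes; the Trainer moves
        # each batch to that process's own GPU automatically.
        return x, y
```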

@williamFalcon
Contributor

@justusschock does that mean we should add back the all reduce for val?

@justusschock
Member

No, there were other issues with that as well :D Let's just keep it out for now.

@MuhammadWaleedUsman

Fixed it! I was calling .to('cuda') on my input tensors in my Dataset's __getitem__ function, which caused all the data to be uploaded to the first GPU. Removing that solved the problem.

I have the same issue but couldn't solve it by removing .to('cuda'). When I do this I get the error:
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
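
For reference, that RuntimeError means the input tensor is on the GPU while the model's weights are still on the CPU; a minimal sketch of the mismatch and the usual fix (hypothetical code, not the commenter's):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                 # weights still on the CPU
x = torch.randn(4, 10).to('cuda')        # input moved to the GPU by hand

# model(x) would raise:
# RuntimeError: Input type (torch.cuda.FloatTensor) and weight type
# (torch.FloatTensor) should be the same

# With Lightning, the usual fix is to move neither the batch nor the model
# yourself and let the Trainer handle device placement. Outside Lightning,
# put both on the same device explicitly:
model = model.to('cuda')
out = model(x)                           # now both live on the same device
```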

@ahmadikalkhorani

I have the same issue. When I use 2 nodes everything seems fine. However, when I try to increase the number of nodes it causes a CUDA out-of-memory error!
