Increase in GPU memory usage with Pytorch-Lightning #1376

Closed
VitorGuizilini opened this issue Apr 4, 2020 · 7 comments · Fixed by #2029
Labels
feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@VitorGuizilini
Contributor

Over the last week I have been porting my monocular depth estimation code to Pytorch-Lightning, and everything is working perfectly. However, my models seem to require more GPU memory than before, to the point where I need to significantly decrease the batch size at training time. These are the Trainer parameters I am using, and the relevant versions:

FROM nvidia/cuda:10.1-devel-ubuntu18.04
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV CUDNN_VERSION=7.6.5.32-1+cuda10.1
ENV NCCL_VERSION=2.4.8-1+cuda10.1
ENV PYTORCH_LIGHTNING_VERSION=0.7.1
cfg.arch.gpus = 8
cfg.arch.num_nodes = 1
cfg.arch.num_workers = 8
cfg.arch.distributed_backend = 'ddp'
cfg.arch.amp_level = 'O0'
cfg.arch.precision = 32
cfg.arch.benchmark = True 
cfg.arch.min_epochs = 1
cfg.arch.max_epochs = 50
cfg.arch.checkpoint_callback = False
cfg.arch.callbacks = []
cfg.arch.gradient_clip_val = 0.0
cfg.arch.accumulate_grad_batches = 1
cfg.arch.val_check_interval = 1.0
cfg.arch.check_val_every_n_epoch = 1
cfg.arch.num_sanity_val_steps = 0
cfg.arch.progress_bar_refresh_rate = 1
cfg.arch.fast_dev_run = False
cfg.arch.overfit_pct = 0.0
cfg.arch.train_percent_check = 1.0
cfg.arch.val_percent_check = 1.0
cfg.arch.test_percent_check = 1.0
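
For completeness, this is roughly how those values get passed to the Trainer (a minimal sketch using the argument names listed above; model stands in for my LightningModule and only the most relevant arguments are shown):

# Rough sketch only: the cfg fields above mirror the Trainer arguments,
# and model is a placeholder for my LightningModule.
from pytorch_lightning import Trainer

trainer = Trainer(
    gpus=cfg.arch.gpus,
    num_nodes=cfg.arch.num_nodes,
    distributed_backend=cfg.arch.distributed_backend,
    amp_level=cfg.arch.amp_level,
    precision=cfg.arch.precision,
    benchmark=cfg.arch.benchmark,
    min_epochs=cfg.arch.min_epochs,
    max_epochs=cfg.arch.max_epochs,
    checkpoint_callback=cfg.arch.checkpoint_callback,
    gradient_clip_val=cfg.arch.gradient_clip_val,
    accumulate_grad_batches=cfg.arch.accumulate_grad_batches,
    num_sanity_val_steps=cfg.arch.num_sanity_val_steps,
)
trainer.fit(model)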

This is probably why I am having trouble replicating my results. Could you please advise on possible solutions? I will open-source the code as soon as I manage to reproduce my current results.

VitorGuizilini added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Apr 4, 2020
@github-actions
Contributor

github-actions bot commented Apr 4, 2020

Hi! Thanks for your contribution, and great first issue!

Borda added the feature (Is an improvement or enhancement) and information needed labels, and removed the bug (Something isn't working) label, on Apr 5, 2020
@Borda
Member

Borda commented Apr 5, 2020

Hi @vguizilini, could you be more specific about how much more memory is required?

@williamFalcon
Contributor

williamFalcon commented Apr 5, 2020

@jeremyjordan, can we get that memory profiler?
@vguizilini, mind trying again from master?
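
(For a quick first pass, one way to compare peak memory between the two runs is with plain torch.cuda counters; this is just a rough sketch, not the profiler mentioned above:)

import torch

torch.cuda.reset_max_memory_allocated()   # reset the peak-memory counter on this device
# ... run one forward/backward pass of the model here ...
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"peak GPU memory on this rank: {peak_gib:.2f} GiB")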

@VitorGuizilini
Contributor Author

Memory usage for my original implementation (Horovod for distributed training):

[image: GPU memory usage, original Horovod implementation]

Memory usage for my Pytorch-Lightning implementation (DDP):

[image: GPU memory usage, Pytorch-Lightning DDP implementation]

I'm loading the same configuration and the same networks in both.
I'm still learning to use Pytorch-Lightning; what should I profile next?
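
One thing I could add on my side is per-rank memory logging inside training_step, along these lines (rough sketch; DepthModel and compute_loss are placeholders for my actual module and loss computation):

import torch
import torch.distributed as dist
import pytorch_lightning as pl

class DepthModel(pl.LightningModule):             # placeholder module name
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)           # placeholder for the real loss
        if batch_idx % 100 == 0:
            rank = dist.get_rank() if dist.is_initialized() else 0
            allocated = torch.cuda.memory_allocated() / 2**20     # MiB currently held by tensors
            peak = torch.cuda.max_memory_allocated() / 2**20      # MiB peak since the start of the run
            print(f"rank {rank}: allocated={allocated:.0f} MiB, peak={peak:.0f} MiB")
        return {'loss': loss}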

@jeremyjordan
Contributor

@neggert or @williamFalcon, any ideas why GPU memory isn't consistent across the nodes?

@VitorGuizilini
Contributor Author

Following up on this issue, is there anything else I should provide to facilitate debugging?
