Increase in GPU memory usage with Pytorch-Lightning #1376

Closed
VitorGuizilini opened this issue Apr 4, 2020 · 7 comments · Fixed by #2029
Labels
feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@VitorGuizilini
Contributor

Over the last week I have been porting my monocular depth estimation code to Pytorch-Lightning, and everything is working perfectly. However, my models seem to require more GPU memory than before, to the point where I need to significantly decrease the batch size at training time. These are the Trainer parameters I am using, and the relevant versions:

FROM nvidia/cuda:10.1-devel-ubuntu18.04
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV CUDNN_VERSION=7.6.5.32-1+cuda10.1
ENV NCCL_VERSION=2.4.8-1+cuda10.1
ENV PYTORCH_LIGHTNING_VERSION=0.7.1
cfg.arch.gpus = 8
cfg.arch.num_nodes = 1
cfg.arch.num_workers = 8
cfg.arch.distributed_backend = 'ddp'
cfg.arch.amp_level = 'O0'
cfg.arch.precision = 32
cfg.arch.benchmark = True 
cfg.arch.min_epochs = 1
cfg.arch.max_epochs = 50
cfg.arch.checkpoint_callback = False
cfg.arch.callbacks = []
cfg.arch.gradient_clip_val = 0.0
cfg.arch.accumulate_grad_batches = 1
cfg.arch.val_check_interval = 1.0
cfg.arch.check_val_every_n_epoch = 1
cfg.arch.num_sanity_val_steps = 0
cfg.arch.progress_bar_refresh_rate = 1
cfg.arch.fast_dev_run = False
cfg.arch.overfit_pct = 0.0
cfg.arch.train_percent_check = 1.0
cfg.arch.val_percent_check = 1.0
cfg.arch.test_percent_check = 1.0
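
For completeness, this is roughly how those values get passed to the Trainer (a minimal sketch using the argument names listed above; model stands in for my LightningModule and only the most relevant arguments are shown):

# Rough sketch only: the cfg fields above mirror the Trainer arguments,
# and model is a placeholder for my LightningModule.
from pytorch_lightning import Trainer

trainer = Trainer(
    gpus=cfg.arch.gpus,
    num_nodes=cfg.arch.num_nodes,
    distributed_backend=cfg.arch.distributed_backend,
    amp_level=cfg.arch.amp_level,
    precision=cfg.arch.precision,
    benchmark=cfg.arch.benchmark,
    min_epochs=cfg.arch.min_epochs,
    max_epochs=cfg.arch.max_epochs,
    checkpoint_callback=cfg.arch.checkpoint_callback,
    gradient_clip_val=cfg.arch.gradient_clip_val,
    accumulate_grad_batches=cfg.arch.accumulate_grad_batches,
    num_sanity_val_steps=cfg.arch.num_sanity_val_steps,
)
trainer.fit(model)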

This is probably why I am having trouble replicating my results. Could you please advise on possible solutions? I will open-source the code as soon as I manage to reproduce my current results.

VitorGuizilini added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Apr 4, 2020
@github-actions
Contributor

github-actions bot commented Apr 4, 2020

Hi! Thanks for your contribution, and great first issue!

Borda added the feature (Is an improvement or enhancement) and information needed labels, and removed the bug (Something isn't working) label, on Apr 5, 2020
@Borda
Member

Borda commented Apr 5, 2020

Hi @vguizilini, could you be more specific about how much more memory is required?

@williamFalcon
Contributor

williamFalcon commented Apr 5, 2020

@jeremyjordan, can we get that memory profiler?
@vguizilini, mind trying again from master?
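
(For a quick first pass, one way to compare peak memory between the two runs is with plain torch.cuda counters; this is just a rough sketch, not the profiler mentioned above:)

import torch

torch.cuda.reset_max_memory_allocated()   # reset the peak-memory counter on this device
# ... run one forward/backward pass of the model here ...
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"peak GPU memory on this rank: {peak_gib:.2f} GiB")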

@VitorGuizilini
Contributor Author

Memory usage for my original implementation (Horovod for distributed training):

[image: GPU memory usage, original Horovod implementation]

Memory usage for my Pytorch-Lightning implementation (DDP):

[image: GPU memory usage, Pytorch-Lightning DDP implementation]

I'm loading the same configuration and the same networks in both.
I'm still learning to use Pytorch-Lightning; what should I profile next?
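
One thing I could add on my side is per-rank memory logging inside training_step, along these lines (rough sketch; DepthModel and compute_loss are placeholders for my actual module and loss computation):

import torch
import torch.distributed as dist
import pytorch_lightning as pl

class DepthModel(pl.LightningModule):             # placeholder module name
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)           # placeholder for the real loss
        if batch_idx % 100 == 0:
            rank = dist.get_rank() if dist.is_initialized() else 0
            allocated = torch.cuda.memory_allocated() / 2**20     # MiB currently held by tensors
            peak = torch.cuda.max_memory_allocated() / 2**20      # MiB peak since the start of the run
            print(f"rank {rank}: allocated={allocated:.0f} MiB, peak={peak:.0f} MiB")
        return {'loss': loss}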

@jeremyjordan
Contributor

@neggert or @williamFalcon, any ideas why GPU memory isn't consistent across the nodes?

@VitorGuizilini
Contributor Author

Following up on this issue, is there anything else I should provide to facilitate debugging?
