Training on a GPU other than GPU #0 allocates a ~500 MB chunk of memory on GPU #0. That memory is completely unused and should not be allocated at all. Debugging shows that the initial allocation happens at this line: https://github.com/williamFalcon/pytorch-lightning/blob/2f01c03b38fc16618aa9839d39e0ae5a142c0559/pytorch_lightning/trainer/trainer.py#L517

A bit of research led me to this issue in the PyTorch repo: pytorch/pytorch#25752. That is not the only place where `empty_cache` is called, by the way; I did not check, but the other call sites probably behave the same way.

For now I have duct-tape-fixed it for myself by running my script with `CUDA_VISIBLE_DEVICES=2` and setting `gpus=[0]`. I am not sure how to fix it properly, though, and would be glad if someone could take a look.

This is primarily a PyTorch problem, but a fix on our side could be to always call `torch.cuda.set_device` somewhere before our first `empty_cache` call. I'm not sure what other side effects that might have. Feel free to submit a PR if you'd like.
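A minimal sketch of that suggestion, assuming a hypothetical helper named `safe_empty_cache` (this is not actual pytorch-lightning code, just an illustration of the idea):

```python
import torch


def safe_empty_cache(device_index: int) -> None:
    """Free cached CUDA memory on the given device without touching GPU 0.

    Calling torch.cuda.empty_cache() before any device has been selected
    initializes the CUDA context on GPU 0, allocating a few hundred MB
    there (see pytorch/pytorch#25752). Selecting the target device first
    should avoid that.
    """
    if not torch.cuda.is_available():
        return  # nothing to free on CPU-only machines
    torch.cuda.set_device(device_index)
    torch.cuda.empty_cache()
```

With the `CUDA_VISIBLE_DEVICES=2` workaround described above, the process only sees physical GPU 2, which is remapped to `cuda:0` inside the process, so `gpus=[0]` refers to it and the real GPU 0 is never touched.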