
apex hangs on cudaFree #599

Open
chengmengli06 opened this issue Nov 12, 2019 · 1 comment

Comments

@chengmengli06
Model parallel with 3 GPUs: with an input of 512x512x280 it runs, but with an input of 512x512x320 it hangs.

The stack trace:
#0 __vdso_clock_gettime (clock=4, ts=0x7f9170ffabc0) at arch/x86/vdso/vclock_gettime.c:256
#1 0x00007f928421896d in clock_gettime () from /lib64/libc.so.6
#2 0x00007f917fc5d01e in ?? () from /lib64/libcuda.so.1
#3 0x00007f917fd18fc7 in ?? () from /lib64/libcuda.so.1
#4 0x00007f917fc0531c in ?? () from /lib64/libcuda.so.1
#5 0x00007f917fb4141c in ?? () from /lib64/libcuda.so.1
#6 0x00007f917fca2370 in cuMemFree_v2 () from /lib64/libcuda.so.1
#7 0x00007f9272e1d690 in ?? () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libcudart-1b201d85.so.10.1
#8 0x00007f9272e550c1 in cudaFree () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libcudart-1b201d85.so.10.1
#9 0x00007f92796128c6 in c10::cuda::CUDACachingAllocator::THCCachingAllocator::free_blocks(std::set<c10::cuda::CUDACachingAllocator::(anonymous namespace)::Block*, bool (*)(c10::cuda::CUDACachingAllocator::(anonymous namespace)::Block const*, c10::cuda::CUDACachingAllocator::(anonymous namespace)::Block const*), std::allocator<c10::cuda::CUDACachingAllocator::(anonymous namespace)::Block*> >&, std::_Rb_tree_const_iterator<c10::cuda::CUDACachingAllocator::(anonymous namespace)::Block*>, std::_Rb_tree_const_iterator<c10::cuda::CUDACachingAllocator::(anonymous namespace)::Block*>) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libc10_cuda.so
#10 0x00007f9279615971 in c10::cuda::CUDACachingAllocator::THCCachingAllocator::malloc(void**, unsigned long, CUstream_st*) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libc10_cuda.so
#11 0x00007f9279616e6e in c10::cuda::CUDACachingAllocator::CudaCachingAllocator::allocate(unsigned long) const () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libc10_cuda.so
#12 0x00007f919374c7e9 in at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#13 0x00007f9192199538 in at::CUDAType::(anonymous namespace)::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#14 0x00007f9191c52c28 in torch::autograd::VariableType::(anonymous namespace)::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#15 0x00007f918fba8521 in at::native::to_impl(at::Tensor const&, c10::TensorOptions const&, bool) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#16 0x00007f918fba8ec2 in at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#17 0x00007f918fed1b40 in at::TypeDefault::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#18 0x00007f9191a64873 in torch::autograd::VariableType::(anonymous namespace)::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#19 0x00007f9191d56339 in torch::autograd::CopyBackwards::apply(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> >&&) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#20 0x00007f9191d4dc16 in torch::autograd::Node::operator()(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> >&&) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#21 0x00007f9191d47227 in torch::autograd::Engine::evaluate_function(torch::autograd::NodeTask&) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#22 0x00007f9191d49234 in torch::autograd::Engine::thread_main(torch::autograd::GraphTask*) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#23 0x00007f927a251d4a in torch::autograd::python::PythonEngine::thread_init(int) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch_python.so
#24 0x00007f927ad75ecf in execute_native_thread_routine () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so
#25 0x00007f9284de9e25 in start_thread () from /lib64/libpthread.so.0
#26 0x00007f9284202bad in clone () from /lib64/libc.so.6

@mcarilli
Contributor

mcarilli commented Nov 12, 2019

I can't tell whether this has anything to do with Apex, so I would try running without Apex first. If you are doing any form of distributed training, you should prefer torch.nn.parallel.DistributedDataParallel over apex.parallel.DistributedDataParallel. The native Torch DistributedDataParallel is guaranteed to be compatible with whatever version of PyTorch you're using. Aside from some constructor arguments, they are drop-in replacements for each other, as shown here.
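A minimal sketch of the swap described above, assuming a typical single-process-per-GPU launch (e.g. via torchrun); the `local_rank` handling and the `wrap_model` helper are illustrative, not from the original issue:

```python
import torch
import torch.distributed as dist
import torch.nn as nn


def wrap_model(model: nn.Module, local_rank: int) -> nn.Module:
    """Wrap a model with the native DDP.

    The apex wrapper is typically constructed as:
        from apex.parallel import DistributedDataParallel as ApexDDP
        model = ApexDDP(model)
    whereas the native wrapper takes device placement explicitly:
    """
    return nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank
    )


if __name__ == "__main__":
    # Requires an initialized process group, e.g.:
    #   torchrun --nproc_per_node=3 this_script.py
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    model = nn.Linear(8, 8).cuda(local_rank)
    ddp_model = wrap_model(model, local_rank)
```

The rest of the training loop (forward, backward, optimizer step) is unchanged between the two wrappers.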
