
apex hangs on cudaFree #599

Open
chengmengli06 opened this issue Nov 12, 2019 · 1 comment

Comments

@chengmengli06
Model parallel with 3 GPUs: with an input of 512x512x280 it runs, but with an input of 512x512x320 it hangs.

The stack trace:
#0 __vdso_clock_gettime (clock=4, ts=0x7f9170ffabc0) at arch/x86/vdso/vclock_gettime.c:256
#1 0x00007f928421896d in clock_gettime () from /lib64/libc.so.6
#2 0x00007f917fc5d01e in ?? () from /lib64/libcuda.so.1
#3 0x00007f917fd18fc7 in ?? () from /lib64/libcuda.so.1
#4 0x00007f917fc0531c in ?? () from /lib64/libcuda.so.1
#5 0x00007f917fb4141c in ?? () from /lib64/libcuda.so.1
#6 0x00007f917fca2370 in cuMemFree_v2 () from /lib64/libcuda.so.1
#7 0x00007f9272e1d690 in ?? () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libcudart-1b201d85.so.10.1
#8 0x00007f9272e550c1 in cudaFree () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libcudart-1b201d85.so.10.1
#9 0x00007f92796128c6 in c10::cuda::CUDACachingAllocator::THCCachingAllocator::free_blocks(std::set<c10::cuda::CUDACachingAllocator::(anonymous namespace)::Block*, bool (*)(c10::cuda::CUDACachingAllocator::(anonymous namespace)::Block const*, c10::cuda::CUDACachingAllocator::(anonymous namespace)::Block const*), std::allocator<c10::cuda::CUDACachingAllocator::(anonymous namespace)::Block*> >&, std::_Rb_tree_const_iterator<c10::cuda::CUDACachingAllocator::(anonymous namespace)::Block*>, std::_Rb_tree_const_iterator<c10::cuda::CUDACachingAllocator::(anonymous namespace)::Block*>) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libc10_cuda.so
#10 0x00007f9279615971 in c10::cuda::CUDACachingAllocator::THCCachingAllocator::malloc(void**, unsigned long, CUstream_st*) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libc10_cuda.so
#11 0x00007f9279616e6e in c10::cuda::CUDACachingAllocator::CudaCachingAllocator::allocate(unsigned long) const () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libc10_cuda.so
#12 0x00007f919374c7e9 in at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#13 0x00007f9192199538 in at::CUDAType::(anonymous namespace)::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#14 0x00007f9191c52c28 in torch::autograd::VariableType::(anonymous namespace)::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#15 0x00007f918fba8521 in at::native::to_impl(at::Tensor const&, c10::TensorOptions const&, bool) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#16 0x00007f918fba8ec2 in at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#17 0x00007f918fed1b40 in at::TypeDefault::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#18 0x00007f9191a64873 in torch::autograd::VariableType::(anonymous namespace)::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#19 0x00007f9191d56339 in torch::autograd::CopyBackwards::apply(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> >&&) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#20 0x00007f9191d4dc16 in torch::autograd::Node::operator()(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> >&&) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#21 0x00007f9191d47227 in torch::autograd::Engine::evaluate_function(torch::autograd::NodeTask&) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#22 0x00007f9191d49234 in torch::autograd::Engine::thread_main(torch::autograd::GraphTask*) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch.so
#23 0x00007f927a251d4a in torch::autograd::python::PythonEngine::thread_init(int) () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/lib/libtorch_python.so
#24 0x00007f927ad75ecf in execute_native_thread_routine () from /apsarapangu/disk1/mengli.cml/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so
#25 0x00007f9284de9e25 in start_thread () from /lib64/libpthread.so.0
#26 0x00007f9284202bad in clone () from /lib64/libc.so.6

@mcarilli
Contributor

mcarilli commented Nov 12, 2019

I can't tell whether this has anything to do with Apex, so I would try running without Apex first. If you are doing any form of distributed training, you should prefer torch.nn.parallel.DistributedDataParallel over apex.parallel.DistributedDataParallel. The native Torch DistributedDataParallel is guaranteed to be compatible with whatever version of PyTorch you're using. Aside from some constructor arguments, they are drop-in replacements for each other, as shown here.
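A minimal sketch of the swap described above, assuming a typical single-process-per-GPU launch (e.g. via torchrun); the `local_rank` handling and the `wrap_model` helper are illustrative, not from the original issue:

```python
import torch
import torch.distributed as dist
import torch.nn as nn


def wrap_model(model: nn.Module, local_rank: int) -> nn.Module:
    """Wrap a model with the native DDP.

    The apex wrapper is typically constructed as:
        from apex.parallel import DistributedDataParallel as ApexDDP
        model = ApexDDP(model)
    whereas the native wrapper takes device placement explicitly:
    """
    return nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank
    )


if __name__ == "__main__":
    # Requires an initialized process group, e.g.:
    #   torchrun --nproc_per_node=3 this_script.py
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    model = nn.Linear(8, 8).cuda(local_rank)
    ddp_model = wrap_model(model, local_rank)
```

The rest of the training loop (forward, backward, optimizer step) is unchanged between the two wrappers.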
