Caffe stuck waiting on multiple boost::condition_variable in all threads in caffe::BlockingQueue #4904

Closed
JonBoyleCoding opened this issue Oct 26, 2016 · 2 comments

@JonBoyleCoding

When trying to run an AlexNet implementation, I end up in a state where Caffe just hangs. This appears to happen no matter which settings I change.

Initially the symptom sounds the same as the one discussed in issue #3965; however, the solution given there does not apply here because, unlike in that issue, my data layers do have their batch size set.

I compiled Caffe in debug mode to try to determine what was going on, and it appears that all but one of the threads are waiting on a condition variable, either boost or pthread (although the pthread-specific one is coming from libcuda). The Caffe threads all appear to be waiting inside the BlockingQueue in some form. The stack trace from gdb can be found at the bottom.

I have yet to look at the BlockingQueue code in depth, but from the stack traces it appears that a signal is not being sent correctly to wake up threads or that some wake up logic is incorrectly keeping all threads asleep.
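To make the suspected failure mode concrete, here is a minimal sketch of a condition-variable-based blocking queue. It is not Caffe's code: the class `SketchBlockingQueue`, the method `try_pop_for`, and the two helper functions are hypothetical names chosen for illustration, and it uses `std::condition_variable` where Caffe uses `boost::condition_variable`. It only shows that a consumer calling `pop()` sleeps until some producer pushes, so if no producer ever does, the consumer stays in the wait seen in the traces.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical, simplified stand-in for a blocking queue (not Caffe's implementation).
template <typename T>
class SketchBlockingQueue {
 public:
  // Producer side: enqueue an item and wake one waiting consumer.
  void push(T value) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(value));
    }
    not_empty_.notify_one();
  }

  // Consumer side: sleeps until an item is available.
  // If nothing is ever pushed, this never returns.
  T pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return !queue_.empty(); });
    T value = std::move(queue_.front());
    queue_.pop();
    return value;
  }

  // Bounded variant: returns false if no item arrived within the timeout.
  // Useful for turning a silent hang into a detectable condition.
  bool try_pop_for(T* out, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (!not_empty_.wait_for(lock, timeout,
                             [this] { return !queue_.empty(); })) {
      return false;
    }
    *out = std::move(queue_.front());
    queue_.pop();
    return true;
  }

 private:
  std::mutex mutex_;
  std::condition_variable not_empty_;
  std::queue<T> queue_;
};

// With no producer, the bounded pop gives up; an unbounded pop() would hang.
bool consumer_starves_without_producer() {
  SketchBlockingQueue<int> q;
  int value = 0;
  return !q.try_pop_for(&value, std::chrono::milliseconds(50));
}

// With a producer thread, pop() wakes up and returns the pushed item.
int consumer_receives_from_producer() {
  SketchBlockingQueue<int> q;
  std::thread producer([&q] { q.push(42); });
  int value = q.pop();
  producer.join();
  return value;
}
```

In this sketch the wait itself is correct; the hang comes from the producer side never pushing. That is one possible reading of the traces below, not a confirmed diagnosis.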

As far as I can tell, my data layer is formed correctly. However, even if the layer is malformed, the issue of threads freezing is perhaps not desired behaviour.

layer {
  name: "train_data"
  type: "Data"
  top: "data"
  top: "label"
  transform_param {
    crop_size: 227
  }
  data_param {
    source: "/data/Experiments/Classification/2016-10-CaffeTest/train-exemplars-with-labels-lmdb"
    backend: LMDB
    batch_size: 1
  }
#  include { phase: TRAIN }
}

&"thread apply all bt\n"
~"\nThread 7 (Thread 0x7fffd67fc700 (LWP 10809)):\n"
~"#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185\n"
~"#1 0x00007ffff773ce25 in boost::condition_variable::wait (this=0x4804e48, m=...) at /usr/include/boost/thread/pthread/condition_variable.hpp:73\n"
~"#2 0x00007ffff773da02 in caffe::BlockingQueue<caffe::Batch<float>*>::pop (this=0x4804b98, log_on_wait=...) at /data/Programming/Libraries/caffe/src/caffe/util/blocking_queue.cpp:52\n"
~"#3 0x00007ffff781bfee in caffe::BasePrefetchingDataLayer::InternalThreadEntry (this=0x4804670) at /data/Programming/Libraries/caffe/src/caffe/layers/base_data_layer.cpp:86\n"
~"#4 0x00007ffff772dcdd in caffe::InternalThread::entry (this=0x48049a0, device=0, mode=caffe::Caffe::GPU, rand_seed=-2010344607, solver_count=1, root_solver=true) at /data/Programming/Libraries/caffe/src/caffe/internal_thread.cpp:51\n"
~"#5 0x00007ffff7731461 in boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>::operator() (this=0x480b0f8, p=0x48049a0, a1=0, a2=caffe::Caffe::GPU, a3=-2010344607, a4=1, a5=true) at /usr/include/boost/bind/mem_fn_template.hpp:619\n"
~"#6 0x00007ffff773133f in boost::_bi::list6<boost::_bi::valuecaffe::InternalThread_, boost::_bi::value, boost::_bi::valuecaffe::Caffe::Brew, boost::_bi::value, boost::_bi::value, boost::_bi::value >::operator()<boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>, boost::_bi::list0> (this=0x480b108, f=..., a=...) at /usr/include/boost/bind/bind.hpp:596\n"
~"#7 0x00007ffff7731183 in boost::_bi::bind_t<void, boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>, boost::_bi::list6boost::_bi::value<caffe::InternalThread*, boost::_bi::value, boost::_bi::valuecaffe::Caffe::Brew, boost::_bi::value, boost::_bi::value, boost::_bi::value > >::operator() (this=0x480b0f8) at /usr/include/boost/bind/bind_template.hpp:20\n"
~"#8 0x00007ffff773106a in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>, boost::_bi::list6boost::_bi::value<caffe::InternalThread*, boost::_bi::value, boost::_bi::valuecaffe::Caffe::Brew, boost::_bi::value, boost::_bi::value, boost::_bi::value > > >::run (this=0x480af40) at /usr/include/boost/thread/detail/thread.hpp:117\n"
~"#9 0x00007ffff5a6fe7a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.55.0\n"
~"#10 0x00007ffff584e184 in start_thread (arg=0x7fffd67fc700) at pthread_create.c:312\n"
~"#11 0x00007ffff5d7537d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111\n"

~"\nThread 6 (Thread 0x7fffd6ffd700 (LWP 10808)):\n"
~"#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185\n"
~"#1 0x00007ffff773ce25 in boost::condition_variable::wait (this=0x4805388, m=...) at /usr/include/boost/thread/pthread/condition_variable.hpp:73\n"
~"#2 0x00007ffff773e5ba in caffe::BlockingQueue<caffe::Datum*>::pop (this=0x4805290, log_on_wait=...) at /data/Programming/Libraries/caffe/src/caffe/util/blocking_queue.cpp:52\n"
~"#3 0x00007ffff77902a9 in caffe::DataReader::Body::read_one (this=0x4805d60, cursor=0x7ff150003ad0, qp=0x4805290) at /data/Programming/Libraries/caffe/src/caffe/data_reader.cpp:106\n"
~"#4 0x00007ffff779009a in caffe::DataReader::Body::InternalThreadEntry (this=0x4805d60) at /data/Programming/Libraries/caffe/src/caffe/data_reader.cpp:92\n"
~"#5 0x00007ffff772dcdd in caffe::InternalThread::entry (this=0x4805d60, device=0, mode=caffe::Caffe::GPU, rand_seed=1630068778, solver_count=1, root_solver=true) at /data/Programming/Libraries/caffe/src/caffe/internal_thread.cpp:51\n"
~"#6 0x00007ffff7731461 in boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>::operator() (this=0x4806768, p=0x4805d60, a1=0, a2=caffe::Caffe::GPU, a3=1630068778, a4=1, a5=true) at /usr/include/boost/bind/mem_fn_template.hpp:619\n"
~"#7 0x00007ffff773133f in boost::_bi::list6boost::_bi::value<caffe::InternalThread*, boost::_bi::value, boost::_bi::valuecaffe::Caffe::Brew, boost::_bi::value, boost::_bi::value, boost::_bi::value >::operator()<boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>, boost::_bi::list0> (this=0x4806778, f=..., a=...) at /usr/include/boost/bind/bind.hpp:596\n"
~"#8 0x00007ffff7731183 in boost::_bi::bind_t<void, boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>, boost::_bi::list6boost::_bi::value<caffe::InternalThread*, boost::_bi::value, boost::_bi::valuecaffe::Caffe::Brew, boost::_bi::value, boost::_bi::value, boost::_bi::value > >::operator() (this=0x4806768) at /usr/include/boost/bind/bind_template.hpp:20\n"
~"#9 0x00007ffff773106a in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>, boost::_bi::list6boost::_bi::value<caffe::InternalThread*, boost::_bi::value, boost::_bi::valuecaffe::Caffe::Brew, boost::_bi::value, boost::_bi::value, boost::_bi::value > > >::run (this=0x48065b0) at /usr/include/boost/thread/detail/thread.hpp:117\n"
~"#10 0x00007ffff5a6fe7a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.55.0\n"
~"#11 0x00007ffff584e184 in start_thread (arg=0x7fffd6ffd700) at pthread_create.c:312\n"
~"#12 0x00007ffff5d7537d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111\n"

~"\nThread 5 (Thread 0x7fffd77fe700 (LWP 10807)):\n"
~"#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185\n"
~"#1 0x00007ffff773ce25 in boost::condition_variable::wait (this=0x42bfad8, m=...) at /usr/include/boost/thread/pthread/condition_variable.hpp:73\n"
~"#2 0x00007ffff773da02 in caffe::BlockingQueue<caffe::Batch<float>*>::pop (this=0x42c1018, log_on_wait=...) at /data/Programming/Libraries/caffe/src/caffe/util/blocking_queue.cpp:52\n"
~"#3 0x00007ffff781bfee in caffe::BasePrefetchingDataLayer::InternalThreadEntry (this=0x42c0af0) at /data/Programming/Libraries/caffe/src/caffe/layers/base_data_layer.cpp:86\n"
~"#4 0x00007ffff772dcdd in caffe::InternalThread::entry (this=0x42c0e20, device=0, mode=caffe::Caffe::GPU, rand_seed=-1871154542, solver_count=1, root_solver=true) at /data/Programming/Libraries/caffe/src/caffe/internal_thread.cpp:51\n"
~"#5 0x00007ffff7731461 in boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>::operator() (this=0x4804468, p=0x42c0e20, a1=0, a2=caffe::Caffe::GPU, a3=-1871154542, a4=1, a5=true) at /usr/include/boost/bind/mem_fn_template.hpp:619\n"
~"#6 0x00007ffff773133f in boost::_bi::list6<boost::_bi::valuecaffe::InternalThread_, boost::_bi::value, boost::_bi::valuecaffe::Caffe::Brew, boost::_bi::value, boost::_bi::value, boost::_bi::value >::operator()<boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>, boost::_bi::list0> (this=0x4804478, f=..., a=...) at /usr/include/boost/bind/bind.hpp:596\n"
~"#7 0x00007ffff7731183 in boost::_bi::bind_t<void, boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>, boost::_bi::list6boost::_bi::value<caffe::InternalThread*, boost::_bi::value, boost::_bi::valuecaffe::Caffe::Brew, boost::_bi::value, boost::_bi::value, boost::_bi::value > >::operator() (this=0x4804468) at /usr/include/boost/bind/bind_template.hpp:20\n"
~"#8 0x00007ffff773106a in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>, boost::_bi::list6boost::_bi::value<caffe::InternalThread*, boost::_bi::value, boost::_bi::valuecaffe::Caffe::Brew, boost::_bi::value, boost::_bi::value, boost::_bi::value > > >::run (this=0x48042b0) at /usr/include/boost/thread/detail/thread.hpp:117\n"
~"#9 0x00007ffff5a6fe7a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.55.0\n"
~"#10 0x00007ffff584e184 in start_thread (arg=0x7fffd77fe700) at pthread_create.c:312\n"
~"#11 0x00007ffff5d7537d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111\n"

~"\nThread 4 (Thread 0x7fffd7fff700 (LWP 10806)):\n"
~"#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185\n"
~"#1 0x00007ffff773ce25 in boost::condition_variable::wait (this=0x42bfd98, m=...) at /usr/include/boost/thread/pthread/condition_variable.hpp:73\n"
~"#2 0x00007ffff773e5ba in caffe::BlockingQueue<caffe::Datum*>::pop (this=0x42bf4a0, log_on_wait=...) at /data/Programming/Libraries/caffe/src/caffe/util/blocking_queue.cpp:52\n"
~"#3 0x00007ffff77902a9 in caffe::DataReader::Body::read_one (this=0x42c1570, cursor=0x7fffcc003ad0, qp=0x42bf4a0) at /data/Programming/Libraries/caffe/src/caffe/data_reader.cpp:106\n"
~"#4 0x00007ffff779009a in caffe::DataReader::Body::InternalThreadEntry (this=0x42c1570) at /data/Programming/Libraries/caffe/src/caffe/data_reader.cpp:92\n"
~"#5 0x00007ffff772dcdd in caffe::InternalThread::entry (this=0x42c1570, device=0, mode=caffe::Caffe::GPU, rand_seed=2067838403, solver_count=1, root_solver=true) at /data/Programming/Libraries/caffe/src/caffe/internal_thread.cpp:51\n"
~"#6 0x00007ffff7731461 in boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>::operator() (this=0x42c1d68, p=0x42c1570, a1=0, a2=caffe::Caffe::GPU, a3=2067838403, a4=1, a5=true) at /usr/include/boost/bind/mem_fn_template.hpp:619\n"
~"#7 0x00007ffff773133f in boost::_bi::list6boost::_bi::value<caffe::InternalThread*, boost::_bi::value, boost::_bi::valuecaffe::Caffe::Brew, boost::_bi::value, boost::_bi::value, boost::_bi::value >::operator()<boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>, boost::_bi::list0> (this=0x42c1d78, f=..., a=...) at /usr/include/boost/bind/bind.hpp:596\n"
~"#8 0x00007ffff7731183 in boost::_bi::bind_t<void, boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>, boost::_bi::list6boost::_bi::value<caffe::InternalThread*, boost::_bi::value, boost::_bi::valuecaffe::Caffe::Brew, boost::_bi::value, boost::_bi::value, boost::_bi::value > >::operator() (this=0x42c1d68) at /usr/include/boost/bind/bind_template.hpp:20\n"
~"#9 0x00007ffff773106a in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf5<void, caffe::InternalThread, int, caffe::Caffe::Brew, int, int, bool>, boost::_bi::list6boost::_bi::value<caffe::InternalThread*, boost::_bi::value, boost::_bi::valuecaffe::Caffe::Brew, boost::_bi::value, boost::_bi::value, boost::_bi::value > > >::run (this=0x42c1bb0) at /usr/include/boost/thread/detail/thread.hpp:117\n"
~"#10 0x00007ffff5a6fe7a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.55.0\n"
~"#11 0x00007ffff584e184 in start_thread (arg=0x7fffd7fff700) at pthread_create.c:312\n"
~"#12 0x00007ffff5d7537d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111\n"

~"\nThread 3 (Thread 0x7fffddab4700 (LWP 10805)):\n"
~"#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185\n"
~"#1 0x00007fffdf730a3d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1\n"
~"#2 0x00007fffdf03bfcc in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1\n"
~"#3 0x00007fffdf730288 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1\n"
~"#4 0x00007ffff584e184 in start_thread (arg=0x7fffddab4700) at pthread_create.c:312\n"
~"#5 0x00007ffff5d7537d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111\n"

~"\nThread 2 (Thread 0x7fffdeeb5700 (LWP 10804)):\n"
~"#0 0x00007ffff5d67fdd in poll () at ../sysdeps/unix/syscall-template.S:81\n"
~"#1 0x00007fffdf72f93b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1\n"
~"#2 0x00007fffdf0f5651 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1\n"
~"#3 0x00007fffdf730288 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1\n"
~"#4 0x00007ffff584e184 in start_thread (arg=0x7fffdeeb5700) at pthread_create.c:312\n"
~"#5 0x00007ffff5d7537d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111\n"

~"\nThread 1 (Thread 0x7ffff7f9e9c0 (LWP 10800)):\n"
~"#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185\n"
~"#1 0x00007ffff773ce25 in boost::condition_variable::wait (this=0x42bef78, m=...) at /usr/include/boost/thread/pthread/condition_variable.hpp:73\n"
~"#2 0x00007ffff773e722 in caffe::BlockingQueue<caffe::Datum*>::peek (this=0x42be9b0) at /data/Programming/Libraries/caffe/src/caffe/util/blocking_queue.cpp:77\n"
~"#3 0x00007ffff7856e11 in caffe::DataLayer::DataLayerSetUp (this=0xbc145d0, bottom=..., top=...) at /data/Programming/Libraries/caffe/src/caffe/layers/data_layer.cpp:30\n"
~"#4 0x00007ffff781b675 in caffe::BaseDataLayer::LayerSetUp (this=0xbc145d0, bottom=..., top=...) at /data/Programming/Libraries/caffe/src/caffe/layers/base_data_layer.cpp:32\n"
~"#5 0x00007ffff781bac2 in caffe::BasePrefetchingDataLayer::LayerSetUp (this=0xbc145d0, bottom=..., top=...) at /data/Programming/Libraries/caffe/src/caffe/layers/base_data_layer.cpp:48\n"
~"#6 0x00007ffff77b1872 in caffe::Layer::SetUp (this=0xbc145d0, bottom=..., top=...) at /data/Programming/Libraries/caffe/include/caffe/layer.hpp:71\n"
~"#7 0x00007ffff788d141 in caffe::Net::Init (this=0x42c2170, in_param=...) at /data/Programming/Libraries/caffe/src/caffe/net.cpp:148\n"
~"#8 0x00007ffff788b453 in caffe::Net::Net (this=0x42c2170, param=..., root_net=0x0) at /data/Programming/Libraries/caffe/src/caffe/net.cpp:27\n"
~"#9 0x00007ffff78807e1 in caffe::Solver::InitTestNets (this=0x73b630) at /data/Programming/Libraries/caffe/src/caffe/solver.cpp:184\n"
~"#10 0x00007ffff787f336 in caffe::Solver::Init (this=0x73b630, param=...) at /data/Programming/Libraries/caffe/src/caffe/solver.cpp:59\n"
~"#11 0x00007ffff787edba in caffe::Solver::Solver (this=0x73b630, param=..., root_solver=0x0) at /data/Programming/Libraries/caffe/src/caffe/solver.cpp:32\n"
~"#12 0x00007ffff7870db5 in caffe::SGDSolver::SGDSolver (this=0x73b630, param=...) at /data/Programming/Libraries/caffe/include/caffe/sgd_solvers.hpp:19\n"
~"#13 0x00007ffff787a6d9 in caffe::Creator_SGDSolver (param=...) at /data/Programming/Libraries/caffe/src/caffe/solvers/sgd_solver.cpp:350\n"
~"#14 0x000000000042ddc5 in caffe::SolverRegistry::CreateSolver (param=...) at /data/Programming/Libraries/caffe/include/caffe/solver_factory.hpp:78\n"
~"#15 0x00000000004290e2 in train () at /data/Programming/Libraries/caffe/tools/caffe.cpp:236\n"
~"#16 0x000000000042b639 in main (argc=2, argv=0x7fffffffe2b0) at /data/Programming/Libraries/caffe/tools/caffe.cpp:443\n"

@thatguymike
Contributor

This looks symptomatic of your LMDB being locked. Are you loading the same DB for both the training and test sets? Does something else have it locked? You can try recompiling Caffe with read-only DB support; see Makefile.config.

@shelhamer
Member

This should be fixed by the above advice or by trying the newer multiprocess parallelism in #4563.
