brpc + user compute thread pool hangs #299

Closed
ChenChuang opened this issue Jun 10, 2021 · 1 comment

ChenChuang commented Jun 10, 2021

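Describe the bug
We have a compute engine that must be invoked from a dedicated thread pool, so we adopted the following design:

  1. brpc handles sending and receiving messages. In the RPC handler, the message is converted into a compute task and pushed onto a global queue; the handler then waits for the task to complete using bthread::Mutex + bthread::ConditionVariable (see the Wait method in the code below).
  2. The compute thread pool (N pthreads) keeps pulling tasks from the global queue, runs the computation, and then calls the Done method to notify the RPC-handling bthread that is waiting.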
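#include <mutex>
#include <bthread/condition_variable.h>
#include <bthread/mutex.h>

class Task {
 public:
  // Called by a compute-pool pthread once the computation has finished.
  void Done() {
    {
      std::unique_lock<bthread::Mutex> lock(mutex_);
      done_ = true;
    }
    cond_.notify_one();
  }

  // Called by the RPC-handling bthread; blocks until Done() has run.
  void Wait() {
    std::unique_lock<bthread::Mutex> lock(mutex_);
    while (!done_) {
      cond_.wait(lock);
    }
  }

 private:
  bthread::Mutex mutex_;
  bthread::ConditionVariable cond_;
  bool done_ = false;  // guarded by mutex_
};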
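
To make the flow in steps 1 and 2 concrete, here is a minimal, self-contained sketch of the same hand-off. It is not our production code: std::mutex and std::condition_variable stand in for bthread::Mutex and bthread::ConditionVariable so that it builds without brpc, and TaskQueue, RunDemo, input, and result are hypothetical names introduced only for illustration. The "handler" here is simply the calling thread, which submits one task at a time.

```cpp
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Stand-in for Task: std primitives replace the bthread ones (assumption).
class Task {
 public:
  void Done() {
    {
      std::unique_lock<std::mutex> lock(mutex_);
      done_ = true;
    }
    cond_.notify_one();
  }
  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (!done_) cond_.wait(lock);
  }
  int input = 0;   // hypothetical payload
  int result = 0;  // written by the worker before Done()
 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  bool done_ = false;
};

// Hypothetical global queue shared by handlers and compute threads.
class TaskQueue {
 public:
  void Push(std::shared_ptr<Task> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push_back(std::move(task));
    }
    cond_.notify_one();
  }
  // Blocks until a task is available; returns nullptr once closed and drained.
  std::shared_ptr<Task> Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (tasks_.empty() && !closed_) cond_.wait(lock);
    if (tasks_.empty()) return nullptr;
    std::shared_ptr<Task> task = std::move(tasks_.front());
    tasks_.pop_front();
    return task;
  }
  void Close() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closed_ = true;
    }
    cond_.notify_all();
  }
 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  std::deque<std::shared_ptr<Task>> tasks_;
  bool closed_ = false;
};

// Submits num_requests tasks to num_workers compute threads, waiting for each
// one in turn. Returns how many tasks came back with the expected result.
int RunDemo(int num_workers, int num_requests) {
  TaskQueue queue;
  std::vector<std::thread> workers;
  for (int i = 0; i < num_workers; ++i) {
    workers.emplace_back([&queue] {
      while (std::shared_ptr<Task> task = queue.Pop()) {
        task->result = task->input * 2;  // placeholder for the compute engine
        task->Done();
      }
    });
  }
  int completed = 0;
  for (int i = 0; i < num_requests; ++i) {
    std::shared_ptr<Task> task = std::make_shared<Task>();
    task->input = i;
    queue.Push(task);
    task->Wait();  // the step where the RPC handler blocks
    if (task->result == i * 2) ++completed;
  }
  queue.Close();
  for (std::thread& worker : workers) worker.join();
  return completed;
}
```

Calling RunDemo(4, 100) should report 100 completed tasks. The sketch holds each task in a std::shared_ptr so that the worker can still call notify_one() safely even if the waiter has already returned from Wait(); whether the real Task needs the same care depends on how its lifetime is managed.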

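We would like to ask two questions:

  1. Is it reasonable for the same bthread::Mutex/ConditionVariable to be used by pthreads and bthreads at the same time?
  2. We found that the service hangs under high load. Is this related to the way we use them?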

While the service is hung, the following log lines keep scrolling:
[ERROR] [2021-06-10 11:00:25.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096
[ERROR] [2021-06-10 11:00:26.082] [52858#57107] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048
[ERROR] [2021-06-10 11:00:26.103] [52858#57195] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096
[ERROR] [2021-06-10 11:00:27.082] [52858#57152] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048
[ERROR] [2021-06-10 11:00:27.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096
[ERROR] [2021-06-10 11:00:28.082] [52858#57122] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048
[ERROR] [2021-06-10 11:00:28.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096

Typical stack [1] (an RPC-handling thread stuck in the Wait method above):
Thread 20 (Thread 0x7f84ee7fc700 (LWP 57277)):
#0 0x00007f8c40dbe809 in syscall () from /lib64/libc.so.6
#1 0x0000000001268b23 in futex_wait_private (timeout=0x0, expected=0, addr1=0x7f84ee7f5a40) at ./src/bthread/sys_futex.h:42
#2 bthread::wait_pthread (pw=..., ptimeout=ptimeout@entry=0x0) at src/bthread/butex.cpp:142
#3 0x0000000001269abc in butex_wait_from_pthread (abstime=0x0, expected_value=0, b=0x7f84dc801a40, g=) at src/bthread/butex.cpp:589
#4 bthread::butex_wait (arg=0x7f84dc801a40, expected_value=expected_value@entry=0, abstime=abstime@entry=0x0) at src/bthread/butex.cpp:622
#5 0x000000000118910e in bthread_cond_wait (c=0x7f84dc84d590, m=0x7f84dc84d578) at src/bthread/condition_variable.cpp:101
#6 0x0000000000c70310 in bthread::ConditionVariable::wait (this=0x7f84dc84d590, lock=...) at /brpc/include/bthread/condition_variable.h:60
#7 0x0000000000c7034b in common::Task::Wait (this=0x7f84dc84d578) at /src/common/pool/execute_queue.h:39
Python Exception <type 'exceptions.IndexError'> list index out of range:
#8 0x0000000000c6d38f in Searcher::Search (this=0x7f84ee7f5f80, group_candidates=std::map with 0 elements) at /src/retrieve/searcher.cpp:229
#9 0x0000000000c5e6d5 in SearchLogic::Retrieve (this=0x7ffd15ff74f8, request=0x7f84dc84bcc0, response=0x7f84dc84cea0) at /src/retrieve/search_logic.cpp:127
#10 0x0000000000c848c4 in RetrieveServiceImpl::Retrieve (this=0x7ffd15ff74f0, controller=0x7f84dc84ba90, request=0x7f84dc84bcc0, response=0x7f84dc84cea0, done=0x7f84dc84cef0)
at /src/retrieve/service_impl.cpp:16
#11 0x0000000000d5f47d in RetrieveService::CallMethod (this=0x7ffd15ff74f0, method=0x49f9570, controller=0x7f84dc84ba90, request=0x7f84dc84bcc0, response=0x7f84dc84cea0, done=0x7f84dc84cef0)
at /src/proto/retrieve_api.pb.cc:245
#12 0x0000000001323755 in brpc::policy::ProcessRpcRequest (msg_base=) at src/brpc/policy/baidu_rpc_protocol.cpp:499
#13 0x00000000012cb8ba in brpc::ProcessInputMessage (void_arg=) at src/brpc/input_messenger.cpp:136
#14 0x000000000118fb5f in bthread::TaskGroup::task_runner (skip_remained=skip_remained@entry=1) at src/bthread/task_group.cpp:297
#15 0x000000000119001b in bthread::TaskGroup::run_main_task (this=this@entry=0x7f84dc0008c0) at src/bthread/task_group.cpp:158
#16 0x0000000001266536 in bthread::TaskControl::worker_thread (arg=0x49df570) at src/bthread/task_control.cpp:77
#17 0x00007f8c41cd2e25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f8c40dc435d in clone () from /lib64/libc.so.6

Typical stack [2] (a compute thread stuck in the Done method above):
Thread 196 (Thread 0x7f8bb080d700 (LWP 57094)):
#0 0x00007f8c40d8b1bd in nanosleep () from /lib64/libc.so.6
#1 0x00007f8c40dbbed4 in usleep () from /lib64/libc.so.6
#2 0x000000000118e046 in bthread::TaskGroup::ready_to_run_remote (this=0x7f85980008c0, tid=tid@entry=51539635585, nosignal=nosignal@entry=false) at src/bthread/task_group.cpp:675
#3 0x000000000126910a in bthread::butex_wake (arg=) at src/bthread/butex.cpp:287
#4 0x0000000001189071 in bthread_cond_signal (c=) at src/bthread/condition_variable.cpp:69
#5 0x0000000000bf85b8 in bthread::ConditionVariable::notify_one (this=0x7f85dc28f680) at /data/devops/workspace/yt-industry-ai/zeus/p-8ab35777b3814c8e843aa982bee6e16a/third_path/brpc/include/bthread/condition_variable.h:94
#6 0x0000000000bf86e6 in common::Task::Done (this=0x7f85dc28f668, task_ret=0) at /src/common/pool/execute_queue.h:33
#7 0x0000000000c75cb5 in common::ExecuteQueue::ThreadLoop (this=0x4bf4d90, idx=3) at /src/common/pool/execute_queue.h:229
#8 0x0000000000c72608 in common::ExecuteQueue::InitAndStartThreads()::{lambda()#1}::operator()() const (__closure=0x4c2c130)
at /src/common/pool/execute_queue.h:142
#9 0x0000000000c7f8b2 in std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) (this=0x4c2c130) at /usr/include/c++/4.8.2/functional:1732
#10 0x0000000000c7f7bf in std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()>::operator()() (this=0x4c2c130) at /usr/include/c++/4.8.2/functional:1720
#11 0x0000000000c7f61e in std::thread::_Impl<std::_Bind_simple<common::ExecuteQueueyoutu::zeus::SearchTask::InitAndStartThreads()::{lambda()#1} ()> >::_M_run() (this=0x4c2c118) at /usr/include/c++/4.8.2/thread:115
#12 0x00007f8c4165d220 in ?? () from /lib64/libstdc++.so.6
#13 0x00007f8c41cd2e25 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f8c40dc435d in clone () from /lib64/libc.so.6
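
Stack [2] shows notify_one() ending up in usleep() inside ready_to_run_remote, next to the "_remote_rq is full" log line. One way to read this is that the wake-up has to push onto a bounded run queue and retries with a short sleep while the queue is full. The sketch below models only that general pattern; BoundedQueue, PushWithRetry, and DemoFullQueue are hypothetical names, and this is not brpc's actual implementation.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical bounded queue; a simplified model, not brpc's run queue.
class BoundedQueue {
 public:
  explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}
  // Returns false instead of blocking when the queue is full.
  bool TryPush(int id) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (items_.size() >= capacity_) return false;
    items_.push_back(id);
    return true;
  }
  // Returns false when the queue is empty.
  bool TryPop(int* id) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (items_.empty()) return false;
    *id = items_.front();
    items_.pop_front();
    return true;
  }
 private:
  std::mutex mutex_;
  std::deque<int> items_;
  const std::size_t capacity_;
};

// Retries until the push succeeds, sleeping 1 ms between attempts.
// If nothing ever drains the queue, this loop never returns.
int PushWithRetry(BoundedQueue& queue, int id) {
  int retries = 0;
  while (!queue.TryPush(id)) {
    ++retries;
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return retries;
}

// Fills the queue, confirms that one more push is rejected, then lets a
// consumer free a slot after consumer_delay_ms so the retried push can succeed.
// Returns true if the push was rejected while the queue was full.
bool DemoFullQueue(std::size_t capacity, int consumer_delay_ms) {
  BoundedQueue queue(capacity);
  for (std::size_t i = 0; i < capacity; ++i) {
    queue.TryPush(static_cast<int>(i));
  }
  const bool rejected_when_full = !queue.TryPush(-1);
  std::thread consumer([&queue, consumer_delay_ms] {
    std::this_thread::sleep_for(std::chrono::milliseconds(consumer_delay_ms));
    int id = 0;
    queue.TryPop(&id);  // frees exactly one slot
  });
  PushWithRetry(queue, -1);  // returns only after the consumer has popped
  consumer.join();
  return rejected_when_full;
}
```

In this model the producer makes progress only when a consumer drains the queue. If the consumers are themselves blocked, as the threads in stack [1] appear to be, the retry loop would spin indefinitely, which would be consistent with the hang we observe. This is an interpretation of the stacks, not a confirmed diagnosis.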

To Reproduce
May occur under high load.

Expected behavior
Once the load drops, the service should recover automatically instead of staying stuck.

Versions
OS: CentOS 7
Compiler: gcc 4.8.5
brpc: 0.9.6
protobuf: 3.6.1

Additional context/screenshots

@AdiaLoveTrance

Has this been resolved?

This issue was closed.