linkerd-tcp process hangs in mutex lock after long load test. #14

Open
stevej opened this issue Mar 30, 2017 · 2 comments
stevej commented Mar 30, 2017

I set up a long-running load test against linkerd-tcp using slow_cooker: 3 nginx backends serving static content, driven by 3 slow_cooker load testers with the following command lines:

slow_cooker -host "perf-cluster" -qps 100 -concurrency 100 -interval 1m -reportLatenciesCSV linkerd-tcp.csv http://proxy-test-4d:7474
slow_cooker -host "perf-cluster" -qps 1 -concurrency 100 -interval 1s -hashValue 9745850253931700627 -hashSampleRate 0.1 http://proxy-test-4d:7474
slow_cooker -host perf-cluster -qps 1 -concurrency 100 -interval 1m -hashValue 9745850253931700627 -hashSampleRate 0.1 -compress http://proxy-test-4d:7474

So this is about 10,200 qps in total (10k from the first slow_cooker plus 100 each from the other two), with the second two slow_cookers also checking for invalid response bodies.

After about 20 hours, linkerd-tcp stopped accepting new connections and stopped processing requests from existing connections. slow_cooker was reporting the following errors:

Get http://proxy-test-4d:7474: read tcp 10.240.0.4:58016->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57899->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57902->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57906->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57853->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57912->10.240.0.21:7474: read: connection reset by peer
2017-03-30T16:51:02Z      0/0/55 6000   0% 1m0s   0 [  0   0   0    0 ]    0      0
Get http://proxy-test-4d:7474: dial tcp 10.240.0.21:7474: getsockopt: connection timed out
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57907->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57908->10.240.0.21:7474: read: connection reset by peer

Interestingly, the admin port was still processing requests, so I set up a watch to look at its output every few seconds (a sketch of the watch command follows the report below). Here's an example report:

$ curl http://localhost:9989/metrics
metrics_count 12
success_count{proxy="default"} 79470054
connects{proxy="default", srv="0.0.0.0:7474"} 79470070
bytes_total{direction="tx", proxy="default"} 2382121311667
namerd_success_count{service="namerd"} 61943
conns_active{proxy="default"} 5
endpoints_retired{proxy="default"} 0
endpoints_ready{proxy="default"} 0
conns_established{proxy="default"} 0
conns_pending{proxy="default"} 0
endpoints_unready{proxy="default"} 3
poll_time_us{stat="count", proxy="default"} 6
poll_time_us{stat="mean", proxy="default"} 17.833333333333332
poll_time_us{stat="min", proxy="default"} 6
poll_time_us{stat="max", proxy="default"} 44
poll_time_us{stat="stddev", proxy="default"} 14.217555658019732
poll_time_us{stat="p50", proxy="default"} 6
poll_time_us{stat="p90", proxy="default"} 6
poll_time_us{stat="p95", proxy="default"} 6
poll_time_us{stat="p99", proxy="default"} 6
poll_time_us{stat="p999", proxy="default"} 6
poll_time_us{stat="p9999", proxy="default"} 6
namerd_request_latency_ms{stat="count", service="namerd"} 2
namerd_request_latency_ms{stat="mean", service="namerd"} 2
namerd_request_latency_ms{stat="min", service="namerd"} 2
namerd_request_latency_ms{stat="max", service="namerd"} 2
namerd_request_latency_ms{stat="stddev", service="namerd"} 0
namerd_request_latency_ms{stat="p50", service="namerd"} 2
namerd_request_latency_ms{stat="p90", service="namerd"} 2
namerd_request_latency_ms{stat="p95", service="namerd"} 2
namerd_request_latency_ms{stat="p99", service="namerd"} 2
namerd_request_latency_ms{stat="p999", service="namerd"} 2
namerd_request_latency_ms{stat="p9999", service="namerd"} 2

The process had a few more threads than I remembered from the previous day:

stevej   13505 79.0  0.8  71776 33100 pts/2    Sl+  Mar27 4006:56 ./linkerd-tcp-1490585634 example.yml
stevej   13505 56.2  0.8  71776 33100 pts/2    Sl+  Mar27 2854:06 ./linkerd-tcp-1490585634 example.yml
stevej   13505  0.0  0.8  71776 33100 pts/2    Sl+  Mar27   0:01 ./linkerd-tcp-1490585634 example.yml
stevej   13505  0.0  0.8  71776 33100 pts/2    Sl+  Mar27   0:01 ./linkerd-tcp-1490585634 example.yml
stevej   13505  0.0  0.8  71776 33100 pts/2    Sl+  Mar27   0:01 ./linkerd-tcp-1490585634 example.yml
stevej   13505  0.0  0.8  71776 33100 pts/2    Sl+  Mar27   0:01 ./linkerd-tcp-1490585634 example.yml
stevej   13505  0.0  0.8  71776 33100 pts/2    Sl+  Mar27   0:07 ./linkerd-tcp-1490585634 example.yml
stevej   13505  0.0  0.8  71776 33100 pts/2    Sl+  Mar27   0:03 ./linkerd-tcp-1490585634 example.yml
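
The per-thread listing above came from ps; something along these lines should reproduce it, though my exact invocation may have differed (the H option shows each thread on its own row):

ps uH -p 13505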

Three of them were blocked on the same mutex. Here's some output from rust-gdb. Unfortunately, the process segfaulted before I could capture backtraces from all the threads, but I do have some output I can share:

Attaching to process 13505
[New LWP 13506]
[New LWP 13507]
[New LWP 13508]
[New LWP 13509]
[New LWP 13510]
[New LWP 13511]
[New LWP 13512]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f0ee2f716a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.

(gdb) thread 3
[Switching to thread 3 (Thread 0x7f0ee1fff700 (LWP 13507))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135     ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f0ee3450ebd in __GI___pthread_mutex_lock (mutex=0x7f0ee200e150)
    at ../nptl/pthread_mutex_lock.c:80
#2  0x000056016c9a0389 in std::panicking::try::do_call::hb458ba233f0b55f7 ()
#3  0x000056016c9c2f7b in panic_unwind::__rust_maybe_catch_panic ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libpanic_unwind/lib.rs:98
#4  0x000056016c9a4893 in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::hc7b55c280fa43c1f ()
#5  0x000056016c9bad95 in alloc::boxed::{{impl}}::call_once<(),()> ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/liballoc/boxed.rs:624
#6  std::sys_common::thread::start_thread ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys_common/thread.rs:21
#7  std::sys::imp::thread::{{impl}}::new::thread_start ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/thread.rs:84
#8  0x00007f0ee344e6ca in start_thread (arg=0x7f0ee1fff700) at pthread_create.c:333
#9  0x00007f0ee2f710af in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
(gdb) thread 5
[Switching to thread 5 (Thread 0x7f0ee1bfd700 (LWP 13509))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135     ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f0ee3450ebd in __GI___pthread_mutex_lock (mutex=0x7f0ee200e150)
    at ../nptl/pthread_mutex_lock.c:80
#2  0x000056016c9a0389 in std::panicking::try::do_call::hb458ba233f0b55f7 ()
#3  0x000056016c9c2f7b in panic_unwind::__rust_maybe_catch_panic ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libpanic_unwind/lib.rs:98
#4  0x000056016c9a4893 in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::hc7b55c280fa43c1f ()
#5  0x000056016c9bad95 in alloc::boxed::{{impl}}::call_once<(),()> ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/liballoc/boxed.rs:624
#6  std::sys_common::thread::start_thread ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys_common/thread.rs:21
#7  std::sys::imp::thread::{{impl}}::new::thread_start ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/thread.rs:84
#8  0x00007f0ee344e6ca in start_thread (arg=0x7f0ee1bfd700) at pthread_create.c:333
#9  0x00007f0ee2f710af in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
(gdb) thread 6
[Switching to thread 6 (Thread 0x7f0ee19fc700 (LWP 13510))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135     in ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f0ee3450ebd in __GI___pthread_mutex_lock (mutex=0x7f0ee200e150)
    at ../nptl/pthread_mutex_lock.c:80
#2  0x000056016c9a0389 in std::panicking::try::do_call::hb458ba233f0b55f7 ()
#3  0x000056016c9c2f7b in panic_unwind::__rust_maybe_catch_panic ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libpanic_unwind/lib.rs:98
#4  0x000056016c9a4893 in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::hc7b55c280fa43c1f ()
#5  0x000056016c9bad95 in alloc::boxed::{{impl}}::call_once<(),()> ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/liballoc/boxed.rs:624
#6  std::sys_common::thread::start_thread ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys_common/thread.rs:21
#7  std::sys::imp::thread::{{impl}}::new::thread_start ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/thread.rs:84
#8  0x00007f0ee344e6ca in start_thread (arg=0x7f0ee19fc700) at pthread_create.c:333
#9  0x00007f0ee2f710af in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105

Some netstat output. I noticed that the Recv-Q for the listener has bytes in it, which is unexpected:

sudo netstat -anp |head -n 2 && sudo netstat -anp |grep linkerd
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp      129      0 0.0.0.0:7474            0.0.0.0:*               LISTEN      13505/./linkerd-tcp
tcp        0      0 127.0.0.1:9989          0.0.0.0:*               LISTEN      13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48796        CLOSE_WAIT  13505/./linkerd-tcp
tcp        0      0 10.240.0.21:7474        10.240.0.14:60982       ESTABLISHED 13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48788        CLOSE_WAIT  13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48790        CLOSE_WAIT  13505/./linkerd-tcp
tcp        0      0 10.240.0.21:7474        10.240.0.14:34318       ESTABLISHED 13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48791        CLOSE_WAIT  13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48793        CLOSE_WAIT  13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48800        CLOSE_WAIT  13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48795        CLOSE_WAIT  13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48785        CLOSE_WAIT  13505/./linkerd-tcp
tcp        0      0 127.0.0.1:48284         127.0.0.1:4180          ESTABLISHED 13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48798        CLOSE_WAIT  13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48789        CLOSE_WAIT  13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48799        CLOSE_WAIT  13505/./linkerd-tcp
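
As I understand it, on Linux the Recv-Q of a socket in LISTEN state counts connections sitting in the accept queue, so a non-zero value on the listener is consistent with the process no longer calling accept. A quick way to keep an eye on that number next time (ss filter syntax, same box; a sketch rather than what I actually ran):

ss -ltn 'sport = :7474'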

Under ltrace there was still movement on the process, and something suspicious stood out:

read(4, 0x7fffd9e9dfb0, 128) = -1 EAGAIN (Resource temporarily unavailable)

lsof showed that FD 4 is a pipe:

linkerd-t 13505 stevej 4r FIFO 0,10 0t0 89104520 pipe

Unfortunately, the process segfaulted after more ltracing, and most of my output was lost to a small tmux scrollback buffer.

Here's the segfault I did capture. Note that the activity here is the admin port writing out Prometheus stats in response to my curl request in a watch loop:

memcpy(0xe25fdc40, "\240\241\242\243\244\245\246\247\250\251\252\253", 12) = 0xe25fdc40
memcpy(0x7fffd9e9d192, "\377", 1)                                  = 0x7fffd9e9d192
memcpy(0xe25fdc4c, "\347", 1)                                      = 0xe25fdc4c
memcpy(0x7fffd9e9d193, "default", 7)                               = 0x7fffd9e9d193
memcpy(0xe25fdc4d, "\220\221\222\223\224", 5 <unfinished ...>
memcpy(0x7fffd9e9d19a, "\377", 1 <unfinished ...>
<... memcpy resumed> )                                             = 0xe25fdc4d
<... memcpy resumed> )                                             = 0x7fffd9e9d19a
memcpy(0xe25fdc52, "\347", 1 <unfinished ...>
memcpy(0x7fffd9e9d180, "conns_active", 12 <unfinished ...>
<... memcpy resumed> )                                             = 0xe25fdc52
<... memcpy resumed> )                                             = 0x7fffd9e9d180
memcpy(0xe25fdc53, "\230\231\232\233\234\235\236", 7 <unfinished ...>
memcpy(0x7fffd9e9d18c, "\377", 1 <unfinished ...>
<... memcpy resumed> )                                             = 0xe25fdc53
<... memcpy resumed> )                                             = 0x7fffd9e9d18c
memcpy(0xe25fdc5a, "\347", 1 <unfinished ...>
memcpy(0x7fffd9e9d18d, "proxy", 5 <unfinished ...>
<... memcpy resumed> )                                             = 0xe25fdc5a
<... memcpy resumed> )                                             = 0x7fffd9e9d18d
memcpy(0xe25fdc5b, "\b\t\n\v\f", 5 <unfinished ...>
memcpy(0x7fffd9e9d192, "\377", 1 <unfinished ...>
<... memcpy resumed> )                                             = 0xe25fdc5b
<... memcpy resumed> )                                             = 0x7fffd9e9d192
memcpy(0xe25fdc40, "\r\016\017", 3 <unfinished ...>
memcpy(0x7fffd9e9d193, "default", 7 <unfinished ...>
<... memcpy resumed> )                                             = 0xe25fdc40
<... memcpy resumed> )                                             = 0x7fffd9e9d193
memcpy(0xe25fdc43, "\020\021\022\023\024\025\026\027", 8)          = 0xe25fdc43
unexpected breakpoint at 0x56016c897de8
log2(8, 31, 0xe2121000, 1 <no return ...>
--- SIGSEGV (Segmentation fault) ---
<... log2 resumed> )                                               = 2
sigaction(SIGSEGV, { 0, <>, 0, 0 } <unfinished ...>
floor(0, 0xfffffc01, 0, 0)                                         = 2
pthread_mutex_lock(0xe2912288, 0xe2912280, 52, 0)                  = 0
<... sigaction resumed> , nil)                                     = 0
pthread_mutex_unlock(0xe2912288, 0xe20058e8, 0xe2914080, 0x6da8)   = 0
pthread_mutex_lock(0xe2912288, 0xe214bd00, 0xa0000, 0xe2912288)    = 0
--- SIGSEGV (Segmentation fault) ---
enable_breakpoint pid=13506, addr=0x56016c9ceca2, symbol=(null): No such process
pthread_mutex_unlock(Segmentation fault (core dumped)

I will put together a tool to gather full output from rust-gdb, lsof, and some strace and ltrace samples for the next time this pops up.
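
A rough sketch of what that tool might look like (untested; the output paths and 30-second sample durations are placeholders):

#!/bin/bash
# capture-debug.sh <pid>: snapshot state from a wedged linkerd-tcp process.
set -eu
PID="$1"
OUT="debug-$PID-$(date +%s)"
mkdir -p "$OUT"

# Non-interactive backtraces from every thread.
rust-gdb -p "$PID" -batch -ex 'set pagination off' -ex 'thread apply all bt' > "$OUT/gdb-backtraces.txt" 2>&1

# Open file descriptors and socket state.
sudo lsof -p "$PID" > "$OUT/lsof.txt"
sudo netstat -anp | grep "$PID" > "$OUT/netstat.txt" || true

# Short syscall and library-call samples.
sudo timeout 30 strace -f -p "$PID" -o "$OUT/strace.txt" || true
sudo timeout 30 ltrace -f -p "$PID" -o "$OUT/ltrace.txt" || true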

stevej self-assigned this Mar 30, 2017

stevej commented Mar 30, 2017

Looking at a healthy linkerd-tcp, I also see 3 threads all waiting on the same mutex, so that must not be what is causing this issue.


stevej commented Mar 30, 2017

Attached is a txt file of rust-gdb output showing backtraces of the threads on a healthy linkerd-tcp.

healthy_linkerd_gdb.txt
