
Streams should check for Session shutdown when waiting for data & clean up timers #127

Merged

Conversation

lattwood
Contributor

@lattwood lattwood commented Jun 13, 2024

Stream.Close is called when the Session's keepalive fails. This call wakes up Stream.Read and Stream.Write waiters. The wakeup is via a non-blocking send to recvNotifyCh/sendNotifyCh, which are unbuffered channels.

In Stream.Read this send can be attempted between the call to s.stateLock.Unlock() and the select statement at the bottom of the method. When this occurs, if deadlines haven't been set on the Stream in question, the call to Stream.Read will block forever.

By adding Session.shutdownCh to the channels in the select statement, this coordination gap is addressed.

This uses `goto START` instead of `return 0, io.EOF`, as returning immediately could, in the Read case, leave a final read's worth of data sitting unread in the buffer.
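A minimal sketch of the shape of the change (simplified, hypothetical types; not the actual yamux source or the exact diff): the wait at the bottom of Read gains a case on the Session's shutdown channel, and both wakeups loop back to START so buffered data is still returned before EOF.

```go
package sketch

import (
	"bytes"
	"errors"
	"io"
	"sync"
	"time"
)

// ErrTimeout mirrors the error a read deadline would produce in this sketch.
var ErrTimeout = errors.New("i/o deadline reached")

// stream is a deliberately simplified stand-in for yamux's Stream.
type stream struct {
	recvLock     sync.Mutex
	recvBuf      bytes.Buffer
	recvNotifyCh chan struct{} // signalled when new data is received
	shutdownCh   chan struct{} // the Session's shutdown channel
	readDeadline time.Time
}

func (s *stream) Read(b []byte) (int, error) {
START:
	// Return any buffered data first, so a final read is never lost,
	// even if the session has already shut down.
	s.recvLock.Lock()
	if s.recvBuf.Len() > 0 {
		n, _ := s.recvBuf.Read(b)
		s.recvLock.Unlock()
		return n, nil
	}
	s.recvLock.Unlock()

	// Only once the buffer is drained does shutdown turn into EOF.
	select {
	case <-s.shutdownCh:
		return 0, io.EOF
	default:
	}

	var timeout <-chan time.Time
	if !s.readDeadline.IsZero() {
		timeout = time.After(time.Until(s.readDeadline))
	}
	select {
	case <-s.recvNotifyCh:
		goto START
	case <-s.shutdownCh:
		goto START // the new case: wake up, re-check the buffer, then EOF
	case <-timeout:
		return 0, ErrTimeout
	}
}
```

The `goto` keeps the EOF decision in one place: shutdown only turns into io.EOF once the receive buffer has been fully drained.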

Should fix hashicorp/nomad#23305.
It will definitely fix a bug I haven't filed yet, where allocations get stuck in the pending state until the Nomad Agent is restarted.


The issue was discovered by sending a "broken" Nomad Agent a SIGQUIT and seeing that the goroutine responsible for fetching allocation updates from the Nomad Servers had been blocked in the select statement at the bottom of Stream.Read for over 24 hours (or, more specifically, for as long as the allocation had been in the pending state), with a lot of msgpack and jobspec-field frames in the middle of the stack trace.

@lattwood
Contributor Author

lattwood commented Jun 13, 2024

I need to update this PR: I just realized the goto implementation would protect against missing a final read in the happy path on shutdown.

Done

Member

@schmichael schmichael left a comment


Thank you for your excellent PR and debugging work!

(Review comment on stream.go: resolved)
@lattwood
Contributor Author

lattwood commented Jun 17, 2024

One thing that does alarm me is how Stream.recvNotifyCh is used.

It's only ever blocked on in Read. The only place that should need to unblock it is in Stream.recvData, called by Session.handleStreamMessage when it's de-multiplexing the connection into individual streams.

But it's also used for...

  • after SetReadDeadline is called (mostly makes sense, this allows for changing the timeout settings while something is calling Stream.Read)
  • asynchronously after Stream.Read returns, through defer asyncNotify(s.recvNotifyCh)
    • yamux Streams are not threadsafe, so there is no reason that should ever wake anything up, because there should only be one call to Read at a time anyway (and I don't think Go guarantees that the first goroutine to block on a channel is the first one to get a message)
    • this is the one that makes me think there are more synchronization bugs in the code :(
  • inside notifyWaiting, which also does non-blocking sends to other synchronization channels.
    • used as a catch-all around connection state changes.
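For reference, the notify helpers being discussed look roughly like this (paraphrased; names and exact receivers may differ from the real stream.go/util.go):

```go
package sketch

// asyncNotify does a non-blocking send: if the capacity-1 channel already holds
// a pending notification, the send is simply dropped.
func asyncNotify(ch chan struct{}) {
	select {
	case ch <- struct{}{}:
	default:
	}
}

// notifyWaiting is the catch-all used around stream state changes: it pokes both
// the read-side and write-side waiters.
func notifyWaiting(recvNotifyCh, sendNotifyCh chan struct{}) {
	asyncNotify(recvNotifyCh)
	asyncNotify(sendNotifyCh)
}
```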

@lattwood lattwood force-pushed the lattwood/handle_shutdown_edgecase branch from 9aa57cd to 5cd99ad on June 17, 2024
@lattwood lattwood changed the title from "Streams should check for Session shutdown when waiting for data" to "Streams should check for Session shutdown when waiting for data & clean up timers" on Jun 17, 2024
@lattwood
Contributor Author

PR feedback addressed, I think.

@lattwood
Contributor Author

https://github.com/lattwood/nomad/releases/tag/v1.7.8-yamux-fork

For anyone following along at home, there's a zip with a Linux amd64 Nomad build with this fix.

If you're paranoid, I recommend just building it yourself, however.

@lattwood
Contributor Author

lattwood commented Jun 18, 2024

So all hell broke loose while rolling out the custom Nomad binary.

Now that things have stabilized, going to see if this issue continues to appear.

@lattwood
Contributor Author

lattwood commented Jun 20, 2024

All hell broke loose due to a provider network outage.

After 24 hours with this custom fork of Nomad deployed to both a ~100 node cluster, and a ~2300 node cluster, we do not have any stuck/pending allocs.

Confirmed, this fixes hashicorp/nomad#23305.

@lattwood
Contributor Author

*confirmed, it improves the bug.

still happening, sadly, and the output of killall -SIGQUIT nomad points to a hung Read on the yamux tunnel again.

When I'm back from vacation there's a good chance I'm going to just shoehorn a timer of some kind into yamux's Read method to get it out of that select statement at the bottom of the method.

Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"yamux: keepalive failed: i/o deadline reached","@module":"client","@timestamp":"2024-06-26T17:17:21.287415Z"}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"yamux: Failed to read stream data: read tcp [IPv6]:4647: use of closed network connection","@module":"client","@timestamp":"2024-06-26T17:17:21.288316Z"}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"warn","@message":"yamux: failed to send go away: session shutdown","@module":"client","@timestamp":"2024-06-26T17:17:21.301811Z"}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2024-06-26T17:17:21.322083Z","error":"rpc error: EOF","rpc":"Node.Register","server":{"IP":"2a05:IPV6","Port":4647,"Zone":""}}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"error performing RPC to server which is not safe to automatically retry","@module":"client.rpc","@timestamp":"2024-06-26T17:17:21.322195Z","error":"rpc error: EOF","rpc":"Node.Register","server":{"IP":"2a05:IPV6","Po>
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"error registering","@module":"client","@timestamp":"2024-06-26T17:17:21.322214Z","error":"rpc error: EOF"}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2024-06-26T17:17:21.322165Z","error":"rpc error: EOF","rpc":"Alloc.GetAllocs","server":{"IP":"2a05:IPv6","Port":4647,"Zone":""}}

Then the node gets marked as Down in Nomad, but the live CPU utilization stuff continues to get reported to the UI.

@schmichael
Member

Thanks for the update @lattwood! I just wanted to let you know we still intend to merge this PR after some more internal validation across teams since many HashiCorp products rely on yamux. We're also planning on at least improving the archaic infrastructure to pass go test -race and go vet.

Lots of vacations this time of year for us as well, so it will probably be a couple weeks before we're able to merge this.

@schmichael
Member

schmichael commented Aug 2, 2024

Sorry for the delay! I've finally been testing this patch. Playing around for a while with tc and nodesim I haven't discovered any problems with the patch, but unfortunately I haven't reproduced the original Nomad issue either. I'm not too surprised I haven't reproduced this as there are a lot of variables at play and reproducing production network/compute conditions to cause race conditions is extremely challenging.

The fact that this patch seems to have solved your issue makes me lean toward merging this even if we can't come to a satisfactory explanation of why.

> The wakeup is via a non-blocking send to recvNotifyCh/sendNotifyCh, which are **unbuffered** channels.

(Emphasis mine) These are buffered channels: https://github.com/hashicorp/yamux/blob/v0.1.1/stream.go#L75-L76

I don't see a way for Stream.Read to miss a send on recvNotifyCh as there's only one receive: https://github.com/hashicorp/yamux/blob/v0.1.1/stream.go#L146
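A tiny standalone Go demo (not yamux code) of why that send can't be lost: a non-blocking send into a capacity-1 channel is retained for a receiver that only starts waiting later.

```go
package main

import "fmt"

func main() {
	notifyCh := make(chan struct{}, 1)

	// Non-blocking send (the asyncNotify pattern); a no-op if a
	// notification is already pending.
	select {
	case notifyCh <- struct{}{}:
	default:
	}

	// Nobody was receiving at send time, yet this does not block:
	// the notification is sitting in the buffer.
	<-notifyCh
	fmt.Println("notification delivered")
}
```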

I know it's been a long time, but do you still have a goroutine dump handy? Is it possible another goroutine was blocked in session.go for a similar amount of time? It seems likely not, but perhaps there's a way Session.Close() can get deadlocked between close(s.shutdownCh) (which your PR would now check and avoid the deadlock) and stream.forceClose() (which should always send on recvNotifyCh in a way that unblocks Stream.Read()).

I tried to write a test to exercise the case where recvNotifyCh is insufficient to notice disconnects, but I can't get it to fail on v0.1.1 as I would expect. Any chance you could take a look and see if you can find a way to make it fail? https://gist.github.com/schmichael/eae76f127aa31475fde152a47ef4a38b

I sprinkled some fmt.Println()s into Stream.Read() as well to try to ensure the Read was waiting before I closed the Session, but I might be missing something.

@schmichael
Member

Argh sorry that #130 caused a merge conflict. Tests now pass with the race detector enabled, and we finally enabled GHA CI!

We've been testing your patch more internally, so I wanted to leave a quick update:

  1. My test above tried to exercise the bug by explicitly calling Session.Close, but @rboyer pointed out a missed yamux heartbeat is probably key to exercising the bug.
  2. @rboyer has observed a (the?) bug in his own test, but not consistently yet. So far it appears this patch fixes it!
  3. #128 (Concurrent calls to Stream.Read can return different data) may be relevant here. If Nomad is doing that, then that's a bug in Nomad we need to fix. I intend to dig into that a bit. That being said, even if there's a bug due to yamux misuse in the Nomad code base, I'm not opposed to merging a patch in yamux that makes the misuse less pathological.

I'm hoping to get this all finished up this week so it makes it into the next Nomad patch release.

Copy link
Member

@schmichael schmichael left a comment


After a lot of attempts I have been unable to cause the bug this appears to fix.

In theory if io.ReadFull blocked indefinitely that could prevent Session.Close() from closing each Stream. This would leave the yamux Session in the precise state being seen here: Session.shutdownCh is closed, but individual streams have not been notified.

If that case is possible, this patch fixes it. Unfortunately the only way I was able to replicate that case was when disabling client write timeouts (as well as carefully blocking server writes to deadlock keepalive handling) which is not the case in the Nomad environment in the original bug report: hashicorp/nomad#23305
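A self-contained toy model of that state, assuming the ordering described above (shutdown channel closed, then Close gets stuck before the per-stream notification); this is illustrative only, not the actual yamux source:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	shutdownCh := make(chan struct{})      // session-wide; closed early in Close()
	recvDoneCh := make(chan struct{})      // closed when the recv loop exits (never, here)
	recvNotifyCh := make(chan struct{}, 1) // per-stream notify, sent by forceClose()

	go func() { // toy Close()
		close(shutdownCh)
		<-recvDoneCh               // recv loop stuck in its read, so we never get past this...
		recvNotifyCh <- struct{}{} // ...and the stream is never notified
	}()

	// A reader waiting only on recvNotifyCh would hang here forever.
	// Adding shutdownCh to the select (this PR) lets it wake up.
	select {
	case <-recvNotifyCh:
		fmt.Println("woken by stream notification")
	case <-shutdownCh:
		fmt.Println("woken by session shutdown") // this is what prints
	case <-time.After(3 * time.Second):
		fmt.Println("stuck: the pre-patch behaviour")
	}
}
```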

Goroutine dumps observing this bug would be extremely useful and may be reported privately to: nomad-oss-debug@hashicorp.com

After extensive testing I have found no downside to accepting this patch and intend to merge it. Leaving open a few more days to let other folks have a chance to review.

See #131 for some of the changes I made to the test suite while testing this. Note that those changes are not exhaustive: I also did a lot of manual testing directly with yamux as well as via Nomad. At no point could I exercise either a bug or a problem with this PR.

Thanks @lattwood, and I'm very sorry for the long delay!

@lattwood
Contributor Author

@schmichael so, without revealing too much: what we're doing with Nomad regularly sees our hosting providers firewall off traffic between the Nomad Masters and the Nomad Agents for upwards of 30 minutes at a time. Agents can still initiate TCP connections to the Masters when this happens, but all traffic from the Masters to the Agents is blocked. So the client SYN makes it to the server, but the server SYN-ACK never leaves the provider's network.

I think that could actually manifest as io.ReadFull blocking on the line in question, since that's reading from a bufio.Reader wrapping the conn passed in to newSession, which comes from the io.ReadWriteCloser passed to Server or Client. If there's no deadline set on the underlying net.Conn, then when that firewalling occurs the read would block at least until the OS realizes the other side of the TCP connection has gone away.

So I think you're onto something here.
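For what it's worth, here is a rough sketch of one way to bound that block from the client side: a hypothetical wrapper (deadlineConn, dialSession, and the 90-second value are all made up for illustration) that refreshes a read deadline before each Read on the conn handed to yamux.

```go
package sketch

import (
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// deadlineConn refreshes a read deadline before every Read, so a silently dead
// peer surfaces as a timeout error instead of an indefinite block in io.ReadFull.
type deadlineConn struct {
	net.Conn
	readTimeout time.Duration
}

func (c *deadlineConn) Read(p []byte) (int, error) {
	if err := c.Conn.SetReadDeadline(time.Now().Add(c.readTimeout)); err != nil {
		return 0, err
	}
	return c.Conn.Read(p)
}

// dialSession wires the wrapper into a yamux client session.
func dialSession(addr string) (*yamux.Session, error) {
	conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		return nil, err
	}
	wrapped := &deadlineConn{Conn: conn, readTimeout: 90 * time.Second}
	return yamux.Client(wrapped, yamux.DefaultConfig())
}
```

The timeout would have to comfortably exceed the keepalive interval, or it would tear down healthy idle sessions.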


With all that said, we are still seeing the issue occur periodically, but now we don't have such a smoking gun in the form of yamux.

I'll see about grabbing some goroutine stacktraces the next time I'm kicking nodes.

And I'll get this PR's conflicts addressed.

@lattwood
Contributor Author

~ > nomad job status -json | jq -r '.[].Allocations[] | select(.ClientStatus == "pending") | .NodeName' | wc -l
      37

How many do ya want? :D

@schmichael
Member

> With all that said, we are still seeing the issue occur periodically, but now we don't have such a smoking gun in the form of yamux.

Hm, phooey. Is it significantly better than without the yamux patch?

> How many do ya want? :D

Working on a distributed system written in Go means the team is pretty used to combing through huge mountains of goroutines, so share whatever you're able. 😁

@lattwood
Contributor Author

It has been better. Here's (what I think is) the relevant snippet from the goroutine stacks:

[screenshot: goroutine stack trace]

Blocked on a read due to no I/O deadline? And this is the goroutine running watchAllocations.

@lattwood
Contributor Author

lattwood commented Aug 21, 2024

@schmichael stacks sent.

It's a fork of Nomad 1.7.7 using this version of yamux- https://github.com/lattwood/yamux/tree/shutdown_edgecase (not the rebased branch for this PR)

#!/usr/bin/env bash
# get_stack.sh: grab a goroutine dump from a node via SIGQUIT

set -euo pipefail
IFS=$'\n\t'

target=$1
ssh "root@${target}" 'killall -SIGQUIT nomad && sleep 3'
ssh "root@${target}" 'journalctl --unit nomad.service --since "5 minutes ago" > gostacks.txt'
scp "root@${target}:gostacks.txt" "out/${target}_stacks.txt"

# invoked across every node with a pending allocation:
parallel -j16 './get_stack.sh {}' ::: $(nomad job status -json | jq -r '.[].Allocations[] | select(.ClientStatus == "pending") | .NodeName' | sort -u)

Stream.Close is called when the Session's keepalive fails. This call wakes up
Stream.Read & Stream.write waiters. The wakeup is via a non-blocking send to
recvNotifyCh/sendNotifyCh, which are unbuffered channels.

In Stream.Read this send can be attempted between the call to
s.stateLock.Unlock() and the select statement at the bottom of the
method. When this occurs, if deadlines haven't been set on the Stream in
question, the call to Stream.Read will block forever.

By adding Session.shutdownCh to the channels in the select statement,
this coordination gap is addressed.

This uses `goto START` instead of `return 0, io.EOF`, as that would potentially
leave a last read on the buffer in the Read case.

In hashicorp#31, the Read path was changed to move away from time.After. This
change was not reflected in the Write path, and this commit rectifies that.

When stopping a time.Timer, you need to check the return value of Stop and,
if the Timer has already fired, drain the channel to prevent a memory leak.
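A minimal sketch of that stop-and-drain pattern (hypothetical helper, not the yamux code):

```go
package main

import (
	"fmt"
	"time"
)

// waitWithTimeout waits for done or a timeout. On the way out it stops the timer
// and, if the timer already fired, drains its channel so no stale tick is left
// behind (which matters when timers are reused).
func waitWithTimeout(done <-chan struct{}, timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer func() {
		if !timer.Stop() {
			// Timer already fired: drain non-blockingly, in case the tick
			// was already consumed by the select below.
			select {
			case <-timer.C:
			default:
			}
		}
	}()

	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	done := make(chan struct{})
	go func() { time.Sleep(10 * time.Millisecond); close(done) }()
	fmt.Println(waitWithTimeout(done, time.Second)) // prints: true
}
```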
@lattwood lattwood force-pushed the lattwood/handle_shutdown_edgecase branch from 5cd99ad to 84b3fc6 on August 21, 2024
@lattwood
Contributor Author

PR rebased against hashicorp/yamux/main

@lattwood
Contributor Author

(this patch does reduce the frequency of it getting blocked tho)

@lattwood
Contributor Author

@schmichael I think this is good to merge at this point?

Also, any chance you happened to look at that yamux fork mentioned on the Nomad issue about potentially swapping out for SPDY, in terms of compatibility issues? If it doesn't seem like a horrible idea, I think we could be open to running a Nomad fork on our end to kick the tires on it as it relates to solving the issue we're experiencing.

@lattwood
Contributor Author

lattwood commented Sep 3, 2024

Friendly poke @schmichael

@schmichael
Member

Peeked at a few of the goroutine dumps you sent, and something is clearly blocking Session.recvLoop in io.ReadFull, which without this PR could cause Stream.Read to block even when the session is closed.

Merging! Thanks for your extensive help and even more extensive patience @lattwood. We can continue debugging on the Nomad issue.

@schmichael schmichael merged commit b5c3b44 into hashicorp:master Sep 5, 2024
3 checks passed
@lattwood lattwood deleted the lattwood/handle_shutdown_edgecase branch September 5, 2024 18:41
schmichael added a commit to hashicorp/nomad that referenced this pull request Sep 24, 2024
The main point of this dependency upgrade is to pull in the fixes in
hashicorp/yamux#127 which prevents leaking deadlocked goroutines. It has
been observed to improve the issue in #23305 but does not
seem sufficient to fix it entirely.

Since touching yamux is a rare and scary event, I do **not** intend to
backport this. If we discover the improvements are stable and
significant enough, or if further fixes land in yamux, backporting can
be done at that time.
Successfully merging this pull request may close these issues.

pending allocations stuck in pending state after adoption by a new deployment