
Streams should check for Session shutdown when waiting for data & clean up timers #127

Merged

Conversation

lattwood
Contributor

@lattwood lattwood commented Jun 13, 2024

Stream.Close is called when the Session's keepalive fails. This call wakes up Stream.Read and Stream.Write waiters. The wakeup is via a non-blocking send to recvNotifyCh/sendNotifyCh, which are unbuffered channels.

In Stream.Read this send can be attempted between the call to s.stateLock.Unlock() and the select statement at the bottom of the method. When this occurs, if deadlines haven't been set on the Stream in question, the call to Stream.Read will block forever.

By adding Session.shutdownCh to the channels in the select statement, this coordination gap is addressed.

This uses `goto START` instead of `return 0, io.EOF`, as returning immediately could, in the Read case, leave a final read's worth of data sitting unread in the buffer.
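A minimal sketch of the shape of the change (simplified, hypothetical types; not the actual yamux source or the exact diff): the wait at the bottom of Read gains a case on the Session's shutdown channel, and both wakeups loop back to START so buffered data is still returned before EOF.

```go
package sketch

import (
	"bytes"
	"errors"
	"io"
	"sync"
	"time"
)

// ErrTimeout mirrors the error a read deadline would produce in this sketch.
var ErrTimeout = errors.New("i/o deadline reached")

// stream is a deliberately simplified stand-in for yamux's Stream.
type stream struct {
	recvLock     sync.Mutex
	recvBuf      bytes.Buffer
	recvNotifyCh chan struct{} // signalled when new data is received
	shutdownCh   chan struct{} // the Session's shutdown channel
	readDeadline time.Time
}

func (s *stream) Read(b []byte) (int, error) {
START:
	// Return any buffered data first, so a final read is never lost,
	// even if the session has already shut down.
	s.recvLock.Lock()
	if s.recvBuf.Len() > 0 {
		n, _ := s.recvBuf.Read(b)
		s.recvLock.Unlock()
		return n, nil
	}
	s.recvLock.Unlock()

	// Only once the buffer is drained does shutdown turn into EOF.
	select {
	case <-s.shutdownCh:
		return 0, io.EOF
	default:
	}

	var timeout <-chan time.Time
	if !s.readDeadline.IsZero() {
		timeout = time.After(time.Until(s.readDeadline))
	}
	select {
	case <-s.recvNotifyCh:
		goto START
	case <-s.shutdownCh:
		goto START // the new case: wake up, re-check the buffer, then EOF
	case <-timeout:
		return 0, ErrTimeout
	}
}
```

The `goto` keeps the EOF decision in one place: shutdown only turns into io.EOF once the receive buffer has been fully drained.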

Should fix hashicorp/nomad#23305.
It will definitely fix a bug I haven't filed yet, where allocations get stuck in the pending state until the Nomad Agent is restarted.


The issue was discovered by sending a "broken" Nomad Agent a SIGQUIT and seeing that the goroutine responsible for fetching allocation updates from the Nomad Servers had been blocked in the select statement at the bottom of Stream.Read for over 24 hours (or, more specifically, for as long as the allocation had been in the pending state), with a lot of msgpack and jobspec-field frames in the middle of the stack trace.

@lattwood
Contributor Author

lattwood commented Jun 13, 2024

I need to update this PR: I just realized the goto implementation would protect against missing a final read in the happy path on shutdown.

Done

Member

@schmichael schmichael left a comment


Thank you for your excellent PR and debugging work!

(Review comment on stream.go: resolved)
@lattwood
Contributor Author

lattwood commented Jun 17, 2024

One thing that does alarm me is how Stream.recvNotifyCh is used.

It's only ever blocked on in Read. The only place that should need to unblock it is in Stream.recvData, called by Session.handleStreamMessage when it's de-multiplexing the connection into individual streams.

But it's also used for...

  • after SetReadDeadline is called (mostly makes sense, this allows for changing the timeout settings while something is calling Stream.Read)
  • asynchronously after Stream.Read returns, through defer asyncNotify(s.recvNotifyCh)
    • yamux Streams are not threadsafe, so there is no reason that should ever wake anything up, because there should only be one call to Read at a time anyway (and I don't think Go guarantees that the first goroutine to block on a channel is the first one to get a message)
    • this is the one that makes me think there are more synchronization bugs in the code :(
  • inside notifyWaiting, which also does non-blocking sends to other synchronization channels.
    • used as a catch-all around connection state changes.
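For reference, the notify helpers being discussed look roughly like this (paraphrased; names and exact receivers may differ from the real stream.go/util.go):

```go
package sketch

// asyncNotify does a non-blocking send: if the capacity-1 channel already holds
// a pending notification, the send is simply dropped.
func asyncNotify(ch chan struct{}) {
	select {
	case ch <- struct{}{}:
	default:
	}
}

// notifyWaiting is the catch-all used around stream state changes: it pokes both
// the read-side and write-side waiters.
func notifyWaiting(recvNotifyCh, sendNotifyCh chan struct{}) {
	asyncNotify(recvNotifyCh)
	asyncNotify(sendNotifyCh)
}
```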

@lattwood lattwood force-pushed the lattwood/handle_shutdown_edgecase branch from 9aa57cd to 5cd99ad on June 17, 2024
@lattwood lattwood changed the title from "Streams should check for Session shutdown when waiting for data" to "Streams should check for Session shutdown when waiting for data & clean up timers" on Jun 17, 2024
@lattwood
Contributor Author

PR feedback addressed, I think.

@lattwood
Contributor Author

https://github.com/lattwood/nomad/releases/tag/v1.7.8-yamux-fork

For anyone following along at home, there's a zip with a Linux amd64 Nomad build with this fix.

If you're paranoid, I recommend just building it yourself, however.

@lattwood
Contributor Author

lattwood commented Jun 18, 2024

So all hell broke loose while rolling out the custom Nomad binary.

Now that things have stabilized, going to see if this issue continues to appear.

@lattwood
Contributor Author

lattwood commented Jun 20, 2024

All hell broke loose due to a provider network outage.

After 24 hours with this custom fork of Nomad deployed to both a ~100 node cluster, and a ~2300 node cluster, we do not have any stuck/pending allocs.

Confirmed, this fixes hashicorp/nomad#23305.

@lattwood
Contributor Author

*confirmed, it improves the bug.

still happening, sadly, and the output of killall -SIGQUIT nomad points to a hung Read on the yamux tunnel again.

When I'm back from vacation there's a good chance I'm going to just shoehorn a timer of some kind into yamux's Read method to get it out of that select statement at the bottom of the method.

Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"yamux: keepalive failed: i/o deadline reached","@module":"client","@timestamp":"2024-06-26T17:17:21.287415Z"}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"yamux: Failed to read stream data: read tcp [IPv6]:4647: use of closed network connection","@module":"client","@timestamp":"2024-06-26T17:17:21.288316Z"}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"warn","@message":"yamux: failed to send go away: session shutdown","@module":"client","@timestamp":"2024-06-26T17:17:21.301811Z"}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2024-06-26T17:17:21.322083Z","error":"rpc error: EOF","rpc":"Node.Register","server":{"IP":"2a05:IPV6","Port":4647,"Zone":""}}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"error performing RPC to server which is not safe to automatically retry","@module":"client.rpc","@timestamp":"2024-06-26T17:17:21.322195Z","error":"rpc error: EOF","rpc":"Node.Register","server":{"IP":"2a05:IPV6","Po>
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"error registering","@module":"client","@timestamp":"2024-06-26T17:17:21.322214Z","error":"rpc error: EOF"}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2024-06-26T17:17:21.322165Z","error":"rpc error: EOF","rpc":"Alloc.GetAllocs","server":{"IP":"2a05:IPv6","Port":4647,"Zone":""}}

Then the node gets marked as Down in Nomad, but the live CPU utilization stuff continues to get reported to the UI.

@schmichael
Member

Thanks for the update @lattwood! I just wanted to let you know we still intend to merge this PR after some more internal validation across teams since many HashiCorp products rely on yamux. We're also planning on at least improving the archaic infrastructure to pass go test -race and go vet.

Lots of vacations this time of year for us as well, so it will probably be a couple weeks before we're able to merge this.

@schmichael
Member

schmichael commented Aug 2, 2024

Sorry for the delay! I've finally been testing this patch. Playing around for a while with tc and nodesim I haven't discovered any problems with the patch, but unfortunately I haven't reproduced the original Nomad issue either. I'm not too surprised I haven't reproduced this as there are a lot of variables at play and reproducing production network/compute conditions to cause race conditions is extremely challenging.

The fact that this patch seems to have solved your issue makes me lean toward merging this even if we can't come to a satisfactory explanation of why.

> The wakeup is via a non-blocking send to recvNotifyCh/sendNotifyCh, which are **unbuffered** channels.

(Emphasis mine) These are buffered channels: https://github.com/hashicorp/yamux/blob/v0.1.1/stream.go#L75-L76

I don't see a way for Stream.Read to miss a send on recvNotifyCh as there's only one receive: https://github.com/hashicorp/yamux/blob/v0.1.1/stream.go#L146
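A tiny standalone Go demo (not yamux code) of why that send can't be lost: a non-blocking send into a capacity-1 channel is retained for a receiver that only starts waiting later.

```go
package main

import "fmt"

func main() {
	notifyCh := make(chan struct{}, 1)

	// Non-blocking send (the asyncNotify pattern); a no-op if a
	// notification is already pending.
	select {
	case notifyCh <- struct{}{}:
	default:
	}

	// Nobody was receiving at send time, yet this does not block:
	// the notification is sitting in the buffer.
	<-notifyCh
	fmt.Println("notification delivered")
}
```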

I know it's been a long time, but do you still have a goroutine dump handy? Is it possible another goroutine was blocked in session.go for a similar amount of time? It seems likely not, but perhaps there's a way Session.Close() can get deadlocked between close(s.shutdownCh) (which your PR would now check and avoid the deadlock) and stream.forceClose() (which should always send on recvNotifyCh in a way that unblocks Stream.Read()).

I tried to write a test to exercise the case where recvNotifyCh is insufficient to notice disconnects, but I can't get it to fail on v0.1.1 as I would expect. Any chance you could take a look and see if you can find a way to make it fail? https://gist.github.com/schmichael/eae76f127aa31475fde152a47ef4a38b

I sprinkled some fmt.Println()s into Stream.Read() as well to try to ensure the Read was waiting before I closed the Session, but I might be missing something.

@schmichael
Member

Argh sorry that #130 caused a merge conflict. Tests now pass with the race detector enabled, and we finally enabled GHA CI!

We've been testing your patch more internally, so I wanted to leave a quick update:

  1. My test above tried to exercise the bug by explicitly calling Session.Close, but @rboyer pointed out a missed yamux heartbeat is probably key to exercising the bug.
  2. @rboyer has observed a (the?) bug in his own test, but not consistently yet. So far it appears this patch fixes it!
  3. #128 (Concurrent calls to Stream.Read can return different data) may be relevant here. If Nomad is doing that, then that's a bug in Nomad we need to fix. I intend to dig into that a bit. That being said, even if there's a bug due to yamux misuse in the Nomad code base, I'm not opposed to merging a patch in yamux that makes the misuse less pathological.

I'm hoping to get this all finished up this week so it makes it into the next Nomad patch release.

Copy link
Member

@schmichael schmichael left a comment


After a lot of attempts I have been unable to cause the bug this appears to fix.

In theory if io.ReadFull blocked indefinitely that could prevent Session.Close() from closing each Stream. This would leave the yamux Session in the precise state being seen here: Session.shutdownCh is closed, but individual streams have not been notified.

If that case is possible, this patch fixes it. Unfortunately the only way I was able to replicate that case was when disabling client write timeouts (as well as carefully blocking server writes to deadlock keepalive handling) which is not the case in the Nomad environment in the original bug report: hashicorp/nomad#23305
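A self-contained toy model of that state, assuming the ordering described above (shutdown channel closed, then Close gets stuck before the per-stream notification); this is illustrative only, not the actual yamux source:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	shutdownCh := make(chan struct{})      // session-wide; closed early in Close()
	recvDoneCh := make(chan struct{})      // closed when the recv loop exits (never, here)
	recvNotifyCh := make(chan struct{}, 1) // per-stream notify, sent by forceClose()

	go func() { // toy Close()
		close(shutdownCh)
		<-recvDoneCh               // recv loop stuck in its read, so we never get past this...
		recvNotifyCh <- struct{}{} // ...and the stream is never notified
	}()

	// A reader waiting only on recvNotifyCh would hang here forever.
	// Adding shutdownCh to the select (this PR) lets it wake up.
	select {
	case <-recvNotifyCh:
		fmt.Println("woken by stream notification")
	case <-shutdownCh:
		fmt.Println("woken by session shutdown") // this is what prints
	case <-time.After(3 * time.Second):
		fmt.Println("stuck: the pre-patch behaviour")
	}
}
```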

Goroutine dumps observing this bug would be extremely useful and may be reported privately to: nomad-oss-debug@hashicorp.com

After extensive testing I have found no downside to accepting this patch and intend to merge it. Leaving open a few more days to let other folks have a chance to review.

See #131 for some of the changes I made to the test suite while testing this. Note that those changes are not exhaustive: I also did a lot of manual testing directly with yamux as well as via Nomad. At no point could I exercise either a bug or a problem with this PR.

Thanks @lattwood, and I'm very sorry for the long delay!

@lattwood
Contributor Author

@schmichael so, without revealing too much: what we're doing with Nomad regularly sees our hosting providers firewall off traffic between the Nomad Masters and the Nomad Agents for upwards of 30 minutes at a time. Agents can still initiate TCP connections to the Masters when this happens, but all traffic from the Masters to the Agents is blocked. So the client SYN makes it to the server, but the server SYN-ACK never leaves the provider's network.

I think that could actually manifest as io.ReadFull blocking on the line in question, since that's reading from a bufio.Reader wrapping the conn passed in to newSession, which comes from the io.ReadWriteCloser passed to Server or Client. If there's no deadline set on the underlying net.Conn, then when that firewalling occurs the read would block at least until the OS realizes the other side of the TCP connection has gone away.

So I think you're onto something here.
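For what it's worth, here is a rough sketch of one way to bound that block from the client side: a hypothetical wrapper (deadlineConn, dialSession, and the 90-second value are all made up for illustration) that refreshes a read deadline before each Read on the conn handed to yamux.

```go
package sketch

import (
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// deadlineConn refreshes a read deadline before every Read, so a silently dead
// peer surfaces as a timeout error instead of an indefinite block in io.ReadFull.
type deadlineConn struct {
	net.Conn
	readTimeout time.Duration
}

func (c *deadlineConn) Read(p []byte) (int, error) {
	if err := c.Conn.SetReadDeadline(time.Now().Add(c.readTimeout)); err != nil {
		return 0, err
	}
	return c.Conn.Read(p)
}

// dialSession wires the wrapper into a yamux client session.
func dialSession(addr string) (*yamux.Session, error) {
	conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		return nil, err
	}
	wrapped := &deadlineConn{Conn: conn, readTimeout: 90 * time.Second}
	return yamux.Client(wrapped, yamux.DefaultConfig())
}
```

The timeout would have to comfortably exceed the keepalive interval, or it would tear down healthy idle sessions.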


With all that said, we are still seeing the issue occur periodically, but now we don't have such a smoking gun in the form of yamux.

I'll see about grabbing some goroutine stacktraces the next time I'm kicking nodes.

And I'll get this PR's conflicts addressed.

@lattwood
Contributor Author

~ > nomad job status -json | jq -r '.[].Allocations[] | select(.ClientStatus == "pending") | .NodeName' | wc -l
      37

How many do ya want? :D

@schmichael
Member

> With all that said, we are still seeing the issue occur periodically, but now we don't have such a smoking gun in the form of yamux.

Hm, phooey. Is it significantly better than without the yamux patch?

> How many do ya want? :D

Working on a distributed system written in Go means the team is pretty used to combing through huge mountains of goroutines, so share whatever you're able. 😁

@lattwood
Contributor Author

It has been better. Here's (what I think is) the relevant snippet from the goroutine stacks:

[screenshot: goroutine stack trace]

Blocked on a read due to no I/O deadline? And this is the goroutine running watchAllocations.

@lattwood
Contributor Author

lattwood commented Aug 21, 2024

@schmichael stacks sent.

It's a fork of Nomad 1.7.7 using this version of yamux- https://github.com/lattwood/yamux/tree/shutdown_edgecase (not the rebased branch for this PR)

#!/usr/bin/env bash
# get_stack.sh: grab a goroutine dump from a node via SIGQUIT

set -euo pipefail
IFS=$'\n\t'

target=$1
ssh "root@${target}" 'killall -SIGQUIT nomad && sleep 3'
ssh "root@${target}" 'journalctl --unit nomad.service --since "5 minutes ago" > gostacks.txt'
scp "root@${target}:gostacks.txt" "out/${target}_stacks.txt"

# invoked across every node with a pending allocation:
parallel -j16 './get_stack.sh {}' ::: $(nomad job status -json | jq -r '.[].Allocations[] | select(.ClientStatus == "pending") | .NodeName' | sort -u)

Stream.Close is called when the Session's keepalive fails. This call wakes up
Stream.Read & Stream.write waiters. The wakeup is via a non-blocking send to
recvNotifyCh/sendNotifyCh, which are unbuffered channels.

In Stream.Read this send can be attempted between the call to
s.stateLock.Unlock() and the select statement at the bottom of the
method. When this occurs, if deadlines haven't been set on the Stream in
question, the call to Stream.Read will block forever.

By adding Session.shutdownCh to the channels in the select statement,
this coordination gap is addressed.

This uses `goto START` instead of `return 0, io.EOF`, as that would potentially
leave a last read on the buffer in the Read case.

In hashicorp#31, the Read path was changed to move away from time.After. This
change was not reflected in the Write path, and this commit rectifies that.

When stopping a time.Timer, you need to check the return value of Stop and,
if the Timer has already fired, drain the channel to prevent a memory leak.
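A minimal sketch of that stop-and-drain pattern (hypothetical helper, not the yamux code):

```go
package main

import (
	"fmt"
	"time"
)

// waitWithTimeout waits for done or a timeout. On the way out it stops the timer
// and, if the timer already fired, drains its channel so no stale tick is left
// behind (which matters when timers are reused).
func waitWithTimeout(done <-chan struct{}, timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer func() {
		if !timer.Stop() {
			// Timer already fired: drain non-blockingly, in case the tick
			// was already consumed by the select below.
			select {
			case <-timer.C:
			default:
			}
		}
	}()

	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	done := make(chan struct{})
	go func() { time.Sleep(10 * time.Millisecond); close(done) }()
	fmt.Println(waitWithTimeout(done, time.Second)) // prints: true
}
```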
@lattwood lattwood force-pushed the lattwood/handle_shutdown_edgecase branch from 5cd99ad to 84b3fc6 on August 21, 2024
@lattwood
Contributor Author

PR rebased against hashicorp/yamux/main

@lattwood
Contributor Author

(this patch does reduce the frequency of it getting blocked tho)

@lattwood
Contributor Author

@schmichael I think this is good to merge at this point?

Also, any chance you happened to look at that yamux fork mentioned on the Nomad issue about potentially swapping out for SPDY, in terms of compatibility issues? If it doesn't seem like a horrible idea, I think we could be open to running a Nomad fork on our end to kick the tires on it as it relates to solving the issue we're experiencing.

@lattwood
Contributor Author

lattwood commented Sep 3, 2024

Friendly poke @schmichael

@schmichael
Member

Peeked at a few of the goroutine dumps you sent, and something is clearly blocking Session.recvLoop in io.ReadFull, which without this PR could cause Stream.Read to block even when the session is closed.

Merging! Thanks for your extensive help and even more extensive patience @lattwood. We can continue debugging on the Nomad issue.

@schmichael schmichael merged commit b5c3b44 into hashicorp:master Sep 5, 2024
3 checks passed
@lattwood lattwood deleted the lattwood/handle_shutdown_edgecase branch September 5, 2024 18:41
schmichael added a commit to hashicorp/nomad that referenced this pull request Sep 24, 2024
The main point of this dependency upgrade is to pull in the fixes in
hashicorp/yamux#127 which prevents leaking deadlocked goroutines. It has
been observed to improve the issue in #23305 but does not
seem sufficient to fix it entirely.

Since touching yamux is a rare and scary event, I do **not** intend to
backport this. If we discover the improvements are stable and
significant enough, or if further fixes land in yamux, backporting can
be done at that time.
Successfully merging this pull request may close these issues.

pending allocations stuck in pending state after adoption by a new deployment