Be more lenient with streams in the `pending_send` queue. #261

goffrie · 2018-04-20T21:15:21Z

The is_peer_reset() check doesn't quite cover all the cases where we call
clear_queue, such as when we call recv_err. Instead of trying to make the
check more precise, let's gracefully handle spurious entries in the queue.

This fixes the issue I mentioned at #258 (comment).
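The approach described above can be sketched outside of h2. This is a minimal illustration, not the actual `prioritize.rs` code: `Prioritize`, `Stream`, `queue_frame`, and `pop_frame` are hypothetical stand-ins, and only the names `pending_send` and `clear_queue` come from the PR. Clearing a stream leaves its id in the queue, and the pop path skips entries that no longer have anything to send.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical stand-in for a stream's send-side state (not h2's real type).
#[derive(Default)]
struct Stream {
    /// Frames waiting to be written for this stream.
    frames: VecDeque<&'static str>,
}

/// Hypothetical scheduler: `pending_send` holds ids of streams that *may* have frames.
#[derive(Default)]
struct Prioritize {
    streams: HashMap<u32, Stream>,
    pending_send: VecDeque<u32>,
}

impl Prioritize {
    /// Buffers a frame and schedules the stream if it had nothing queued.
    fn queue_frame(&mut self, id: u32, frame: &'static str) {
        let stream = self.streams.entry(id).or_default();
        if stream.frames.is_empty() {
            self.pending_send.push_back(id);
        }
        stream.frames.push_back(frame);
    }

    /// Drops a stream's frames but leaves its id in `pending_send`,
    /// avoiding an O(N) scan of the queue.
    fn clear_queue(&mut self, id: u32) {
        if let Some(stream) = self.streams.get_mut(&id) {
            stream.frames.clear();
        }
    }

    /// Pops the next frame, silently skipping dangling queue entries.
    fn pop_frame(&mut self) -> Option<(u32, &'static str)> {
        while let Some(id) = self.pending_send.pop_front() {
            let stream = match self.streams.get_mut(&id) {
                Some(stream) => stream,
                None => continue, // stream state already gone
            };
            if let Some(frame) = stream.frames.pop_front() {
                if !stream.frames.is_empty() {
                    // More to send later: reschedule at the back.
                    self.pending_send.push_back(id);
                }
                return Some((id, frame));
            }
            // No frame left: a dangling entry, ignore it and keep going.
        }
        None
    }
}
```

The trade-off in this sketch is that the queue can temporarily contain stale ids, in exchange for `clear_queue` staying cheap.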

hawkw · 2018-04-20T23:37:53Z

At a glance, this seems right to me; it's similar to the first thing I tried while working on #247, which appeared to work correctly. However, I'd like to hear from @carllerche before approving it.

carllerche

I'm not necessarily against this change. But, it is a step away from trying to be explicit about state transitions and tracking the current expected state w/ assertions (usually debug_assert).

I don't know if anybody has thoughts on that change in strategy.

carllerche · 2018-04-23T17:29:32Z

src/proto/streams/prioritize.rs

+                                // in clear_queue(). Instead of doing O(N) traversal through queue
+                                // to remove, lets just ignore the stream here.
+                                trace!("removing dangling stream from pending_send");
+                                counts.transition_after(stream, is_counted, is_pending_reset);


Could you explain why this needs to be called here? I don't think this was called before (and I could believe it was another bug).

goffrie · 2018-04-25

I agree that it was not called before. I think that it should be; otherwise we could leak streams. It's possible that the pending_send queue here was holding onto the last reference to the stream.

carllerche · 2018-04-26

I think it is highly likely that this change is correct, but I'd like to see if we can figure out a failing test case.
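The leak concern in this thread can be illustrated with a small model. Everything here is an assumption made for illustration: the `Counts` and `Stream` types, the simplified one-argument `transition_after`, and the `pop_dangling` helper are hypothetical and do not match h2's real signatures. The sketch only shows why skipping the release check can leave a stream counted forever when the queue held its last reference.

```rust
/// Hypothetical per-connection accounting (not h2's real `Counts`).
struct Counts {
    active_streams: usize,
}

/// Hypothetical stream bookkeeping.
struct Stream {
    closed: bool,
    /// Number of holders (handles, queues) still referencing the stream.
    ref_count: usize,
    /// Whether this stream is currently included in `active_streams`.
    is_counted: bool,
}

impl Counts {
    /// Simplified release check: once a closed stream has no holders left,
    /// stop counting it against the connection.
    fn transition_after(&mut self, stream: &mut Stream) {
        if stream.closed && stream.ref_count == 0 && stream.is_counted {
            stream.is_counted = false;
            self.active_streams -= 1;
        }
    }
}

/// Models popping a dangling entry from the queue. The queue gives up its
/// reference; `run_transition` toggles whether the release check runs.
fn pop_dangling(counts: &mut Counts, stream: &mut Stream, run_transition: bool) {
    stream.ref_count -= 1;
    if run_transition {
        counts.transition_after(stream);
    }
}
```

In this model, calling `pop_dangling` with `run_transition` set to `false` on a closed stream whose only holder was the queue leaves `active_streams` unchanged, which is the kind of leak described above.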

seanmonstar · 2018-04-24T18:56:34Z

The change looks fine, though I'm also curious about two things:

* Why is counts.transition_after called? It wasn't before; does it need to be?
* While I'm always happy to reduce crashes, having "spurious" streams makes me worry there's a bug somewhere. It'd probably be nice to at least have a debug_assert to look for bugs?

carllerche · 2018-04-26T17:11:40Z

I believe that this change is probably the right direction. Panicking in production is no good. It would be good to figure out how to catch any bugs that are let through though. At the very least a debug_assert is good.

goffrie · 2018-04-27T21:00:54Z

I added a debug_assert!, though I'm not sure what precisely you had in mind - this one's pretty general. (Notably, asserting is_reset() doesn't work because the stream can be in the Closed(EndStream) state and still get cleared.)
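To make the distinction concrete, here is a rough sketch of why an `is_reset()` assertion would be too strict while a general "closed" assertion holds. The `State` and `Cause` enums and the `check_dangling` helper are simplified, hypothetical versions written for this example; only the idea of a `Closed(EndStream)` state and the `is_reset()` name come from the comment above.

```rust
/// Hypothetical reasons a stream ended up closed.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Cause {
    EndStream,
    LocalReset,
    PeerReset,
}

/// Hypothetical, heavily simplified stream state.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Open,
    Closed(Cause),
}

impl State {
    /// True only when the stream was closed by a reset.
    fn is_reset(self) -> bool {
        matches!(
            self,
            State::Closed(Cause::LocalReset) | State::Closed(Cause::PeerReset)
        )
    }

    /// True for any closed state, including a clean end of stream.
    fn is_closed(self) -> bool {
        matches!(self, State::Closed(_))
    }
}

/// The general check: a dangling queue entry must belong to a closed stream.
/// Compiled out in release builds, so it cannot panic in production.
fn check_dangling(state: State) {
    debug_assert!(
        state.is_closed(),
        "dangling stream in pending_send should be closed, got {:?}",
        state
    );
}
```

A stream in `Closed(EndStream)` passes `check_dangling` here but would fail an assertion written in terms of `is_reset()`.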

carllerche · 2018-05-04T18:40:54Z

So I spent a bunch of time trying to reproduce the issue described in the comment.

All the paths that I tracked down via recv_err would not hit the issue.

Either recv_err is called because of a connection-level issue, at which point the code switches to GOAWAY and won't go through prioritize anymore, or recv_err is called when sending an explicit reset, in which case there will always be a frame to pop (the reset frame).

That said, I think that the consensus is that this is still a good change even if it does not fix any bugs.

Edit: This is referencing streams still being in pending_send even though there is no more data to send. I will dig into the counts a bit more now.

goffrie · 2018-05-04T18:45:21Z

If you run the fuzzer in #263 you can definitely run into panics in this code if this patch isn't applied.

carllerche · 2018-05-04T18:49:59Z

k, i will try that

carllerche · 2018-05-05T16:11:06Z

Thanks to the fuzzer (#263), I have identified the logic path that results in the panic.

Unfortunately, it relies on a very specific pattern of the underlying I/O's write fn returning Ready / NotReady, which means that writing a robust test is not really possible. Any change to the h2 implementation could break the test.

So, given that, I think the best course of action is just going to be to include the fuzzing run and call it a day.

I still need to verify the counts.transition_after(stream, is_counted, is_pending_reset); line. I believe that this does fix a bug, but I would like to confirm this first. This will be my next focus.

@goffrie Thanks for your patience. My assumption is that you are currently using a fork that includes this patch, so you are not currently blocked. If this assumption is incorrect, please let me know and we can merge the PR before I complete the work stated above.
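For readers wondering what "a very specific pattern of write calls" would look like in a test, here is a hedged sketch of a scripted writer. `ScriptedWriter` and `WriteOutcome` are hypothetical names invented for this example and are not part of h2 or its test support code. A test built on such a script would depend on the exact order in which the implementation issues writes, which is why it would be brittle.

```rust
use std::collections::VecDeque;

/// Hypothetical outcome of one write attempt.
#[derive(Debug, PartialEq)]
enum WriteOutcome {
    /// The write was accepted; carries the number of bytes taken.
    Ready(usize),
    /// The writer could not accept data on this attempt.
    NotReady,
}

/// Hypothetical writer whose readiness follows a fixed script.
struct ScriptedWriter {
    /// One entry per write attempt: `true` means ready, `false` means not ready.
    script: VecDeque<bool>,
    /// Everything that was accepted so far.
    written: Vec<u8>,
}

impl ScriptedWriter {
    fn new(script: &[bool]) -> Self {
        ScriptedWriter {
            script: script.iter().copied().collect(),
            written: Vec::new(),
        }
    }

    /// Consumes one script entry per call. Once the script is exhausted,
    /// every write is accepted.
    fn poll_write(&mut self, buf: &[u8]) -> WriteOutcome {
        match self.script.pop_front() {
            Some(false) => WriteOutcome::NotReady,
            _ => {
                self.written.extend_from_slice(buf);
                WriteOutcome::Ready(buf.len())
            }
        }
    }
}
```

Any refactoring that changes how many writes are attempted, or when, would shift which attempt sees `NotReady` and could stop such a test from exercising the intended path. Fuzzing avoids that coupling by trying many scripts.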

goffrie · 2018-05-05T17:10:40Z

That's correct; take your time. Thanks for your work on this :)

carllerche · 2018-05-06T17:55:49Z

So, I haven't reproed the issue w/ the missing count decrement, but I did find & fix a number of other related bugs (#273).

This patch adds a debug-level assertion checking that state associated with streams is successfully released. This is done by ensuring that no stream state remains when the connection inner state is dropped. This does not happen until all handles to the connection are dropped, which should result in all streams being dropped. This also pulls in a fix for one bug that is exposed with this change. The fix was first proposed as part of #261.

carllerche · 2018-05-06T18:21:34Z

Ok! I finally got an assertion that fails and is fixed by adding transition_after where you have added it.

I pulled in that specific change as part of 44c9005.

Once I'm done with all that work, this PR can be rebased on top of that.

This patch includes two new significant debug assertions:

* Assert stream counts are zero when the connection finalizes.
* Assert all stream state has been released when the connection is dropped.

These two assertions were added in an effort to test the fix provided by #261. In doing so, many related bugs have been discovered and fixed. The details related to these bugs can be found in #273.
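The second kind of assertion can be sketched roughly as follows. This is an illustration under assumptions, not the code from that commit: `Inner`, its fields, and the `release` and `is_clean` helpers are hypothetical. The idea is that the shared connection state checks on drop that nothing stream-related is left behind, using a debug-only assertion.

```rust
use std::collections::HashMap;

/// Hypothetical shared connection state.
struct Inner {
    /// Per-stream state keyed by stream id (the value is a placeholder).
    streams: HashMap<u32, &'static str>,
    /// Number of streams currently counted as active.
    active: usize,
}

impl Inner {
    /// Releases one stream's state and its slot in the active count.
    fn release(&mut self, id: u32) {
        if self.streams.remove(&id).is_some() {
            self.active -= 1;
        }
    }

    /// True when no stream state or counts remain.
    fn is_clean(&self) -> bool {
        self.streams.is_empty() && self.active == 0
    }
}

impl Drop for Inner {
    fn drop(&mut self) {
        // Skip the check while unwinding so a failing test does not
        // turn into a double panic.
        if !std::thread::panicking() {
            debug_assert!(self.is_clean(), "stream state leaked on drop");
        }
    }
}
```

Because `debug_assert!` is compiled out in release builds, a check like this can surface leaks in tests and fuzzing without adding a panic path in production.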

carllerche · 2018-05-09T22:10:14Z

Alright, I have rebased this change.

carllerche

This patch fixes a bug that is only exposed with a particular sequence of write calls returning Ready vs. NotReady.

Fuzzing catches this but there is no explicit test that does.

Be more lenient with streams in the pending_send queue.

f93fde9

goffrie mentioned this pull request Apr 20, 2018

Fuzzing with honggfuzz-rs #263

Closed

hawkw requested a review from carllerche April 20, 2018 23:33

carllerche reviewed Apr 23, 2018

carllerche mentioned this pull request Apr 23, 2018

Reset queued streams on recv_err #259

Closed

carllerche mentioned this pull request Apr 25, 2018

proxy can error due to HTTP/2 stream resets linkerd/linkerd2#754

Closed

Add a debug_assert that the stream must be in a closed state.

b7309eb

carllerche mentioned this pull request May 4, 2018

Add more stream state tests #271

Merged

carllerche mentioned this pull request May 6, 2018

Misc bug fixes #273

Merged

Merge branch 'master' into reset

474acaa

carllerche approved these changes May 9, 2018

carllerche merged commit 571bb14 into hyperium:master May 10, 2018

goffrie deleted the reset branch March 1, 2020 07:30
