Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Be more lenient with streams in the pending_send queue. #261

Merged
merged 3 commits into from
May 10, 2018

Conversation

goffrie
Copy link
Contributor

@goffrie goffrie commented Apr 20, 2018

The is_peer_reset() check doesn't quite cover all the cases where we call
clear_queue, such as when we call recv_err. Instead of trying to make the
check more precise, let's gracefully handle spurious entries in the queue.

This fixes the issue I mentioned at #258 (comment).

The `is_peer_reset()` check doesn't quite cover all the cases where we call
`clear_queue`, such as when we call `recv_err`. Instead of trying to make the
check more precise, let's gracefully handle spurious entries in the queue.
@hawkw hawkw requested a review from carllerche April 20, 2018 23:33
@hawkw
Copy link
Collaborator

hawkw commented Apr 20, 2018

At a glance, this seems right to me --- it's similar to the first thing I tried while working on #247, which appeared to work correctly. However, I'd like to hear from @carllerche before approving it.

Copy link
Collaborator

@carllerche carllerche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not necessarily against this change. But, it is a step away from trying to be explicit about state transitions and tracking the current expected state w/ assertions (usually debug_assert).

I don't know if anybody has thoughts on that change in strategy.

// in clear_queue(). Instead of doing O(N) traversal through queue
// to remove, lets just ignore the stream here.
trace!("removing dangling stream from pending_send");
counts.transition_after(stream, is_counted, is_pending_reset);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why this needs to be called here? I don't think this was called before (and I could believe it was another bug).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it was not called before. I think that it should be, otherwise we could leak streams - it's possible that the pending_send queue here was holding onto the last reference to the stream.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is highly likely that this change is correct, but i'd like to see if we can figure out a failing test case.

@seanmonstar
Copy link
Member

The change looks fine, though I'm also curious about two things:

  1. Why is counts.transition_after called? It wasn't before, does it need to be?
  2. While I'm always happy to reduce crashes, having "spurious" streams worries me there's a bug somewhere. It'd probably be nice to at least have a debug_assert to look for bugs?

@carllerche
Copy link
Collaborator

I believe that this change is probably the right direction. Panicking in production is no good. It would be good to figure out how to catch any bugs that are let through though. At the very least a debug_assert is good.

@goffrie
Copy link
Contributor Author

goffrie commented Apr 27, 2018

I added a debug_assert!, though I'm not sure what precisely you had in mind - this one's pretty general. (Notably, asserting is_reset() doesn't work because the stream can be in the Closed(EndStream) state and still get cleared.)

@carllerche
Copy link
Collaborator

carllerche commented May 4, 2018

So I spent a bunch of time trying to reproduce the issue described in the comment.

All the paths that I tracked down via recv_err would not hit the issue.

Either recv_err is called because of a connection level issue. At this point, the code switches to GOAWAY and won't go through prioritize anymore. Or, recv_err is called when sending an explicit reset, in which case there will always be a frame to pop (the reset frame).

That said, I think that the consensus is that this is still a good change even if it does not fix any bugs.

Edit: This is referencing streams still being in pending_send even though there is no more data to send. I will dig into the counts a bit more now.

@goffrie
Copy link
Contributor Author

goffrie commented May 4, 2018

If you run the fuzzer in #263 you can definitely run into panics in this code if this patch isn't applied.

@carllerche
Copy link
Collaborator

k, i will try that

@carllerche
Copy link
Collaborator

Thanks to the fuzzer (#263), I have identified the logic path that results in the panic.

Unfortunately, it relies on a very specific pattern of the underlying I/O's write fn returning Ready / NotReady, which means that writing a robust test is not really possible. Any change to the h2 implementation could break the test.

So, given that, I think the best course of action is just going to be to include the fuzzing run and call it a day.

I still need to verify the counts.transition_after(stream, is_counted, is_pending_reset); line. I believe that this does fix a bug, but I would like to confirm this first. This will be my next focus.

@goffrie Thanks for your patience. My assumption is that you are currently using a fork that includes this patch, so you are not currently blocked. If this assumption is incorrect, please let me know and we can merge the PR before I complete the work stated above.

@goffrie
Copy link
Contributor Author

goffrie commented May 5, 2018 via email

@carllerche carllerche mentioned this pull request May 6, 2018
@carllerche
Copy link
Collaborator

So, I haven't reproed the issue w/ the missing count decrement, but I did find & fix a number of other related bugs (#273).

carllerche added a commit that referenced this pull request May 6, 2018
This patch adds a debug level assertion checking that state associated
with streams is successfully released. This is done by ensuring that no
stream state remains when the connection inner state is dropped. This
does not happen until all handles to the connection drops, which should
result in all streams being dropped.

This also pulls in a fix for one bug that is exposed with this change.
The fix is first proposed as part of #261.
@carllerche
Copy link
Collaborator

Ok! I finally got an assertion that fails and is fixed by adding transition_after where you have added it.

I pulled in that specific change as part of 44c9005.

Once I'm done with all that work, this PR can be rebased on top of that.

carllerche added a commit that referenced this pull request May 9, 2018
This patch includes two new significant debug assertions:

* Assert stream counts are zero when the connection finalizes.
* Assert all stream state has been released when the connection is 
  dropped.

These two assertions were added in an effort to test the fix provided
by #261. In doing so, many related bugs have been discovered and fixed.
The details related to these bugs can be found in #273.
@carllerche
Copy link
Collaborator

Alright, I have rebased this change.

Copy link
Collaborator

@carllerche carllerche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch fixes a bug that is only exposed with a particular sequence of write calls returning Ready vs. NotReady.

Fuzzing catches this but there is no explicit test that does.

@carllerche carllerche merged commit 571bb14 into hyperium:master May 10, 2018
@goffrie goffrie deleted the reset branch March 1, 2020 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants