Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(net): Avoid potential concurrency bugs in outbound handshakes #6869

Merged
merged 6 commits into from
Jun 8, 2023

Conversation

teor2345
Copy link
Collaborator

@teor2345 teor2345 commented Jun 8, 2023

Motivation

This PR fixes a bunch of potential concurrency bugs in Zebra's outbound handshake code.

Part of #6763.

This is not a blocker for the first stable release.

Complex Code or Requirements

This is concurrent code, but the final code should be a lot easier to reason about than the previous code, because it's all running concurrently in independent tasks.

Solution

  • Stop sending peer errors on the PeerSet channel
    • this fixes a potential deadlock or panic when the channel gets full
  • Move locking out of the cralwer select!
    • this avoids using the address book lock in a select! statement, which is very hard to reason about
  • Move report_failed() out of the CandidateSet
    • another move that makes it easier to reason about concurrency, because the code is now in the same file with no other locking
  • Make CandidateSet Send
    • just changes to generic bounds, it seems the compiler has got smarter about this since I last tried (or I have!)
  • Make all CandidateSet operations concurrent
    • this code has caused hang and deadlock bugs in the past, now it is fully concurrent, so there are no weird dependencies
  • Reduce the gap between handshakes and peer set updates
    • this is another concurrency risk - we used to delay updating the peer set or address book until the next time the crawler task looped around, now we just do it concurrently in the handshake task
  • Exit the crawler task on shutdown
    • when channels are disconnected, exit the task, rather than panicking or hanging

Testing

Our existing unit and integration tests have good coverage of this code, and "is this task concurrent" is hard to test.

Review

@arya2 and I discussed part of this design about a week ago.

I can split this into 2-3 PRs if it would be easier, or do a video review.

Reviewer Checklist

  • Will the PR name make sense to users?
    • Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
  • Are the PR labels correct?
  • Does the code do what the ticket and PR says?
    • Does it change concurrent code, unsafe code, or consensus rules?
  • How do you know it works? Does it have tests?

Follow Up Work

Try diagnosing any remaining bugs in #6763.

@teor2345 teor2345 requested a review from arya2 June 8, 2023 05:55
@teor2345 teor2345 requested a review from a team as a code owner June 8, 2023 05:55
@teor2345 teor2345 self-assigned this Jun 8, 2023
@github-actions github-actions bot added the C-bug Category: This is a bug label Jun 8, 2023
@teor2345 teor2345 added P-Medium ⚡ C-security Category: Security issues I-hang A Zebra component stops responding to requests A-network Area: Network protocol updates or fixes A-concurrency Area: Async code, needs extra work to make it work properly. labels Jun 8, 2023
@codecov
Copy link

codecov bot commented Jun 8, 2023

Codecov Report

Merging #6869 (1e643e1) into main (8bd5a98) will decrease coverage by 0.06%.
The diff coverage is 76.92%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6869      +/-   ##
==========================================
- Coverage   77.73%   77.68%   -0.06%     
==========================================
  Files         310      310              
  Lines       41416    41441      +25     
==========================================
- Hits        32196    32194       -2     
- Misses       9220     9247      +27     

Copy link
Contributor

@arya2 arya2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is marvelous, thank you.

zebra-network/src/peer_set/initialize.rs Show resolved Hide resolved
zebra-network/src/peer_set/initialize.rs Show resolved Hide resolved
zebra-network/src/peer_set/initialize.rs Outdated Show resolved Hide resolved
zebra-network/src/peer_set/initialize.rs Show resolved Hide resolved
zebra-network/src/peer_set/initialize.rs Show resolved Hide resolved
zebra-network/src/peer_set/initialize.rs Show resolved Hide resolved
zebra-network/src/peer_set/initialize.rs Show resolved Hide resolved
mergify bot added a commit that referenced this pull request Jun 8, 2023
@mergify mergify bot merged commit 92077f4 into main Jun 8, 2023
@mergify mergify bot deleted the concurrent-peer-crawl branch June 8, 2023 23:43
dconnolly pushed a commit that referenced this pull request Jun 12, 2023
)

* Stop sending peer errors on the PeerSet channel, to respect send limits

* Move locking out of the cralwer select!, potential deadlock or hang risk

* Move report_failed() out of the CandidateSet, reducing concurrency risks

* Make CandidateSet Send

* Make all CandidateSet operations concurrent, previous hand/deadlock bug

* Reduce the gap between handshakes and peer set updates, and exit the task on shutdown
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-concurrency Area: Async code, needs extra work to make it work properly. A-network Area: Network protocol updates or fixes C-bug Category: This is a bug C-security Category: Security issues I-hang A Zebra component stops responding to requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants