fix(hang): Stop blocking some Zebra futures for up to a minute using a CPU busy-loop, Credit: Ziggurat Team (#6763), james_katz (#7000) #7103

Merged: 3 commits, Jun 30, 2023

Conversation

@teor2345 teor2345 commented Jun 29, 2023

Motivation

Since PR #6235 was merged in March 2023, Zebra's progress logging future has been busy-waiting in a loop for at least 55 seconds out of every minute.

Since this loop doesn't call into the tokio executor, it never yields, so it blocks all other futures running on that thread.

This is a likely cause for these bugs:
Close #6763
Close #7000

It might also be causing these bugs, or making them worse:

Specifications

https://docs.rs/tokio/latest/tokio/#cpu-bound-tasks-and-blocking-code

Complex Code or Requirements

tokio uses cooperative multitasking with internal yields, but only tokio APIs can yield to tokio:
https://docs.rs/tokio/latest/tokio/task/index.html#cooperative-scheduling
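
To make the starvation concrete, here is a toy single-threaded round-robin executor built only on the standard library. It is not tokio and not Zebra code: `NoopWaker`, `YieldOnce`, `worker`, `run_round_robin` and `interleaving` are hypothetical names invented for this sketch, and the executor ignores wakers and simply re-polls every pending task.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A waker that does nothing: the toy executor re-polls every pending task.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical future that returns `Pending` once (handing control back to
/// the executor) and completes on the next poll.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

type Log = Rc<RefCell<Vec<String>>>;

/// Records one log entry per step. A cooperative worker awaits between
/// steps; a non-cooperative one runs all its steps inside a single poll.
async fn worker(name: &'static str, steps: usize, cooperative: bool, log: Log) {
    for step in 0..steps {
        log.borrow_mut().push(format!("{name}{step}"));
        if cooperative {
            YieldOnce(false).await;
        }
    }
}

/// Polls each task in turn until all of them have completed.
fn run_round_robin(mut tasks: Vec<Pin<Box<dyn Future<Output = ()>>>>) {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    while !tasks.is_empty() {
        // Keep only the tasks that are still pending after this poll.
        tasks.retain_mut(|task| task.as_mut().poll(&mut cx).is_pending());
    }
}

/// Runs two workers on one thread and returns the order of their log entries.
fn interleaving(cooperative: bool) -> Vec<String> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let tasks: Vec<Pin<Box<dyn Future<Output = ()>>>> = vec![
        Box::pin(worker("a", 2, cooperative, log.clone())),
        Box::pin(worker("b", 2, cooperative, log.clone())),
    ];
    run_round_robin(tasks);
    let entries = log.borrow().clone();
    entries
}
```

`interleaving(true)` records `a0, b0, a1, b1`, while `interleaving(false)` records `a0, a1, b0, b1`: the worker that never awaits finishes all of its steps before the other task is polled at all. A busy-wait that lasts 55 seconds has the same effect on every future sharing its thread.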

Solution

Sleep before continuing the loop.
Use monotonic Instant times, so that changing the computer's clock doesn't cause progress updates to stop.
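
A minimal sketch of the two points above, assuming a fixed logging interval. This is not the PR's diff: `time_until_next_log` and `log_progress` are hypothetical helpers, and the sketch uses `std::thread::sleep` only so that it runs without dependencies. Inside a tokio task the wait would be `tokio::time::sleep(wait).await`, because a blocking sleep would stall the executor thread just like the busy-loop did.

```rust
use std::time::{Duration, Instant};

/// Returns how long to wait before the next progress log is due.
/// `Instant` is monotonic, so changing the system clock cannot make
/// `elapsed` jump backwards or forwards.
fn time_until_next_log(last_log: Instant, now: Instant, interval: Duration) -> Duration {
    let elapsed = now.saturating_duration_since(last_log);
    interval.saturating_sub(elapsed)
}

/// Emits `updates` progress updates, `interval` apart, and returns the time
/// since start at which each one fired.
fn log_progress(updates: usize, interval: Duration) -> Vec<Duration> {
    let start = Instant::now();
    let mut last_log = start;
    let mut logged_at = Vec::new();

    while logged_at.len() < updates {
        let wait = time_until_next_log(last_log, Instant::now(), interval);
        if !wait.is_zero() {
            // Sleep instead of spinning until the interval has passed.
            // Assumption: in async code this is `tokio::time::sleep(wait).await`.
            std::thread::sleep(wait);
            continue;
        }

        // The interval has elapsed: record the update and restart the timer.
        last_log = Instant::now();
        logged_at.push(last_log.duration_since(start));
    }

    logged_at
}
```

For example, `time_until_next_log(t0, t0 + Duration::from_secs(45), Duration::from_secs(60))` returns 15 seconds, and any `now` at or past the interval returns `Duration::ZERO`.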

Review

This seems like it might be a release blocker, given the number of bugs it impacts.

Reviewer Checklist

  • Will the PR name make sense to users?
    • Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
  • Are the PR labels correct?
  • Does the code do what the ticket and PR says?
    • Does it change concurrent code, unsafe code, or consensus rules?
  • How do you know it works? Does it have tests?

Follow Up Work

We should ask users to re-test and see if any of these bugs are still happening.

@teor2345 teor2345 added C-bug Category: This is a bug P-High 🔥 C-security Category: Security issues I-hang A Zebra component stops responding to requests I-heavy Problems with excessive memory, disk, or CPU usage A-network Area: Network protocol updates or fixes A-concurrency Area: Async code, needs extra work to make it work properly. labels Jun 29, 2023
@teor2345 teor2345 self-assigned this Jun 29, 2023
@teor2345 teor2345 requested a review from a team as a code owner June 29, 2023 06:35
@teor2345 teor2345 requested review from oxarbitrage and removed request for a team June 29, 2023 06:35
@teor2345

@mpguerra is it ok if we get this merged before the release?

It would finish off all our current network bugs, and get Zebra ready for re-testing.

@teor2345 teor2345 changed the title fix(hang): Stop busy-waiting in a future for 55 seconds every minute, Credit: Ziggurat Team (#6763), james_katz (#7000) fix(hang): Stop blocking some Zebra futures for up to a minute using a CPU busy-loop, Credit: Ziggurat Team (#6763), james_katz (#7000) Jun 29, 2023
@teor2345 teor2345 mentioned this pull request Jun 29, 2023
44 tasks
codecov bot commented Jun 29, 2023

Codecov Report

Merging #7103 (45dbb42) into main (455779c) will decrease coverage by 0.03%.
The diff coverage is 0.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7103      +/-   ##
==========================================
- Coverage   77.38%   77.35%   -0.03%     
==========================================
  Files         310      310              
  Lines       41795    41796       +1     
==========================================
- Hits        32343    32332      -11     
- Misses       9452     9464      +12     

conradoplg previously approved these changes Jun 29, 2023

Good catch! I monitored CPU usage before and after, and I think it shows the difference: before, there was a core at 100% most of the time.

Screenshot from 2023-06-29 11-08-35

Screenshot from 2023-06-29 11-06-50

@teor2345

> Good catch! I monitored CPU usage before and after, and I think it shows the difference: before, there was a core at 100% most of the time.

Thanks, I was going to check this!

@teor2345

I got the elapsed time calculation wrong; the tests should be fixed now.

teor2345 commented Jun 30, 2023

I can confirm: the only CPU-heavy threads left on my machine belong to my zcashd testnet miner. All my Zebra instances are at 50% CPU or less.

Screenshot 2023-06-30 at 10 15 13

mergify bot added a commit that referenced this pull request Jun 30, 2023
@mergify mergify bot merged commit 322cbec into main Jun 30, 2023
279 checks passed
@mergify mergify bot deleted the stop-futures-hangs branch June 30, 2023 16:58
mergify bot pushed a commit that referenced this pull request Jul 2, 2023
* Update license description in README for MIT-only crates

* Draft changelog with trivial issues

* Remove trivial issues

* Update changelog entries as of commit 2a31972 and PR #7103

* Update mainnet and testnet checkpoints as of 2023-06-30

* chore: Release

* Estimate release height for Zebra v1.0.1

Block height 2139118 at 2023-06-30 01:55:38 UTC
Release is likely to be 2023-07-01
2139118 + 1152 * 3 = 2142574

Then round up to the nearest 1000.
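
The height estimate above can be reproduced with a short sketch. The function names are hypothetical, and reading 1152 as blocks per day (86400 seconds divided by a 75-second target block spacing) and 3 as a number of days is an assumption: the commit message only gives the raw numbers.

```rust
/// Assumption: 1152 is the expected number of blocks per day
/// (86400 s / 75 s target spacing).
const BLOCKS_PER_DAY: u32 = 1152;

/// Adds the expected number of blocks for `days` to the current height.
fn estimate_release_height(current_height: u32, days: u32) -> u32 {
    current_height + BLOCKS_PER_DAY * days
}

/// Rounds `height` up to the next multiple of `multiple`
/// (heights that are already a multiple are unchanged).
fn round_up_to_multiple(height: u32, multiple: u32) -> u32 {
    (height + multiple - 1) / multiple * multiple
}
```

With the numbers from the commit message, `estimate_release_height(2139118, 3)` gives 2142574, and `round_up_to_multiple(2142574, 1000)` gives 2143000.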