Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ClientActor's message queues are sometimes very congested #6438

Open
marcelo-gonzalez opened this issue Mar 15, 2022 · 6 comments
Open

ClientActor's message queues are sometimes very congested #6438

marcelo-gonzalez opened this issue Mar 15, 2022 · 6 comments
Labels
C-bug Category: This is a bug

Comments

@marcelo-gonzalez
Copy link
Contributor

marcelo-gonzalez commented Mar 15, 2022

Recently we've seen a mainnet archival node built from b949483 process blocks too slowly to sync. What happens is that intermittently it'll fall behind and then stay around 500 blocks behind HEAD, as it doesn't process blocks fast enough to catch up after falling behind.

After adding logging like this:

diff --git a/chain/network/src/peer/peer_actor.rs b/chain/network/src/peer/peer_actor.rs
 index 3542deba4..103b1adbc 100644
 — a/chain/network/src/peer/peer_actor.rs
 +++ b/chain/network/src/peer/peer_actor.rs
 @@ -434,6 +434,8 @@ impl PeerActor 

 { NetworkClientMessages::PartialEncodedChunkRequest(request, msg_hash) } 

 RoutedMessageBody::PartialEncodedChunkResponse(response) => {
 + debug!(target: "network", "peer actor -> client actor chunk part {:?} {:?}",
 + &response.chunk_hash, response.parts.iter().map(|p| p.part_ord).collect::
 
 >());
 NetworkClientMessages::PartialEncodedChunkResponse(response)
 }
 
 
diff --git a/chain/client/src/client_actor.rs b/chain/client/src/client_actor.rs
 index 0d2b1c792..032ee2ab7 100644
 — a/chain/client/src/client_actor.rs
 +++ b/chain/client/src/client_actor.rs
 @@ -564,6 +564,7 @@ impl Handler
 for ClientActor 

 { NetworkClientResponses::NoResponse } 

 NetworkClientMessages::PartialEncodedChunkResponse(response) => {
 + debug!(target: "client", "recv partial chunk response {:?} {:?}", &response.chunk_hash, response.parts.iter().map(|p| p.part_ord).collect::
 
 
 >());
 if let Ok(accepted_blocks) =
 self.client.process_partial_encoded_chunk_response(response)
 {
 
 
 

We can see that there's big delay between when the peer actor passes the partial encoded chunk message to the client actor and the point when the client actor actually receives and processes it (sometimes seconds long):

Mar 15 20:15:57.277 DEBUG network: peer actor -> client actor chunk part ChunkHash(CooC7KRNfFsuUuEWaCAYeghCCujXZbm5LGUhEJmPWYm4) [52]
...
...
Mar 15 20:16:00.799 DEBUG client: recv partial chunk response ChunkHash(CooC7KRNfFsuUuEWaCAYeghCCujXZbm5LGUhEJmPWYm4) [52]

So the node ends up taking too long to sync even though there's nothing wrong with its block/chunk processing speed. Note that this is a node built from b949483 (plus the extra logging), which contains #6333. So that PR definitely helps, but doesnt fully get rid of this

In debugging this, it could be helpful to add metrics that show this delay between when messages are sent to the client actor and when they're received (unless calculating it is too invasive/expensive with the volume of messages coming in, not sure).

cc @nikurt @bowenwang1996 @mm-near

@bowenwang1996 bowenwang1996 added the C-bug Category: This is a bug label Mar 15, 2022
@mm-near
Copy link
Contributor

mm-near commented Mar 16, 2022

Yes - and the proper solution is to 'split' ClientActor into couple separate Actors/processes -- so that the runtime execution doesn't block it.

@nikurt nikurt self-assigned this Mar 16, 2022
@bowenwang1996
Copy link
Collaborator

@nikurt did you assign yourself to fix the underlying issue or to add metrics?

@nikurt nikurt removed their assignment Mar 16, 2022
@nikurt
Copy link
Contributor

nikurt commented Mar 16, 2022

@bowenwang1996 to add metrics.
Let me unassign myself to avoid confusion

@bowenwang1996
Copy link
Collaborator

@bowenwang1996 to add metrics. Let me unassign myself to avoid confusion

Could you create a separate issue for the metrics?

@marcelo-gonzalez
Copy link
Contributor Author

marcelo-gonzalez commented Mar 18, 2022

Could you create a separate issue for the metrics?

@bowenwang1996 @nikurt ok filed #6460

@stale
Copy link

stale bot commented Jun 17, 2022

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug
Projects
None yet
Development

No branches or pull requests

4 participants