
Network health checks and monitoring #12

Open
phahulin opened this issue Mar 26, 2018 · 13 comments

@phahulin
Contributor

phahulin commented Mar 26, 2018

Title

  Title: Network health checks and monitoring
  Layer: Service

Abstract

A system should be developed to check the health of the network from the Ethereum point of view.

Rationale

While it is possible to set up a monitoring system to check the health of individual nodes, it is also important to perform Ethereum-specific and consensus-specific health checks on the network as a whole.

Specification

A group of periodically running tests should be set up on both Sokol and Core.
Tests should be separated into individual modules/files and run independently on a schedule. It should be possible to set an individual schedule for each test.

Tests should include:

  • check if any validator nodes are missing rounds
  • check if the payout script works properly for all nodes by checking the mining address balance (a minimal balance-check sketch follows after this list)
  • periodically send a series of txs to check that all validator nodes are able to mine non-empty blocks
  • periodically send txs via public rpc endpoint
  • check for reorgs
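
For illustration, the payout/balance check from the list above could look roughly like the sketch below: a plain eth_getBalance JSON-RPC call per mining address, flagging balances under a threshold. The RPC URL, addresses and threshold are placeholders, and it assumes a recent node.js with global fetch.

// Hypothetical balance check: flag any mining address whose balance
// falls below a minimum. All constants here are placeholders.
const RPC_URL = 'http://127.0.0.1:8545';
const ADDRESSES = ['0x0000000000000000000000000000000000000000'];
const MIN_BALANCE_WEI = 10n ** 17n; // e.g. 0.1 POA

async function getBalanceWei(address) {
  const res = await fetch(RPC_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      jsonrpc: '2.0',
      id: 1,
      method: 'eth_getBalance',
      params: [address, 'latest'],
    }),
  });
  const { result } = await res.json(); // hex string, e.g. "0xde0b6b3a7640000"
  return BigInt(result);
}

async function main() {
  for (const address of ADDRESSES) {
    const balance = await getBalanceWei(address);
    const ok = balance >= MIN_BALANCE_WEI;
    console.log(`${address}: ${balance} wei ${ok ? 'OK' : 'LOW BALANCE'}`);
  }
}

main().catch((err) => { console.error(err); process.exit(1); });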

If any of the tests fails, a notification should be sent to the dev team.

Tests should be protected from starting a new run if the previous run has not completed yet.
Tests should have an enforced timeout and be killed if they don't complete within a certain time.

Test results should be saved to a database for later analysis.
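
A minimal sketch of a wrapper enforcing these rules in node.js follows; the lock file path, timeout value, sqlite table layout and the runTest() interface are assumptions for illustration, not part of this spec.

// Hypothetical guard around a single test module: skip the run if a
// previous one is still in progress (lock file), kill the test on
// timeout, and store the outcome in sqlite.
const fs = require('fs');
const sqlite3 = require('sqlite3');

const LOCK_FILE = '/tmp/health-check.lock';
const TIMEOUT_MS = 5 * 60 * 1000; // give up after 5 minutes
const db = new sqlite3.Database('results.db');

async function runWithGuards(testName, runTest) {
  if (fs.existsSync(LOCK_FILE)) {
    console.log(`${testName}: previous run still in progress, skipping`);
    return;
  }
  fs.writeFileSync(LOCK_FILE, String(process.pid));
  const startedAt = new Date().toISOString();
  let passed = 0;
  let error = null;
  let timer = null;
  try {
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('timeout')), TIMEOUT_MS);
    });
    await Promise.race([runTest(), timeout]);
    passed = 1;
  } catch (e) {
    error = e.message;
  } finally {
    clearTimeout(timer);
    fs.unlinkSync(LOCK_FILE);
  }
  // assumed table: results(test TEXT, started_at TEXT, passed INTEGER, error TEXT)
  db.run(
    'INSERT INTO results (test, started_at, passed, error) VALUES (?, ?, ?, ?)',
    [testName, startedAt, passed, error]
  );
}

// usage (hypothetical test module): runWithGuards('missed-rounds', require('./missedRounds').run);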

Implementation

Set up a new server on each network and deploy a full Parity node.
Run tests locally on cron. An account with a small amount of POA will be required to run the tests that send txs.
Save test results to an sqlite database.
Deploy a simple node.js web app with a single API endpoint to retrieve the latest test results from the database.
Set up a monitor on this API endpoint and send alerts to a Slack channel.
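
Sketched out, that single-endpoint web app could be as small as the following (a sketch only, assuming express and the same hypothetical results table as in the earlier example):

// Hypothetical API endpoint returning the latest test results as JSON.
const express = require('express');
const sqlite3 = require('sqlite3');

const app = express();
const db = new sqlite3.Database('results.db');

// GET /api/results?lastseconds=3600 -> results from the last hour
app.get('/api/results', (req, res) => {
  const lastSeconds = parseInt(req.query.lastseconds, 10);
  const since = Number.isFinite(lastSeconds)
    ? new Date(Date.now() - lastSeconds * 1000).toISOString()
    : '1970-01-01T00:00:00.000Z'; // no parameter: return everything
  db.all(
    'SELECT test, started_at, passed, error FROM results WHERE started_at >= ? ORDER BY started_at DESC',
    [since],
    (err, rows) => {
      if (err) return res.status(500).json({ error: err.message });
      res.json(rows);
    }
  );
});

app.listen(3000);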

@natlg

natlg commented May 18, 2018

Hi @phahulin,
I started working on this task. I'm almost done with the tests "check if any validator nodes are missing rounds" and "periodically send a series of txs to check that all validator nodes are able to mine non-empty blocks" (code is here ).
But I don't understand the last point: could you please explain what reorgs mean in the "check for reorgs" test?

@phahulin
Contributor Author

Hi, @Natalya11444

By reorgs I mean forks similar to https://etherscan.io/blocks_forked: events where a node has to rewrite its recent history because it received blocks from a "longer" chain. This one may be tricky to implement and needs some experimenting.

  • one way is to monitor parity logs directly for messages about reorgs:
2018-05-18 20:02:52  Reorg to #1088 0x9478…1f84 (0x191f…700c #1087 0x83b9…e4ca )

(just an example taken from my local setup, not from a real network)

  • another way is to keep hashes of the last N blocks (say N = 20) in memory and recheck them to see if any of them changed (a sketch of this approach follows after this list)

  • this one I haven't tested myself, so can't be sure if it actually works: use https://wiki.parity.io/JSONRPC-Eth-Pub-Sub-Module.html functionality and subscribe to newHeads event

  • maybe there's another way
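
A sketch of the second idea (keeping the last N block hashes and rechecking them) could look like this; the RPC URL, window size and polling interval are placeholders, and it assumes a recent node.js with global fetch.

// Hypothetical reorg check: remember the hash of each of the last N
// blocks and report if a later poll sees a different hash for the
// same block number.
const RPC_URL = 'http://127.0.0.1:8545';
const N = 20;
const knownHashes = new Map(); // block number -> hash seen earlier

async function rpc(method, params) {
  const res = await fetch(RPC_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ jsonrpc: '2.0', id: 1, method, params }),
  });
  return (await res.json()).result;
}

async function checkForReorg() {
  const latest = parseInt(await rpc('eth_blockNumber', []), 16);
  for (let n = Math.max(0, latest - N + 1); n <= latest; n++) {
    const block = await rpc('eth_getBlockByNumber', ['0x' + n.toString(16), false]);
    if (!block) continue;
    const previous = knownHashes.get(n);
    if (previous && previous !== block.hash) {
      console.log(`Reorg detected at block #${n}: ${previous} -> ${block.hash}`);
    }
    knownHashes.set(n, block.hash);
  }
  // forget entries that are now older than the window
  for (const n of knownHashes.keys()) {
    if (n < latest - N + 1) knownHashes.delete(n);
  }
}

setInterval(() => checkForReorg().catch(console.error), 10000);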


You can simulate reorgs on your local setup by using a simplified network with two validators, similar to this one.
On step 4 you start two parity nodes with different validators. If you let them run for some time, you'll see they're building blocks in parallel, since the two nodes don't yet know about each other:

Validator1 history: 12:00:00 Block1 --> 12:00:10 Block2 --> 12:00:20 Block3 --> ...
Validator2 history:      12:00:05 Block1 --> 12:00:15 Block2 --> 12:00:25 Block 3 --> ...

Then, when you call ./mate.sh, their enodes are exchanged and one of them will switch to the other's history. At that moment you'll see a Reorg event in the logs.

@natlg

natlg commented May 18, 2018

Thank you for the answer! I'll follow the guide after other tests and will let you know once I finish.

@natlg

natlg commented Jun 2, 2018

I deployed monitoring for tests 1-3 on the server. Please check how it works; I can then change it if needed.

The monitor runs on cron (every 30 minutes for now). It calls the web server and sends messages with the last failed tests for each network to the Slack channel.
I used a test channel; here is how the messages look:
https://1drv.ms/u/s!Au_4rxfmZk63grpvqjnQggqEVik38g

The web server returns test results as JSON.
For the Sokol network:
http://poatest.westus.cloudapp.azure.com:3000/sokol/api/failed?lastseconds=3600 returns failed tests for the last hour; "lastseconds" is an optional parameter, and without it all results from the database are returned.
http://poatest.westus.cloudapp.azure.com:3000/sokol/api/all?lastseconds=3600 returns both passed and failed test results.

For the Core network it's similar:
http://poatest.westus.cloudapp.azure.com:3000/core/api/failed

Tests also run via cron, and each test is in a separate file. They use command line arguments to detect which network to check; if no arguments are given, parameters from the toml file are used.
Tests save results to the sqlite database.
The test with txs runs on Sokol only, because I don't have an account with real POA yet.

Two parity nodes, one per network, run on the same server (they use different ports).

Here is what I plan to add:

  • The remaining tests
  • A timeout for tests and prevention of duplicate cron job executions
  • Probably a UI for test results, with a link to it in Slack messages
  • After the reward algorithm is changed (as in issue Block Reward emission by time #16), the test for the payout script will need to be updated.
  • Probably statistics for validator nodes (how many blocks are mined by each of them, how many blocks with txs, rewards)

Repository is here with some more information in the README.

@phahulin
Contributor Author

phahulin commented Jun 4, 2018

@Natalya11444 thanks! I'll check it out and let you know

@natlg

natlg commented Jun 6, 2018

OK, but I can't use the server right now; it will be available on June 9th or 10th.

@phahulin
Contributor Author

Hey @Natalya11444 I'm going through the code and it looks great so far, thank you for your work! Would you mind if I open issues/PRs in your repository with some suggestions?

@natlg

natlg commented Jun 10, 2018

@phahulin, thank you for checking it out, I'm glad you liked it! Sure, please add suggestions in the repository.

@natlg

natlg commented Jun 16, 2018

I've added a UI for test results, with search and filters: http://poatest.westus.cloudapp.azure.com:3001 , the repository is here.
The remaining tests are also implemented, and I added a timeout and a check for duplicate cron job executions.
Tests that send txs are not running on Core.

For now, tests can fail when some validators miss rounds. If that goes on for too long, the tests that send txs fail as well, because those validators don't include the txs in blocks within a few rounds. And when the validators come back, a reorg can happen.

@phahulin
Contributor Author

phahulin commented Jun 18, 2018

@Natalya11444 the UI looks great, thank you. I'll try to deploy the scripts and UI on our server.

@natlg

natlg commented Jun 18, 2018

@phahulin cool, I'll add more deployment instructions to the README. Please let me know if there are any issues.

@natlg

natlg commented Jun 21, 2018

I updated the bash scripts for running the tests; they were quite bulky. They can then be added to cron.
