
Network health checks and monitoring #12

Open
phahulin opened this issue Mar 26, 2018 · 13 comments

@phahulin
Contributor

phahulin commented Mar 26, 2018

Title

  Title: Network health checks and monitoring
  Layer: Service

Abstract

A system should be developed to check the health of the network from the Ethereum point of view.

Rationale

While it is possible to set up a monitoring system to check the health of individual nodes, it is also important to perform Ethereum-specific and consensus-specific health checks on the network as a whole.

Specification

A group of periodically running tests should be set up on both Sokol and Core.
Tests should be separated into individual modules/files and run independently on a schedule. It should be possible to set an individual schedule for each test.

Tests should include:

  • check if any validator nodes are missing rounds
  • check if the payout script works properly for all nodes by checking the mining address balance (a minimal balance-check sketch follows after this list)
  • periodically send a series of txs to check that all validator nodes are able to mine non-empty blocks
  • periodically send txs via public rpc endpoint
  • check for reorgs
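
For illustration, the payout/balance check from the list above could look roughly like the sketch below: a plain eth_getBalance JSON-RPC call per mining address, flagging balances under a threshold. The RPC URL, addresses and threshold are placeholders, and it assumes a recent node.js with global fetch.

// Hypothetical balance check: flag any mining address whose balance
// falls below a minimum. All constants here are placeholders.
const RPC_URL = 'http://127.0.0.1:8545';
const ADDRESSES = ['0x0000000000000000000000000000000000000000'];
const MIN_BALANCE_WEI = 10n ** 17n; // e.g. 0.1 POA

async function getBalanceWei(address) {
  const res = await fetch(RPC_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      jsonrpc: '2.0',
      id: 1,
      method: 'eth_getBalance',
      params: [address, 'latest'],
    }),
  });
  const { result } = await res.json(); // hex string, e.g. "0xde0b6b3a7640000"
  return BigInt(result);
}

async function main() {
  for (const address of ADDRESSES) {
    const balance = await getBalanceWei(address);
    const ok = balance >= MIN_BALANCE_WEI;
    console.log(`${address}: ${balance} wei ${ok ? 'OK' : 'LOW BALANCE'}`);
  }
}

main().catch((err) => { console.error(err); process.exit(1); });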

If any of the tests fails, a notification should be sent to the dev team.

Tests should be protected from starting a new run if the previous run has not completed yet.
Tests should have an enforced timeout and be killed if they don't complete within a certain time.

Test results should be saved to a database for later analysis.
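
A minimal sketch of a wrapper enforcing these rules in node.js follows; the lock file path, timeout value, sqlite table layout and the runTest() interface are assumptions for illustration, not part of this spec.

// Hypothetical guard around a single test module: skip the run if a
// previous one is still in progress (lock file), kill the test on
// timeout, and store the outcome in sqlite.
const fs = require('fs');
const sqlite3 = require('sqlite3');

const LOCK_FILE = '/tmp/health-check.lock';
const TIMEOUT_MS = 5 * 60 * 1000; // give up after 5 minutes
const db = new sqlite3.Database('results.db');

async function runWithGuards(testName, runTest) {
  if (fs.existsSync(LOCK_FILE)) {
    console.log(`${testName}: previous run still in progress, skipping`);
    return;
  }
  fs.writeFileSync(LOCK_FILE, String(process.pid));
  const startedAt = new Date().toISOString();
  let passed = 0;
  let error = null;
  let timer = null;
  try {
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('timeout')), TIMEOUT_MS);
    });
    await Promise.race([runTest(), timeout]);
    passed = 1;
  } catch (e) {
    error = e.message;
  } finally {
    clearTimeout(timer);
    fs.unlinkSync(LOCK_FILE);
  }
  // assumed table: results(test TEXT, started_at TEXT, passed INTEGER, error TEXT)
  db.run(
    'INSERT INTO results (test, started_at, passed, error) VALUES (?, ?, ?, ?)',
    [testName, startedAt, passed, error]
  );
}

// usage (hypothetical test module): runWithGuards('missed-rounds', require('./missedRounds').run);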

Implementation

Set up a new server on each network and deploy a full Parity node.
Run tests locally on cron. An account with a small amount of POA will be required to run the tests that send txs.
Save test results to an sqlite database.
Deploy a simple node.js web app with a single API endpoint to retrieve the latest test results from the database.
Set up a monitor on this API endpoint and send alerts to a Slack channel.
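
Sketched out, that single-endpoint web app could be as small as the following (a sketch only, assuming express and the same hypothetical results table as in the earlier example):

// Hypothetical API endpoint returning the latest test results as JSON.
const express = require('express');
const sqlite3 = require('sqlite3');

const app = express();
const db = new sqlite3.Database('results.db');

// GET /api/results?lastseconds=3600 -> results from the last hour
app.get('/api/results', (req, res) => {
  const lastSeconds = parseInt(req.query.lastseconds, 10);
  const since = Number.isFinite(lastSeconds)
    ? new Date(Date.now() - lastSeconds * 1000).toISOString()
    : '1970-01-01T00:00:00.000Z'; // no parameter: return everything
  db.all(
    'SELECT test, started_at, passed, error FROM results WHERE started_at >= ? ORDER BY started_at DESC',
    [since],
    (err, rows) => {
      if (err) return res.status(500).json({ error: err.message });
      res.json(rows);
    }
  );
});

app.listen(3000);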

@natlg

natlg commented May 18, 2018

Hi @phahulin,
I started working on this task. I'm almost done with the tests "check if any validator nodes are missing rounds" and "periodically send a series of txs to check that all validator nodes are able to mine non-empty blocks" (code is here ).
But I don't understand the last point: could you please explain what reorgs mean in the "check for reorgs" test?

@phahulin
Contributor Author

Hi, @Natalya11444

By reorgs I mean forks similar to https://etherscan.io/blocks_forked: events where a node has to rewrite its recent history because it received blocks from a "longer" chain. This one may be tricky to implement and needs some experimenting.

  • one way is to monitor parity logs directly for messages about reorgs:
2018-05-18 20:02:52  Reorg to #1088 0x9478…1f84 (0x191f…700c #1087 0x83b9…e4ca )

(just an example taken from my local setup, not from a real network)

  • another way is to keep hashes of the last N blocks (say N = 20) in memory and recheck them to see if any of them changed (a sketch of this approach follows after this list)

  • this one I haven't tested myself, so can't be sure if it actually works: use https://wiki.parity.io/JSONRPC-Eth-Pub-Sub-Module.html functionality and subscribe to newHeads event

  • maybe there's another way
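
A sketch of the second idea (keeping the last N block hashes and rechecking them) could look like this; the RPC URL, window size and polling interval are placeholders, and it assumes a recent node.js with global fetch.

// Hypothetical reorg check: remember the hash of each of the last N
// blocks and report if a later poll sees a different hash for the
// same block number.
const RPC_URL = 'http://127.0.0.1:8545';
const N = 20;
const knownHashes = new Map(); // block number -> hash seen earlier

async function rpc(method, params) {
  const res = await fetch(RPC_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ jsonrpc: '2.0', id: 1, method, params }),
  });
  return (await res.json()).result;
}

async function checkForReorg() {
  const latest = parseInt(await rpc('eth_blockNumber', []), 16);
  for (let n = Math.max(0, latest - N + 1); n <= latest; n++) {
    const block = await rpc('eth_getBlockByNumber', ['0x' + n.toString(16), false]);
    if (!block) continue;
    const previous = knownHashes.get(n);
    if (previous && previous !== block.hash) {
      console.log(`Reorg detected at block #${n}: ${previous} -> ${block.hash}`);
    }
    knownHashes.set(n, block.hash);
  }
  // forget entries that are now older than the window
  for (const n of knownHashes.keys()) {
    if (n < latest - N + 1) knownHashes.delete(n);
  }
}

setInterval(() => checkForReorg().catch(console.error), 10000);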


You can simulate reorgs on your local setup by using a simplified network with two validators, similar to this one.
On step 4 you start two parity nodes with different validators. If you let them run for some time, you'll see they're building blocks in parallel, since the two nodes don't yet know about each other:

Validator1 history: 12:00:00 Block1 --> 12:00:10 Block2 --> 12:00:20 Block3 --> ...
Validator2 history:      12:00:05 Block1 --> 12:00:15 Block2 --> 12:00:25 Block 3 --> ...

Then, when you call ./mate.sh, their enodes are exchanged and one of them will switch to the other's history. At that moment you'll see a Reorg event in the logs.

@natlg

natlg commented May 18, 2018

Thank you for the answer! I'll follow the guide after other tests and will let you know once I finish.

@natlg

natlg commented Jun 2, 2018

I deployed monitoring for tests 1-3 on the server. Please check how it works; I can then change it if needed.

The monitor runs on cron (every 30 minutes for now). It calls the web server and sends messages with the last failed tests for each network to the Slack channel.
I used a test channel; here is how the messages look:
https://1drv.ms/u/s!Au_4rxfmZk63grpvqjnQggqEVik38g

The web server returns test results as JSON.
For the Sokol network:
http://poatest.westus.cloudapp.azure.com:3000/sokol/api/failed?lastseconds=3600 returns failed tests for the last hour; "lastseconds" is an optional parameter, and without it all results from the database are returned.
http://poatest.westus.cloudapp.azure.com:3000/sokol/api/all?lastseconds=3600 returns both passed and failed test results.

For the Core network it's similar:
http://poatest.westus.cloudapp.azure.com:3000/core/api/failed

Tests also run via cron, and each test is in a separate file. They use command line arguments to detect which network to check; if no arguments are given, parameters from the toml file are used.
Tests save results to the sqlite database.
The test with txs runs on Sokol only, because I don't have an account with real POA yet.

Two parity nodes, one per network, run on the same server (they use different ports).

Here is what I plan to add:

  • The remaining tests
  • A timeout for tests and prevention of duplicate cron job executions
  • Probably a UI for test results, with a link to it in Slack messages
  • After the reward algorithm is changed (as in issue Block Reward emission by time #16), the test for the payout script will need to be updated.
  • Probably statistics for validator nodes (how many blocks are mined by each of them, how many blocks with txs, rewards)

Repository is here with some more information in the README.

@phahulin
Contributor Author

phahulin commented Jun 4, 2018

@Natalya11444 thanks! I'll check it out and let you know

@natlg

natlg commented Jun 6, 2018

OK, but I can't use the server right now; it will be available on June 9th or 10th.

@phahulin
Contributor Author

Hey @Natalya11444 I'm going through the code and it looks great so far, thank you for your work! Would you mind if I open issues/PRs in your repository with some suggestions?

@natlg

natlg commented Jun 10, 2018

@phahulin, thank you for checking it out, I'm glad you liked it! Sure, please add suggestions in the repository.

@natlg

natlg commented Jun 16, 2018

I've added a UI for test results, with search and filters: http://poatest.westus.cloudapp.azure.com:3001 , the repository is here.
The remaining tests are also implemented, and I added a timeout and a check for duplicate cron job executions.
Tests that send txs are not running on Core.

For now, tests can fail when some validators miss rounds. If that goes on for too long, the tests that send txs fail as well, because those validators don't include the txs in blocks within a few rounds. And when the validators come back, a reorg can happen.

@phahulin
Contributor Author

phahulin commented Jun 18, 2018

@Natalya11444 the UI looks great, thank you. I'll try to deploy the scripts and UI on our server.

@natlg

natlg commented Jun 18, 2018

@phahulin cool, I'll add more deployment instructions to the README. Please let me know if there are any issues.

@natlg

natlg commented Jun 21, 2018

I updated the bash scripts for running the tests; they were quite bulky. They can then be added to cron.
