Network health checks and monitoring #12
Comments
Hi @phahulin, |
Hi @Natalya11444, by reorgs I mean forks similar to https://etherscan.io/blocks_forked, that is, events where a node has to rewrite its recent history because it received blocks from a "longer" chain. This one may be tricky to implement and needs some experimenting.
(just an example taken from my local setup, not from a real network)
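One possible way to detect such events from a monitoring script, as a minimal Python sketch: remember the hash seen at each block height and flag any height whose hash later changes. The function name and the sample hashes below are made up for illustration; a real check would fetch the hashes from the node, and this is not how any existing tool does it.

```python
from typing import Dict, List, Tuple


def find_reorged_heights(
    seen: Dict[int, str], latest: List[Tuple[int, str]]
) -> List[int]:
    """Return heights whose block hash differs from the one recorded earlier."""
    reorged = []
    for height, block_hash in latest:
        previous = seen.get(height)
        if previous is not None and previous != block_hash:
            # The node replaced the block at this height: history was rewritten.
            reorged.append(height)
        # Remember the hash the node reports now for this height.
        seen[height] = block_hash
    return reorged


# Hashes recorded on an earlier poll (made-up values).
seen = {100: "0xaaa", 101: "0xbbb"}
# The node now reports a different block at height 101 and a new block 102.
latest = [(100, "0xaaa"), (101, "0xccc"), (102, "0xddd")]
print(find_reorged_heights(seen, latest))  # prints [101]
```

How many recent heights to keep and how often to poll are open choices that would need the experimenting mentioned above.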
You can simulate reorgs on your local setup by using a simplified network with two validators, similar to this one
then when you call |
Thank you for the answer! I'll follow the guide after other tests and will let you know once I finish. |
I deployed monitoring for tests 1-3 on the server. Please check how it works, and I can change it if needed. The monitor runs on cron (every 30 minutes for now). It calls the web server and sends messages with the latest failed tests for each network to the Slack channel. The web server returns test results as JSON; for the Core network it's similar. The tests also run via cron, and each test is in a separate file. They use command-line arguments to detect which network to check; if no arguments are passed, parameters from the toml file are used. Two parity nodes, one for each network, both run on the server (they use different ports). Here is what I plan to add:
Repository is here with some more information in the README. |
@Natalya11444 thanks! I'll check it out and let you know |
OK, but I can't use the server right now; it will be available on June 9th or 10th. |
Hey @Natalya11444 I'm going through the code and it looks great so far, thank you for your work! Would you mind if I open issues/PRs in your repository with some suggestions? |
@phahulin, thank you for checking it out, I'm glad you liked it! Sure, please add suggestions in the repository. |
I've added a UI for test results, with search and filters: http://poatest.westus.cloudapp.azure.com:3001 , repository is here. For now, tests can fail when some validators miss rounds. If they are absent for too long, the tests for sending txs fail as well if these validators don't create blocks with those txs within a few rounds. And when the validators return, a reorg can happen. |
@Natalya11444 the UI looks great, thank you. I'll try to deploy the scripts and UI on our server |
@phahulin cool, I'll add more deployment instructions to the README. Please let me know if there are any issues. |
I updated the bash scripts that run the tests; they were quite bulky. They can now be added to cron. |
Title
Abstract
A system should be developed to check the health of the network from an Ethereum point of view.
Rationale
While it is possible to set up a monitoring system to check the health of individual nodes of the network, it is also important to perform Ethereum-specific and consensus-specific health checks on the network as a whole.
Specification
A group of periodically running tests should be set up on both sokol and core. Tests should be separated into individual modules/files and run independently on a schedule. It should be possible to set an individual schedule for each test.
Tests should include:
If any of the tests fails, a notification should be sent to the dev team.
Tests should be protected from starting a new run if the previous run has not completed yet.
Each test should have an enforced timeout and be killed if it doesn't complete within a certain time.
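The two requirements above (no overlapping runs, enforced timeout) could be combined in a small wrapper. This is only a sketch in Python under assumptions: the function name, the lock-file approach, and the status strings are illustrative choices, not something the project prescribes.

```python
import os
import subprocess
import sys
import tempfile
from typing import List


def run_once_with_timeout(cmd: List[str], lock_path: str, timeout_s: float) -> str:
    """Run cmd unless a previous run still holds the lock; kill it on timeout."""
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return "skipped"  # the previous run has not completed yet
    try:
        os.close(fd)
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        result = subprocess.run(cmd, timeout=timeout_s)
        return "passed" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        return "timeout"
    finally:
        # Release the lock so the next scheduled run can start.
        os.remove(lock_path)


# Example: a trivial "test" that exits successfully right away.
lock = os.path.join(tempfile.mkdtemp(), "example_test.lock")
print(run_once_with_timeout([sys.executable, "-c", "pass"], lock, timeout_s=10))
```

One known weakness of this approach: if the wrapper itself is killed abruptly, the lock file stays behind and has to be cleaned up by hand or by an age check.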
Test results should be saved to a database for later analysis.
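As an illustration of what saving results could look like, here is a minimal Python sketch using the standard sqlite3 module. The table name, columns, and test names are assumptions made for this example, not a schema taken from the project.

```python
import sqlite3
import time
from typing import Dict, List, Optional


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) results table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS test_runs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               network TEXT NOT NULL,
               test_name TEXT NOT NULL,
               passed INTEGER NOT NULL,
               details TEXT,
               finished_at REAL NOT NULL
           )"""
    )


def save_result(conn: sqlite3.Connection, network: str, test_name: str,
                passed: bool, details: str = "",
                finished_at: Optional[float] = None) -> None:
    """Append one test run; older rows are kept for later analysis."""
    when = finished_at if finished_at is not None else time.time()
    conn.execute(
        "INSERT INTO test_runs (network, test_name, passed, details, finished_at)"
        " VALUES (?, ?, ?, ?, ?)",
        (network, test_name, int(passed), details, when),
    )
    conn.commit()


def latest_results(conn: sqlite3.Connection, network: str) -> List[Dict]:
    """Return the most recent run of each test for one network."""
    rows = conn.execute(
        """SELECT test_name, passed, details FROM test_runs
           WHERE id IN (SELECT MAX(id) FROM test_runs
                        WHERE network = ? GROUP BY test_name)
           ORDER BY test_name""",
        (network,),
    ).fetchall()
    return [{"test": n, "passed": bool(p), "details": d} for n, p, d in rows]


conn = sqlite3.connect(":memory:")  # a file path would be used on the server
init_db(conn)
save_result(conn, "sokol", "send_tx", True)
save_result(conn, "sokol", "send_tx", False, "tx not mined in time")
print(latest_results(conn, "sokol"))
```

Keeping every run as its own row, rather than overwriting the last status, is what makes later analysis possible.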
Implementation
Set up a new server on each network and deploy a full parity node.
Run tests locally on cron. An account with some small amount of POA will be required to run tests with txs.
Save test results to a sqlite database.
Deploy a simple node.js web app with a single API endpoint to retrieve the latest test results from the database.
Set up a monitor on this API endpoint and send alerts to a Slack channel.
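The monitor step could be sketched roughly as follows, in Python for brevity even though the web app above is node.js. The JSON shape (network name mapped to a list of results with "test" and "passed" fields) and both function names are assumptions for illustration; the actual endpoint may return something different. Posting the text to Slack is left out.

```python
import json
from typing import Dict, List


def failed_tests(payload: str) -> Dict[str, List[str]]:
    """Map each network to the names of its tests whose latest run failed."""
    data = json.loads(payload)
    failures: Dict[str, List[str]] = {}
    for network, results in data.items():
        names = [r["test"] for r in results if not r["passed"]]
        if names:
            failures[network] = names
    return failures


def alert_text(failures: Dict[str, List[str]]) -> str:
    """Build one alert line per network; empty string means nothing to send."""
    lines = [
        f"{network}: failed {', '.join(names)}"
        for network, names in sorted(failures.items())
    ]
    return "\n".join(lines)


# Hypothetical response body from the API endpoint.
payload = json.dumps({
    "sokol": [{"test": "send_tx", "passed": False},
              {"test": "missing_rounds", "passed": True}],
    "core": [{"test": "send_tx", "passed": True}],
})
print(alert_text(failed_tests(payload)))  # prints "sokol: failed send_tx"
```

Returning an empty string when everything passes lets the cron job skip sending a message on healthy runs.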