Monitoring & Alerting guide (#231)

Signed-off-by: haroldsphinx <adedayoakinpelu@gmail.com>
haroldsphinx authored Jul 18, 2023
1 parent b808815 commit 03575ad
Showing 2 changed files with 192 additions and 2 deletions.
97 changes: 96 additions & 1 deletion docs/int/quickstart/advanced/monitoring-credentials.md

---
sidebar_position: 4
description: Add monitoring credentials to help the Obol Team monitor the health of your cluster
---
# Getting Started Monitoring your Node

Welcome to this comprehensive guide, designed to assist you in effectively monitoring your Charon cluster and nodes, and setting up alerts based on specified parameters.

## Pre-requisites

Ensure the following software is installed:

- Docker: Find the installation guide for Ubuntu **[here](https://docs.docker.com/engine/install/ubuntu/)**
- Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)**
- Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana
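
If you prefer to run the stack with Docker instead of native installs, a minimal `docker-compose.yml` sketch along these lines can serve as a starting point (image tags, published ports, and the mounted `prometheus.yml` path are assumptions, not Obol-provided defaults):

```yaml
# Hypothetical local monitoring stack — adjust images, ports, and paths to your setup
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml   # assumed local config path
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
```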

## Push metrics to Obol Monitoring

Example `prometheus.yml` scrape configuration for node-exporter:

```yaml
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
```

After editing the file, restart Prometheus (or send it a `SIGHUP`) so the new scrape job takes effect.

## Import Pre-Configured Charon Dashboards

- Navigate to the **[repository](https://github.com/ObolNetwork/monitoring/tree/main/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will use the Charon Dashboard JSON.
- In your Grafana interface, create a new dashboard and select the import option.
- Copy the content of the Charon Dashboard JSON from the repository and paste it into the import box in Grafana. Click "Load" to proceed.
- Finalize the import by clicking the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur.

## Example alerting rules

To create alerts for Node-Exporter, follow these steps based on the sample rules provided on the "Awesome Prometheus alerts" page:

1. Visit the **[Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/rules.html#host-and-hardware)** page. Here, you will find lists of Prometheus alerting rules categorized by hardware, system, and services.

2. Depending on your needs, select the category of alerts. For example, if you want to set up alerts for your system's CPU usage, click on 'CPU' under the 'Host & Hardware' category.

3. On the selected page, you'll find specific alert rules like 'High CPU Usage'. Each rule will provide the PromQL expression, alert name, and a brief description of what the alert does. You can copy these rules.

4. Paste the copied rules into a rules file referenced by the `rule_files` section of your Prometheus configuration, as shown in the sketch after this list. Make sure you understand each rule before adding it to avoid unnecessary alerts.

5. Finally, save and apply the configuration file. Prometheus should now trigger alerts based on these rules.
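
For reference, a representative rules file might look like the sketch below. The `HostHighCpuLoad` expression is adapted from the Awesome Prometheus alerts collection; the file name, threshold, and labels are illustrative:

```yaml
# rules/node-exporter.rules.yml — illustrative example, tune the threshold to your baseline
groups:
  - name: node-exporter
    rules:
      - alert: HostHighCpuLoad
        # Fires when average CPU usage across all cores stays above 80% for 10 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
```

Point Prometheus at the file with a `rule_files` entry in `prometheus.yml` (e.g. `rule_files: ["rules/*.rules.yml"]`), then reload or restart Prometheus.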


For alerts specific to Charon/Alpha, refer to the alerting rules available in the [ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules) repository.

## Understanding Alert rules

1. `ClusterBeaconNodeDown`: This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks; its unavailability could disrupt the overall functionality of the cluster.
2. `ClusterBeaconNodeSyncing`: This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks.
3. `ClusterNodeDown`: This alert is activated when a node in a specified Alpha cluster is offline.
4. `ClusterMissedAttestations`: This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster.
5. `ClusterInUnknownStatus`: This alert activates when a node within the cluster is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the `app_monitoring_readyz` metric is 0.
6. `ClusterInsufficientPeers`: This alert activates when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of `app_monitoring_readyz` equals 4.
7. `ClusterFailureRate`: This alert is activated when the failure rate of the Alpha M1 Cluster #1 exceeds a certain threshold.
8. `ClusterVCMissingValidators`: This alert is activated if any validators in the Alpha M1 Cluster #1 are missing.
9. `ClusterHighPctFailedSyncMsgDuty`: This alert fires if a high percentage of sync message duties failed in the cluster: the increase in failed duties tagged "sync_message" over the last hour, divided by the increase in total duties tagged "sync_message" over the same hour, exceeds 0.1.
10. `ClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the cluster falls to 0.
11. `PeerPingLatency`: This alert is activated if the 90th percentile of ping latency to a cluster's peers exceeds 500ms within 2 minutes.
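
The authoritative definitions for these rules live in the ObolNetwork/monitoring repository linked above. As a hedged sketch of the general shape such a rule takes — using only the `app_monitoring_readyz` condition described in item 5, with an illustrative duration and severity — it might look like:

```yaml
# Illustrative sketch only — see ObolNetwork/monitoring for the actual alerting rules
groups:
  - name: charon-cluster
    rules:
      - alert: ClusterInUnknownStatus
        # Per the description above: a readyz value of 0 means the node is in an unknown state
        expr: max(app_monitoring_readyz) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Charon node is in an unknown state"
```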

## Best Practices for Monitoring Charon Nodes & Cluster

- **Establish Baselines**: Familiarize yourself with normal operating values for metrics like CPU, memory, and network usage. This will help you detect anomalies.
- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones.
- **Configure Alerts**: Based on these metrics, set up actionable alerts.
- **Monitor Network**: Regularly assess the connectivity between nodes and the network.
- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters.
- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities.
- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized.
- **Automate Monitoring**: Use automation to ensure no issues go undetected.
- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup.
- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions.

## Third-Party Services for Uptime Testing

- [updown.io](https://updown.io/)
- [Grafana Synthetic Monitoring](https://grafana.com/grafana/plugins/grafana-synthetic-monitoring-app/)

## Key metrics to watch to verify node health, by scrape job

### Node Exporter

**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should.

**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning.

**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or be a sign of high load on the system.

**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests.

**Disk Space**: Running out of disk space can lead to application errors and data loss.

**Uptime**: The amount of time a system has been up without any restarts. Frequent restarts can indicate instability in the system.

**Error Rates**: The number of errors encountered by your application. This could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log.

**Latency**: The delay before a transfer of data begins following an instruction for its transfer.
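
As a sketch, the first four of these system metrics can be tracked with standard node-exporter queries, expressed here as Prometheus recording rules (rule names are illustrative):

```yaml
groups:
  - name: node-health-recording
    rules:
      - record: instance:cpu_usage:percent
        # 100% minus the idle share, averaged across all cores per instance
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:memory_usage:percent
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
      - record: instance:disk_read_bytes:rate5m
        expr: rate(node_disk_read_bytes_total[5m])
      - record: instance:network_receive_bytes:rate5m
        expr: rate(node_network_receive_bytes_total[5m])
```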

It is also important to check:

- NTP clock skew
- Process restarts and failures (e.g. via `node_systemd`)
- High error and panic log counts

A sketch of matching alert rules follows this list.
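
Two of these checks expressed with standard node-exporter metrics (thresholds are illustrative):

```yaml
groups:
  - name: host-health
    rules:
      - alert: HostOutOfDiskSpace
        # Fires when a filesystem has had less than 10% free space for 5 minutes
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }} ({{ $labels.mountpoint }})"
      - alert: HostClockSkew
        # Fires when the NTP-reported clock offset exceeds 50ms in either direction
        expr: abs(node_timex_offset_seconds) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Clock skew detected on {{ $labels.instance }}"
```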