[doc] Monitoring and Auto-mitigating the unhealthy of docker containers in SONiC #564

Open
wants to merge 57 commits into
base: master
15c53ce
[monitoring] Add a document to provide the details about the monitoring
yozhao101 Feb 18, 2020
689c5a7
[Monitoring] Add an item in the section of overview.
yozhao101 Feb 19, 2020
2b31fef
[Moniting] Add functional requirements.
yozhao101 Feb 19, 2020
6a2c01a
[Monitoring] Add section of design overview.
yozhao101 Feb 19, 2020
ac56da8
[Monitoring] add section of design overview.
yozhao101 Feb 19, 2020
e294a9c
[Monitoring] Add section of design overview.
yozhao101 Feb 19, 2020
9546882
[Monitoring] Add introduction for auto-restart feature in overview.
yozhao101 Feb 19, 2020
8f157ec
[Monitoring] Add the section of basic approach.
yozhao101 Feb 19, 2020
752dad0
[Monitoring] Add paragraph in section of basic approach.
yozhao101 Feb 19, 2020
38d6cab
[Monitoring] Add description in the section of feature overview.
yozhao101 Feb 19, 2020
df37188
[Monitoring] Delete some extra blank lines.
yozhao101 Feb 19, 2020
6d04987
[Monitoring] Reword in the feature overview.
yozhao101 Feb 19, 2020
9724d9e
[Monitoring] Add a section of use cases.
yozhao101 Feb 19, 2020
fe17999
[Monitoring] Add section of Monitoring Critical Processes.
yozhao101 Feb 19, 2020
5d3bdfa
[Moniting] Add a section about monitoring the critical process.
yozhao101 Feb 19, 2020
c948aa2
[Monitoring] Add a section of monitoring critical resources.
yozhao101 Feb 19, 2020
c5c0191
[Monitoring] Add a section of auto-restart docker container.
yozhao101 Feb 20, 2020
4023874
[Monitoring] Correct the hyper-link.
yozhao101 Feb 20, 2020
7a84612
[Monitoring] Correct the typo in the hyper-link.
yozhao101 Feb 20, 2020
9941852
[Monitoring] Correct a typo in the hyper-link.
yozhao101 Feb 20, 2020
58c1f79
[Monitoring] Add a hyper-link for container feature table.
yozhao101 Feb 20, 2020
da03448
[Monitoring] Reword the sentence in the section of feature overview.
yozhao101 Feb 20, 2020
9884fc2
[Monitoring] Reword the sentences in the section of auto-restart
yozhao101 Feb 21, 2020
e0f0d96
[Doc-Monitoring] Reword the title and the section of feature overview.
yozhao101 Feb 24, 2020
1dc3a96
[Doc-monitoring] Reworded the sentences and fixed the typo.
yozhao101 Feb 24, 2020
0124b94
[Doc-monitoring] Reword and correct the typos.
yozhao101 Feb 24, 2020
0774344
[Doc-monitoring] Revised the functional requirement.
yozhao101 Feb 24, 2020
a852c35
[Doc-monitoring] Reword the basic approach.
yozhao101 Feb 24, 2020
a5d094b
[Doc-monitoring] Reworded basic approach and fix the typos.
yozhao101 Feb 24, 2020
93826e4
[Doc-monitoring] Correct the typo of supervisord.
yozhao101 Feb 24, 2020
0e84f87
[Doc-monitoring] When a process changes from running to exited, the
yozhao101 Feb 24, 2020
965fc61
[Doc-monitoring] Reword the mechanism of event listener to 'event
yozhao101 Feb 24, 2020
5c69e6e
[Doc-monitoring] Correct a typo and remove the init_cfg.json in line 90
yozhao101 Feb 25, 2020
a040c34
[Doc-monitoring] Reword the gives to provides in line 101.
yozhao101 Feb 25, 2020
a28459a
[Doc-monitoring] Reword the sentence "we emplyed 'event listener'
yozhao101 Feb 25, 2020
8b270be
[Doc-monitoring] Reword the line 68 to we leveraged the 'event listener'
yozhao101 Feb 25, 2020
7710e1b
[Doc-monitoring] Add the proposed section for memory, cpu and disk
yozhao101 Feb 25, 2020
6d73a9d
[Doc-monitoring] Add a section for the new proposal resource alerting.
yozhao101 Feb 25, 2020
d4c4fd4
[Doc-monitoring] Place the value of memory threshold in section 2.5.
yozhao101 Feb 25, 2020
f36f5ef
[Doc-monitoring] Reorganize the sections 2.2.3 and 2.2.4.
yozhao101 Feb 25, 2020
2edc3d4
[Doc-monitoring] Reword the section 2.2.2.
yozhao101 Feb 25, 2020
d041d78
[Doc-monitoring] Reword in the section 2.2.4 Monitoring Critical
yozhao101 Feb 25, 2020
f94d019
[Monitoring] Add a section to describe the relationship between
yozhao101 Mar 8, 2020
02bd31c
[Monitoring] Add a word "same" in the last sentence of section 2.2.1
yozhao101 Mar 8, 2020
fa20bea
[Monitrong] Reword the section 1.3.3.
yozhao101 Mar 9, 2020
e4d9a8d
[Monitoring] Correct a commection symbol.
yozhao101 Mar 9, 2020
7c917c0
[Monitoring] Fix a error for connection symbol.
yozhao101 Mar 9, 2020
3fad48f
[Monitoring] Swap the location of section 2.2.2 and section 2.2.3.
yozhao101 Mar 10, 2020
eb30432
[Monitoring] Correct a typo.
yozhao101 Mar 10, 2020
a84bfdf
[Monitoring] Delete an extra space.
yozhao101 Mar 10, 2020
8a908c2
[Monitoring] Delete the file which is added mistakenly.
yozhao101 Mar 10, 2020
7056a9f
[memory_restart] Add the description of monitoring the critical process
yozhao101 Jul 22, 2021
9b30502
[memory_restart] Fix the format issue.
yozhao101 Jul 22, 2021
dc80bcb
[memory_restart] Fix the format issues.
yozhao101 Jul 22, 2021
7ed89b7
[memory_restart] Change the syntax of `show` and `config` commands.
yozhao101 Jul 22, 2021
702e4d8
[mem_restart] Fix the typos.
yozhao101 Jul 22, 2021
91c5d9b
[mem_restart] Fix the typos.
yozhao101 Jul 22, 2021
276 changes: 276 additions & 0 deletions doc/monitoring_containers/monitoring_containers.md
@@ -0,0 +1,276 @@
# Monitoring and Auto-Mitigating Unhealthy Docker Containers in SONiC

# High Level Design Document
#### Rev 0.1

# Table of Contents
* [List of Tables](#list-of-tables)
* [Revision](#revision)
* [About this Manual](#about-this-manual)
* [Scope](#scope)
* [Definitions/Abbreviation](#definitionsabbreviation)
* [1 Feature Overview](#1-feature-overview)
    - [1.1 Requirements](#11-requirements)
        - [1.1.1 Functional Requirements](#111-functional-requirements)
        - [1.1.2 Configuration and Management Requirements](#112-configuration-and-management-requirements)
        - [1.1.3 Scalability Requirements](#113-scalability-requirements)
    - [1.2 Design](#12-design)
        - [1.2.1 Basic Approach](#121-basic-approach)
* [2 Functionality](#2-functionality)
    - [2.1 Target Deployment Use Cases](#21-target-deployment-use-cases)
    - [2.2 Functional Description](#22-functional-description)
        - [2.2.1 Monitoring Critical Processes](#221-monitoring-critical-processes)
        - [2.2.2 Monitoring Critical Resource Usage](#222-monitoring-critical-resource-usage)
        - [2.2.3 Auto-restart Docker Container](#223-auto-restart-docker-container)
            - [2.2.3.1 CLI (and usage example)](#2231-cli-and-usage-example)
                - [2.2.3.1.1 Show the Status of Auto-restart](#22311-show-the-status-of-auto-restart)
                - [2.2.3.1.2 Configure the Status of Auto-restart](#22312-configure-the-status-of-auto-restart)
                - [2.2.3.1.3 CONTAINER_FEATURE Table](#22313-container_feature-table)

# List of Tables
* [Table 1: Abbreviations](#definitionsabbreviation)

# Revision
| Rev | Date | Author | Change Description |
|:---:|:----------:|:----------------------:|---------------------------|
| 0.1 | 02/18/2020 | Yong Zhao, Joe Leveque | Initial version |

# About this Manual
This document presents the design and implementation of a feature that monitors and
auto-mitigates unhealthy docker containers in SONiC.

# Scope
This document describes the high-level design of a feature that monitors and
auto-mitigates unhealthy docker containers.

# Definitions/Abbreviation
| Abbreviation | Description |
|--------------|------------------------------|
| Config DB | SONiC Configuration Database |
| CLI | Command Line Interface |

# 1 Feature Overview
SONiC is a collection of switch applications that run in docker containers,
such as the BGP container and the SNMP container. Each application usually includes several processes that
work together to provide services to, and receive services from, other modules. As such, the health of
the critical processes in each docker container is key not only to the docker
container working correctly but also to the intended functionality of the entire SONiC switch.
In addition, profiling the resource usage and performance of each docker
container helps us understand whether the container is in a healthy state
and gives us deeper insight into networking traffic.

This feature has two main purposes. The first is to monitor the
running status of each process and the critical resource usage, such as CPU, memory and disk,
of each docker container.
The second is to automatically shut down and restart a docker container
if one of the critical processes running in it exits unexpectedly. Restarting
the entire container ensures that configuration is reloaded and all processes in the container
get restarted, thus increasing the likelihood of entering a healthy state.

We implemented this feature by employing the existing Monit and supervisord system tools.
1. We used the Monit system tool to detect whether a process is running and whether
   the resource usage of a docker container exceeds the pre-defined threshold.
2. We leveraged the event listener mechanism in supervisord to auto-restart a docker container
   if one of its critical processes exits unexpectedly.
3. We also added a knob to make this auto-restart feature dynamically configurable.
   Specifically, users can run CLI commands to set the feature's status, which is stored in
   Config_DB, to enabled or disabled.

## 1.1 Requirements

### 1.1.1 Functional Requirements
1. Monit must provide the ability to generate an alert when a critical process is not
   running.
2. Monit must provide the ability to generate an alert when the resource usage of
   a docker container is larger than the pre-defined threshold.
3. The event listener in supervisord must be notified when a critical process in
   a docker container crashes or exits unexpectedly, and must then restart this docker
   container.
4. CONFIG_DB can be configured to enable/disable this auto-restart feature for each docker
   container.
5. Users can access this auto-restart information via the CLI utility:
    1. Users can see the current auto-restart status for docker containers.
    2. Users can configure the auto-restart status for a specific docker container.

### 1.1.2 Configuration and Management Requirements
Configuration of the auto-restart feature can be done via:
1. init_cfg.json
2. CLI

### 1.1.3 Scalability Requirements
`Place holder`

## 1.2 Design

### 1.2.1 Basic Approach
Monitoring the running status of critical processes and the resource usage of docker containers
depends heavily on the Monit system tool. Since Monit already provides a mechanism
to check whether a process is running, it is straightforward to use it to monitor
the critical processes in SONiC. However, Monit only provides a way to monitor resource
usage at the process level, not at the container level. As such, monitoring the resource usage of a docker
container is a more challenging problem. In our design,
Monit checks the exit status of a script that reads the resource usage of a docker
container, compares it with the pre-defined threshold and then exits.

We employed the event listener mechanism in supervisord to auto-restart docker
containers. Currently supervisord monitors the running status of each process in SONiC
docker containers. If a critical process exits unexpectedly, supervisord detects this
and sends an event notification to the event listener. The event listener then kills the supervisord process, and
the entire docker container is shut down and restarted.

# 2 Functionality
## 2.1 Target Deployment Use Cases
This feature is used to perform the following functions:
1. Monit will write an alert message into syslog if one of the critical processes exits unexpectedly.
2. Monit will write an alert message into syslog if the memory usage of a docker container
   is larger than the pre-defined threshold.
3. A docker container will auto-restart if one of its critical processes crashes or exits
   unexpectedly.

## 2.2 Functional Description


### 2.2.1 Monitoring Critical Processes
Monit already implements a mechanism to monitor whether a process is running. In detail,
Monit periodically reads the target processes from its configuration file and tries to match
those processes against the process tree in the Linux kernel.

Below is an example of a Monit configuration file that monitors the critical processes in the lldp
container.

*/etc/monit/conf.d/monit_lldp*
```bash
###############################################################################
# Monit configuration file for lldp container
# Process list:
# lldpd
# lldp_syncd
# lldpmgrd
###############################################################################
check process lldp_monitor matching "lldpd: "
    if does not exist for 5 times within 5 cycles then alert
check process lldp_syncd matching "python2 -m lldp_syncd"
    if does not exist for 5 times within 5 cycles then alert
check process lldpmgrd matching "python /usr/bin/lldpmgrd"
    if does not exist for 5 times within 5 cycles then alert
```
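The `for 5 times within 5 cycles` clause can be read as a sliding-window rule: an alert is raised only when the existence check has failed in at least 5 of the last 5 poll cycles. The following is a minimal Python sketch of that reading, not Monit's actual implementation. The names `CycleWindow` and `process_exists` and the sample command lines are hypothetical and serve only as illustration.

```python
import re
from collections import deque


class CycleWindow:
    """Track the last `cycles` poll results and report when at least
    `times` of them were failures (hypothetical helper, not Monit code)."""

    def __init__(self, times, cycles):
        self.times = times
        self.results = deque(maxlen=cycles)  # oldest result drops off automatically

    def record(self, failed):
        """Record one poll cycle; return True if an alert should be raised."""
        self.results.append(bool(failed))
        return sum(self.results) >= self.times


def process_exists(pattern, cmdlines):
    """Return True if any running command line matches the pattern."""
    return any(re.search(pattern, cmdline) for cmdline in cmdlines)


# Simulate six poll cycles in which lldp_syncd is missing from the process list.
# The command lines below are made-up sample data.
window = CycleWindow(times=5, cycles=5)
running = ["lldpd: monitor", "python /usr/bin/lldpmgrd"]
alerts = [
    window.record(not process_exists(r"python2 -m lldp_syncd", running))
    for _ in range(6)
]
print(alerts)
```

In this simulation the first four cycles stay quiet and the alert fires from the fifth consecutive failure onward, which is why a short-lived restart of a process does not immediately produce an alert.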

### 2.2.2 Monitoring Critical Resource Usage
Similar to monitoring the critical processes, we can employ Monit to monitor the resource usage
such as CPU, memory and disk for each process. Unfortunately, Monit is unable to do resource monitoring
at the container level. Thus we propose a new design to achieve such monitoring based on Monit.
Specifically, Monit will run a script and check its exit status. This script
reads the resource usage of a docker container, compares it with the
pre-defined threshold and then returns a value. The value 0 signifies that
the resource usage is less than the threshold, and a non-zero value means Monit will send an alert since
the current usage is larger than the threshold.

Review comment (on the paragraph above): Is this script able to detect any hang/loop or deadlock
situation for the processes or threads inside the container?

Below is an example of a Monit configuration file for the lldp container that passes the pre-defined
threshold (in bytes) to the script and checks its exit status.

```bash
check program container_memory_lldp with path "/usr/bin/memory_checker lldp 104857600"
if status != 0 then alert
```
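The document does not show the checker script itself, so the following is only a minimal Python sketch of what a script like `memory_checker` might do. It assumes the usage is read with `docker stats --no-stream --format "{{.MemUsage}}"`, which is an assumption rather than something the design states. The function names are hypothetical, and treating usage equal to the threshold as healthy is also an assumption.

```python
import re
import subprocess

# Binary unit multipliers as printed by `docker stats` (e.g. "45.3MiB").
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}


def parse_mem_usage(text):
    """Convert a usage string such as '100MiB / 7.6GiB' to bytes used."""
    used = text.split("/")[0].strip()
    match = re.fullmatch(r"([0-9]+(?:\.[0-9]+)?)\s*(B|KiB|MiB|GiB)", used)
    if match is None:
        raise ValueError("unrecognized memory usage: %r" % text)
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])


def read_container_memory(container):
    """Read current memory usage in bytes (assumes the docker CLI is available)."""
    output = subprocess.check_output(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", container],
        text=True,
    )
    return parse_mem_usage(output)


def exit_status(usage_bytes, threshold_bytes):
    """Return 0 when usage is within the threshold, 1 otherwise.
    Usage equal to the threshold counts as healthy (an assumption)."""
    return 0 if usage_bytes <= threshold_bytes else 1


def main(argv):
    """Entry point: argv is [script, container_name, threshold_in_bytes]."""
    container, threshold = argv[1], int(argv[2])
    return exit_status(read_container_memory(container), threshold)
```

A real script would finish with `sys.exit(main(sys.argv))` so that Monit's `if status != 0 then alert` rule sees the result as the process exit status.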

### 2.2.3 Auto-restart Docker Container
The design principle behind this auto-restart feature is that a docker container is automatically shut down and
restarted if one of the critical processes running in it exits unexpectedly. Restarting
the entire container ensures that configuration is reloaded and all processes in the container
get restarted, thus increasing the likelihood of entering a healthy state.

Currently SONiC uses the supervisord system tool to manage the processes in each
docker container. Auto-restarting a docker container is built on supervisord's process
monitoring/notification framework. Specifically,
if the state of a process changes, for example from running to exited,
supervisord emits an event notification, `PROCESS_STATE_EXITED`.
This event is received by the event listener. If the exited process is a critical
one, the event listener terminates supervisord, and the container is shut down
and restarted.
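The flow described above can be sketched in Python on top of supervisord's event listener protocol, in which the listener writes `READY`, reads a header line and a payload from stdin, and acknowledges with `RESULT 2\nOK`. This is a sketch under stated assumptions, not the listener shipped in SONiC. It assumes the exit is delivered as `PROCESS_STATE_EXITED` (the event supervisord emits for a running to exited transition), that "unexpectedly" corresponds to the payload field `expected:0`, and that supervisord is terminated with `SIGTERM`. The function names and the critical process set are hypothetical.

```python
import os
import signal
import sys


def parse_tokens(line):
    """Parse supervisord's space-separated 'key:value' tokens into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def should_restart(headers, payload, critical_processes):
    """Return True when a critical process exited unexpectedly."""
    if headers.get("eventname") != "PROCESS_STATE_EXITED":
        return False
    is_critical = payload.get("processname") in critical_processes
    unexpected = payload.get("expected") == "0"  # assumption: 0 means unexpected exit
    return is_critical and unexpected


def listen(critical_processes, stdin=sys.stdin, stdout=sys.stdout):
    """Event loop following supervisord's event listener protocol."""
    while True:
        stdout.write("READY\n")  # tell supervisord we can accept an event
        stdout.flush()
        headers = parse_tokens(stdin.readline())
        payload = parse_tokens(stdin.read(int(headers["len"])))
        if should_restart(headers, payload, critical_processes):
            # The listener runs as a child of supervisord, so the parent PID
            # is supervisord itself; terminating it stops the container.
            os.kill(os.getppid(), signal.SIGTERM)
        stdout.write("RESULT 2\nOK")  # acknowledge the event
        stdout.flush()
```

A listener for the lldp container could then be started with something like `listen({"lldpd", "lldp_syncd", "lldpmgrd"})`. Keeping the decision in `should_restart` separate from the I/O loop makes the restart rule easy to test without a running supervisord.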

We also introduced a knob that enables or disables this auto-restart feature
dynamically according to the requirements of users. In detail, we created a table
named `CONTAINER_FEATURE` in Config_DB, and this table holds the status of
the auto-restart feature for each docker container. Users can use the CLI to
check and configure the auto-restart status of the corresponding docker container.


#### 2.2.3.1 CLI (and usage example)
The CLI tool will provide the following functionality:
1. Show the current status of the auto-restart feature for docker containers.
2. Configure the auto-restart status of a specific docker container.

##### 2.2.3.1.1 Show the Status of Auto-restart
```
admin@sonic:~$ show container feature autorestart
Container Name         Status
--------------------   --------
database               disabled
lldp                   disabled
radv                   disabled
pmon                   disabled
sflow                  enabled
snmp                   enabled
telemetry              enabled
bgp                    disabled
dhcp_relay             disabled
rest-api               enabled
teamd                  disabled
syncd                  enabled
swss                   disabled
```

Review comment (on the `database` row): Can the database container stay consistent with its data
after an auto-restart?

##### 2.2.3.1.2 Configure the Status of Auto-restart
```
admin@sonic:~$ sudo config container feature autorestart database enabled
```


##### 2.2.3.1.3 CONTAINER_FEATURE Table
Example:
```
{
    "CONTAINER_FEATURE": {
        "database": {
            "auto_restart": "enabled"
        },
        "lldp": {
            "auto_restart": "disabled"
        },
        "radv": {
            "auto_restart": "disabled"
        },
        "pmon": {
            "auto_restart": "disabled"
        },
        "sflow": {
            "auto_restart": "enabled"
        },
        "snmp": {
            "auto_restart": "enabled"
        },
        "telemetry": {
            "auto_restart": "enabled"
        },
        "bgp": {
            "auto_restart": "disabled"
        },
        "dhcp_relay": {
            "auto_restart": "disabled"
        },
        "rest-api": {
            "auto_restart": "enabled"
        },
        "teamd": {
            "auto_restart": "disabled"
        },
        "syncd": {
            "auto_restart": "enabled"
        },
        "swss": {
            "auto_restart": "disabled"
        }
    }
}
```
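To illustrate how the `show` and `config` commands relate to this table, here is a minimal Python sketch that operates on an in-memory dictionary with the same shape. It does not talk to Config DB, and the function names `show_autorestart` and `set_autorestart` are hypothetical. It only shows the lookup and update logic a CLI would need.

```python
import json

VALID_STATES = ("enabled", "disabled")


def show_autorestart(config):
    """Return (container, status) rows sorted by container name."""
    table = config.get("CONTAINER_FEATURE", {})
    return sorted((name, entry["auto_restart"]) for name, entry in table.items())


def set_autorestart(config, container, state):
    """Set the auto-restart status of one container, validating the value."""
    if state not in VALID_STATES:
        raise ValueError("state must be one of %s" % (VALID_STATES,))
    table = config.setdefault("CONTAINER_FEATURE", {})
    table.setdefault(container, {})["auto_restart"] = state
    return config


# A two-container excerpt with the same structure as the table above.
config = json.loads(
    '{"CONTAINER_FEATURE": {'
    '"database": {"auto_restart": "disabled"}, '
    '"lldp": {"auto_restart": "disabled"}}}'
)
set_autorestart(config, "database", "enabled")
for name, status in show_autorestart(config):
    print("%-20s %s" % (name, status))
```

Rejecting anything other than `enabled` or `disabled` at write time keeps the table free of values that the event listener would not know how to interpret.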