From b38bb4c799e58e74124ebe4f7cb0a2f70bf7687b Mon Sep 17 00:00:00 2001
From: sujinmkang
Date: Fri, 12 Jun 2020 17:26:09 -0700
Subject: [PATCH 1/9] PCIe Monitor service

---
 doc/pcie-mon/pcie-daemon-hld.md |  94 +++++
 images/pcie-mon.svg             | 652 ++++++++++++++++++++++++++++++++
 2 files changed, 746 insertions(+)
 create mode 100644 doc/pcie-mon/pcie-daemon-hld.md
 create mode 100644 images/pcie-mon.svg

diff --git a/doc/pcie-mon/pcie-daemon-hld.md b/doc/pcie-mon/pcie-daemon-hld.md
new file mode 100644
index 00000000000..c29263dafda
--- /dev/null
+++ b/doc/pcie-mon/pcie-daemon-hld.md
@@ -0,0 +1,94 @@
+# SONiC PCIe Monitor service HLD #
+
+### Rev 0.1 ###
+
+### Revision
+ | Rev |     Date    |       Author       | Change Description                |
+ |:---:|:-----------:|:------------------:|-----------------------------------|
+ | 0.1 |             |     Sujin Kang     | Initial version                   |
+
+## About This Manual ##
+
+This document is intend to monitor the platform PCIe devices and alert any problem on PCIe buses and devices.
+
+
+## 1. PCIe Monitor service design ##
+
+New PCIe Monitor service is designed to use the PcieUtil utility to check the current status of PCIe devices and buses and alert if there are any missing devices or any errors while communicating on the PCIe buses.
+
+### 1.1 Access PCIe devices and buses from platform container ###
+
+PCIe device information can be accessed via read files under (e.g. `/sys/bus/pci/devices/0000:01:00.1m`). Different vendors may use different folders; these folders need to be mounted to the platform container so pcied can access them.
+
+
+For the convenience of implementation and reduce the time consuming, pcie-mon.service will use the `pcieutil` which is the pcie diag tool. `pcieutil` is implemented based on platform_base.sonic_pcie.`PcieUtil` class.
+
+1. `PcieUtil` should get the platform specific PCIe device information and monitor the PCIe device and bus status.
+
+2. `PCIeUtil` will provide APIs `load_config_file`, `get_pcie_device` and `get_pcie_check` to get the expected PCIe device list and information, to get the current PCIe device information, and to check if any PCIe device is missing or if there is any PCIe bus error.
+
+![pcieinfo_design](https://github.com/Azure/SONiC/blob/master/doc/pcieinfo_design.md)
+
+### 1.2 PCIe device configuration file ###
+
+PcieUtil needs to get the expected PCIe device information to check the PCIe device status periodically, which is different for each platform/hardware sku.
+
+Each vendor needs to generate the PCIe device configuration file named pcie.yaml and place the file under device///plugins.
+
+Example) Location: `device/celestica/x86_64-cel_seastone-r0/plugins/pcie.yaml`
+
+```
+...
+- bus: '01'
+  dev: '00'
+  fn: '0'
+  id: b960
+  name: 'Ethernet controller: Broadcom Limited Broadcom BCM56960 Switch ASIC'
+- bus: '01'
+  dev: '00'
+  fn: '1'
+  id: b960
+  name: 'Ethernet controller: Broadcom Limited Broadcom BCM56960 Switch ASIC'
+```
+
+
+### 1.3 PCIe device status check ###
+
+
+Here we define a common platform API in class `PcieBase`:
+
+    @abc.abstractmethod
+    def get_pcie_check(self, timeout=0):
+        """
+         Check Pcie device with config file
+         Returns:
+            A list including pcie device and test result info
+        """
+        return []
+
+Each vendor needs to implement this function in the `PcieBase` plugin if the vendor has any additional PCIe health check method.
+
+PcieUtil calls this API to check the PCIe device status, following example code showing how this API will be called:
+
+    while True:
+        status, device_dict = platform_pcieutil.get_pcie_check()
+        if status:
+            for key, value in device_dict.items():
+                print("Device on PCIe bus: %s was %s" % (key, value))
+
+### 1.3 PCIe daemon flow ###
+
+pcie-mon.service.timer will be started by systemd during boot up and it will trigger the pcie-mon.service to spawn a thread to check PCIe device status 10 seconds after rc.local.service is completed and it will periodically spawn to monitor the PCIe devices every 1 minute.
+
+The detailed flow is shown in the chart below:
+![](https://github.com/Azure/SONiC/blob/master/images/pcie-mon.svg)
+
+
+< TBA >
+
+## Open Questions ##
+
+1. The current PcieUtil is limited to checking PCIe device availability based on the configuration.
+   Can we also add the PCIe AER detection into get_pcie_check() api?
+   some plugins like, say, collectd (https://wiki.opnfv.org/display/fastpath/PCIe+Advanced+Error+Reporting+Plugin)
+
diff --git a/images/pcie-mon.svg b/images/pcie-mon.svg
new file mode 100644
index 00000000000..4bd01642667
--- /dev/null
+++ b/images/pcie-mon.svg
@@ -0,0 +1,652 @@
[SVG markup for images/pcie-mon.svg (652 added lines) is not recoverable from this copy. Flow chart labels: systemd, Pcie-mon.timer, "10 seconds on boot ?", "1 minute ?", Pcie-mon.service, Pcie-mon.sh, "Show platform pcieinfo -c", "pcieutil pcie_show / pcieutil pcie_check", Pcie.yml, Platform_base.sonic_pcie.pcie_common.PcieUtil, get_pcie_device(), get_pcie_check(), Platform.plugins.PcieUtil, Platform.plugins; connectors labeled Yes and No.]

From 9ad1955bce9e572fa01dd7243ec24ed6da335599 Mon Sep 17 00:00:00 2001
From: sujinmkang
Date: Tue, 16 Jun 2020 21:14:17 -0700
Subject: [PATCH 2/9] Update pcie monitoring service hld

---
 doc/pcie-mon/pcie-daemon-hld.md |  43 ++-
 images/pcie-mon.svg             | 582 ++++++++------------------------
 images/pcied.svg                | 271 +++++++++++++++
 3 files changed, 445 insertions(+), 451 deletions(-)
 create mode 100644 images/pcied.svg

diff --git a/doc/pcie-mon/pcie-daemon-hld.md b/doc/pcie-mon/pcie-daemon-hld.md
index c29263dafda..26f9b598f78 100644
--- a/doc/pcie-mon/pcie-daemon-hld.md
+++ b/doc/pcie-mon/pcie-daemon-hld.md
@@ -1,25 +1,34 @@
-# SONiC PCIe Monitor service HLD #
+# SONiC PCIe Monitoring services HLD #

 ### Rev 0.1 ###

 ### Revision
- | Rev |     Date    |       Author       | Change Description                |
- |:---:|:-----------:|:------------------:|-----------------------------------|
- | 0.1 |             |     Sujin Kang     | Initial version                   |
+ | Rev |     Date    |       Author       | Change Description                             |
+ |:---:|:-----------:|:------------------:|------------------------------------------------|
+ | 0.1 |             |     Sujin Kang     | Initial version                                |
+ | 0.2 |             |     Sujin Kang     | Add rescan for pcie device missing during boot |
+ |     |             |                    | Add pcied to PMON for runtime monitoring       |

 ## About This Manual ##

-This document is intend to monitor the platform PCIe devices and alert any problem on PCIe buses and devices.
+This document is intend to give the idea of how to monitor the platform PCIe devices and alert any problem on PCIe buses and devices on SONiC using pcie-mon service and pcied on PMON container.


 ## 1. PCIe Monitor service design ##

 New PCIe Monitor service is designed to use the PcieUtil utility to check the current status of PCIe devices and buses and alert if there are any missing devices or any errors while communicating on the PCIe buses.

-### 1.1 Access PCIe devices and buses from platform container ###
+PCIe device monitoring will be done in two separate services, `pcie-mon.service` which is a systemd service, will monitor the PCIe device during the boot time and `pcied` which is a daemon in PMON container will monitor during the runtime.

-PCIe device information can be accessed via read files under (e.g. `/sys/bus/pci/devices/0000:01:00.1m`). Different vendors may use different folders; these folders need to be mounted to the platform container so pcied can access them.
+First, pcie-mon.service will be added to check the pcie device enumeration status, trigger the pci device rescan if there is any missing device and indicate any device missing to the parties that are interested in the device enumeration, for example, kernel_bde driver, platform drivers, etc.
+Second, pcid in PMON will perform the periodic pcie device check during the run time.
+
+Both pcie-mon.service and pcied will update the state db with the PCIe device status whenever it changes.
+
+### 1.1 Access the PCIe devices and buses from platform ###
+
+PCIe device information can be accessed via read files under (e.g. `/sys/bus/pci/devices/0000:01:00.1`). Different vendors may use different folders; these folders need to be mounted to the platform container so pcied can access them.


 For the convenience of implementation and reduce the time consuming, pcie-mon.service will use the `pcieutil` which is the pcie diag tool. `pcieutil` is implemented based on platform_base.sonic_pcie.`PcieUtil` class.

@@ -51,10 +60,12 @@ Example) Location: `device/celestica/x86_64-cel_seastone-r0/plugins/pcie.yaml`
   name: 'Ethernet controller: Broadcom Limited Broadcom BCM56960 Switch ASIC'
 ```

-
 ### 1.3 PCIe device status check ###

+The default PCIe device check function, get_pcie_check, is implemented in the PcieUtil class at sonic_platform_base/sonic_pcie/pcie_common.py.
+It loads the PCIe device configuration file and compares it with the enumerated devices based on the platform sysfs device tree under /sys/bus/pci/devices/.
+

 Here we define a common platform API in class `PcieBase`:

     @abc.abstractmethod
@@ -76,19 +87,25 @@ PcieUtil calls this API to check the PCIe device status, following example code
             for key, value in device_dict.items():
                 print("Device on PCIe bus: %s was %s" % (key, value))

-### 1.3 PCIe daemon flow ###
+### 1.4 PCIe Monitor Service `pcie-mon.service` flow ###

-pcie-mon.service.timer will be started by systemd during boot up and it will trigger the pcie-mon.service to spawn a thread to check PCIe device status 10 seconds after rc.local.service is completed and it will periodically spawn to monitor the PCIe devices every 1 minute.
+pcie-mon.service will be started by systemd during boot up and it will spawn a thread to check PCIe device status and rescan the PCI devices if there are any missing devices after rc.local.service is completed and it will update the state db with pcie device status so that the dependent services/container or kernel driver can be started or stopped based on the status.

 The detailed flow is shown in the chart below:
 ![](https://github.com/Azure/SONiC/blob/master/images/pcie-mon.svg)


-< TBA >
+### 1.5 PCIe daemon `pcied` flow ###

+pcied will be started by PMON container after boot up and it will check the PCIe device status periodically every 1 minute and update the state db when the status is changed.
+
+The detailed flow is shown in the chart below:
+![](https://github.com/Azure/SONiC/blob/master/images/pcied.svg)
+
+
+< TBA >
 ## Open Questions ##

 1. The current PcieUtil is limited to checking PCIe device availability based on the configuration.
-   Can we also add the PCIe AER detection into get_pcie_check() api?
+   Can we also add the PCIe communication error status check using AER detection into get_pcie_check() api or with a separate api?
    some plugins like, say, collectd (https://wiki.opnfv.org/display/fastpath/PCIe+Advanced+Error+Reporting+Plugin)
-
diff --git a/images/pcie-mon.svg b/images/pcie-mon.svg
index 4bd01642667..2dfedb56ad6 100644
--- a/images/pcie-mon.svg
+++ b/images/pcie-mon.svg
[SVG hunks are not recoverable from this copy. Labels removed: Pcie-mon.timer, pcieutil, "10 seconds on boot ?", "1 minute ?", "Show platform pcieinfo -c", "pcieutil pcie_show / pcieutil pcie_check", Pcie.yml, Platform_base.sonic_pcie.pcie_common.PcieUtil, get_pcie_device() / get_pcie_check(), Platform.plugins.PcieUtil, Platform.plugins. Labels added: /sys/bus/pci/rescan, "Get_pcie_check?", "Time < end?", Update STATE_DB, Exit, pcie.yaml; connectors labeled Pass, Fail, Yes and No.]
diff --git a/images/pcied.svg b/images/pcied.svg
new file mode 100644
index 00000000000..4690d537ee8
--- /dev/null
+++ b/images/pcied.svg
@@ -0,0 +1,271 @@
[SVG markup for images/pcied.svg (271 added lines) is not recoverable from this copy. Flow chart labels: PMON, pcied, "Get_pcie_check && status changed?", Update STATE_DB, Delay 1 min, pcie.yaml; connectors labeled Yes and No.]

From d92c0094b02688e453fad7f6d4efee5f0b820ccb Mon Sep 17 00:00:00 2001
From: sujinmkang
Date: Thu, 18 Jun 2020 16:59:54 -0700
Subject: [PATCH 3/9] Move the update_state db to pcieutil so that it can be updated whenever the pcie device status gets checked.

---
 doc/pcie-mon/pcie-daemon-hld.md |   8 +--
 images/pcie-mon.svg             | 105 +++++++++++---------------------
 images/pcied.svg                |  80 +++++++++++++-----------
 3 files changed, 85 insertions(+), 108 deletions(-)

diff --git a/doc/pcie-mon/pcie-daemon-hld.md b/doc/pcie-mon/pcie-daemon-hld.md
index 26f9b598f78..aa1b0be4610 100644
--- a/doc/pcie-mon/pcie-daemon-hld.md
+++ b/doc/pcie-mon/pcie-daemon-hld.md
@@ -32,9 +32,9 @@ PCIe device information can be accessed via read files under (e.g. `/sys/bus/pci

 For the convenience of implementation and reduce the time consuming, pcie-mon.service will use the `pcieutil` which is the pcie diag tool. `pcieutil` is implemented based on platform_base.sonic_pcie.`PcieUtil` class.

-1. `PcieUtil` should get the platform specific PCIe device information and monitor the PCIe device and bus status.
+1. `pcieutil` should get the platform specific PCIe device information and monitor the PCIe device and bus status with PcieUtil.get_pcie_check and update the STATE_DB based on get_pcie_check results.

-2. `PCIeUtil` will provide APIs `load_config_file`, `get_pcie_device` and `get_pcie_check` to get the expected PCIe device list and information, to get the current PCIe device information, and to check if any PCIe device is missing or if there is any PCIe bus error.
+2. `PcieUtil` will provide APIs `load_config_file`, `get_pcie_device` and `get_pcie_check` to get the expected PCIe device list and information, to get the current PCIe device information, and to check if any PCIe device is missing or if there is any PCIe bus error.

 ![pcieinfo_design](https://github.com/Azure/SONiC/blob/master/doc/pcieinfo_design.md)

@@ -89,7 +89,7 @@ PcieUtil calls this API to check the PCIe device status, following example code

 ### 1.4 PCIe Monitor Service `pcie-mon.service` flow ###

-pcie-mon.service will be started by systemd during boot up and it will spawn a thread to check PCIe device status and rescan the PCI devices if there are any missing devices after rc.local.service is completed and it will update the state db with pcie device status so that the dependent services/container or kernel driver can be started or stopped based on the status.
+pcie-mon.service will be started by systemd during boot up and it will spawn a thread to check PCIe device status and rescan the PCI devices if there are any missing devices after rc.local.service is completed and it will update the state db with pcie device status during the `pcieutil pcie-check` call so that the dependent services/container or kernel driver can be started or stopped based on the status.

 The detailed flow is shown in the chart below:
 ![](https://github.com/Azure/SONiC/blob/master/images/pcie-mon.svg)
@@ -97,7 +97,7 @@ The detailed flow is shown in the chart below:

 ### 1.5 PCIe daemon `pcied` flow ###

-pcied will be started by PMON container after boot up and it will check the PCIe device status periodically every 1 minute and update the state db when the status is changed.
+pcied will be started by the PMON container and will continue monitoring the PCIe device status during run time; it will check the PCIe device status periodically every 1 minute and update the state db when the status is checked.

 The detailed flow is shown in the chart below:
 ![](https://github.com/Azure/SONiC/blob/master/images/pcied.svg)
diff --git a/images/pcie-mon.svg b/images/pcie-mon.svg
index 2dfedb56ad6..256f239bb95 100644
--- a/images/pcie-mon.svg
+++ b/images/pcie-mon.svg
[SVG hunks are not recoverable from this copy. Label changed: "Get_pcie_check?" to "Get_pcie_check ? (update STATE_DB in pcieutil)". Label removed: Update STATE_DB (process node).]
diff --git a/images/pcied.svg b/images/pcied.svg
index 4690d537ee8..53eb79d85ef 100644
--- a/images/pcied.svg
+++ b/images/pcied.svg
[SVG hunks are not recoverable from this copy. Label changes visible in this span: Pcie-mon.service to Pcie-check.service, Pcie-mon.sh to Pcie-check.sh, "Get_pcie_check ? (update STATE_DB in pcieutil)" to "Get_pcie_check ?". Label added: Database node `update STATE_DB "PCIE_STATUS|PCIE_DEVICES" "PASSED"/"FAILED"`.]
diff --git a/images/pcied.svg b/images/pcied.svg
index 53eb79d85ef..f88dcdb83ad 100644
--- a/images/pcied.svg
+++ b/images/pcied.svg
[SVG hunks are not recoverable from this copy. Label changes visible in this span: Rectangle 'pcieutil pcei-check' removed; pcied to Start `pcied`; Update STATE_DB process node replaced by Database node `update STATE_DB "PCIE_STATUS|PCIE_DEVICES" "PASSED"/"FAILED"`.]

From 70a152f1b98e145c9f0771e7cda7a951d98a978e Mon Sep 17 00:00:00 2001
From: sujinmkang
Date: Thu, 25 Jun 2020 14:56:43 -0700
Subject: [PATCH 6/9] review comments

---
 doc/pcie-mon/pcie-monitoring-services-hld.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/pcie-mon/pcie-monitoring-services-hld.md b/doc/pcie-mon/pcie-monitoring-services-hld.md
index a45fa9c1808..256e33c363b 100644
--- a/doc/pcie-mon/pcie-monitoring-services-hld.md
+++ b/doc/pcie-mon/pcie-monitoring-services-hld.md
@@ -11,16 +11,16 @@

 ## About This Manual ##

-This document is intend to give the idea of how to monitor the platform PCIe devices and alert any problem on PCIe buses and devices on SONiC using pcie-check service and pcied on PMON container.
+This document is intended to give the idea of how to monitor the platform PCIe devices and alert of any problems on PCIe buses and devices on SONiC using pcie-check service and pcied on PMON container.


 ## 1. PCIe Monitor service design ##

 New PCIe Monitor service is designed to use the PcieUtil utility to check the current status of PCIe devices and buses and alert if there are any missing devices or any errors while communicating on the PCIe buses.

-PCIe device monitoring will be done in two separate services, `pcie-check.service` which is a systemd service, will monitor the PCIe device during the boot time and `pcied` which is a daemon in PMON container will monitor during the runtime.
+PCIe device monitoring will be done in two separate services, `pcie-check.service` which is a systemd service, will check the PCIe device during the boot time and `pcied` which is a daemon in PMON container will monitor during the runtime.

-First, pcie-check.service will be added to check the pcie device enumeration status, trigger the pci device rescan if there is any missing device and indicate any device missing to the parties that are interested in the device enumeration, for example, kernel_bde driver, platform drivers, etc.
+First, pcie-check.service will be added to check the pcie device enumeration status, trigger a maximum of 15 retries of a pci device rescan if there is any missing device and save the result status of pcie device check into the STATE_DB to indicate any device missing to the parties that are interested in the device enumeration, for example, kernel_bde driver, platform drivers, etc.
 Second, pcied in PMON will perform the periodic pcie device check during the run time.

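A minimal sketch of the boot-time flow this patch describes: run the check, trigger a rescan when a device is missing, retry up to a limit, and record the outcome. It is an illustration under stated assumptions, not the service's implementation. The names `check_with_rescan`, `check_fn`, `rescan_fn`, `record_fn` and the one-second delay are hypothetical; only the `/sys/bus/pci/rescan` path, the limit of 15 retries, and the `PASSED`/`FAILED` values appear in the patch series.

```python
import time

RESCAN_PATH = "/sys/bus/pci/rescan"  # sysfs node named in the flow chart


def rescan_pci_bus(path=RESCAN_PATH):
    """Ask the kernel to re-enumerate PCI devices (needs root)."""
    with open(path, "w") as f:
        f.write("1")


def check_with_rescan(check_fn, rescan_fn=rescan_pci_bus, record_fn=print,
                      max_retries=15, delay_s=1.0, sleep_fn=time.sleep):
    """Run check_fn; on failure rescan and retry, at most max_retries times.

    check_fn  -- hypothetical callable returning True when every expected
                 device is present (e.g. a wrapper around get_pcie_check)
    record_fn -- hypothetical callable that receives "PASSED" or "FAILED"
                 (e.g. a STATE_DB writer)
    """
    passed = check_fn()
    retries = 0
    while not passed and retries < max_retries:
        rescan_fn()          # trigger re-enumeration
        sleep_fn(delay_s)    # assumed settle time before re-checking
        passed = check_fn()
        retries += 1
    status = "PASSED" if passed else "FAILED"
    record_fn(status)
    return status, retries
```

Passing the check, rescan, and record steps in as callables keeps the retry logic testable without root access or a running database.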
From e5defda3d4a5b3391d14c017c928eed46e5d7880 Mon Sep 17 00:00:00 2001
From: sujinmkang
Date: Tue, 7 Jul 2020 22:43:35 -0700
Subject: [PATCH 7/9] update the image link

---
 doc/pcie-mon/pcie-monitoring-services-hld.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/pcie-mon/pcie-monitoring-services-hld.md b/doc/pcie-mon/pcie-monitoring-services-hld.md
index 256e33c363b..30e89b97349 100644
--- a/doc/pcie-mon/pcie-monitoring-services-hld.md
+++ b/doc/pcie-mon/pcie-monitoring-services-hld.md
@@ -92,7 +92,7 @@ PcieUtil calls this API to check the PCIe device status, following example code
 pcie-check.service will be started by systemd during boot up and it will spawn a thread to check PCIe device status and rescan the PCI devices if there are any missing devices after rc.local.service is completed and it will update the state db with pcie device status after the `pcieutil pcie-check` call so that the dependent services/container or kernel driver can be started or stopped based on the status.

 The detailed flow is shown in the chart below:
-![](https://github.com/Azure/SONiC/blob/master/images/pcie-check.svg)
+![](https://github.com/Azure/SONiC/blob/70a152f1b98e145c9f0771e7cda7a951d98a978e/images/pcie-check.svg)


 ### 1.5 PCIe daemon `pcied` flow ###
@@ -100,7 +100,7 @@ The detailed flow is shown in the chart below:
 pcied will be started by the PMON container and will continue monitoring the PCIe device status during run time; it will check the PCIe device status periodically every 1 minute and update the state db when the status is checked.

 The detailed flow is shown in the chart below:
-![](https://github.com/Azure/SONiC/blob/master/images/pcied.svg)
+![](https://github.com/Azure/SONiC/blob/70a152f1b98e145c9f0771e7cda7a951d98a978e/images/pcied.svg)


 ### 1.6 STATE_DB keys and value ###

From d35db846ea13f07bef9004bf58ecc41a03071dcc Mon Sep 17 00:00:00 2001
From: sujinmkang
Date: Mon, 13 Jul 2020 14:11:55 -0700
Subject: [PATCH 8/9] fix the retry number of pcie rescan.
---
 doc/pcie-mon/pcie-monitoring-services-hld.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/pcie-mon/pcie-monitoring-services-hld.md b/doc/pcie-mon/pcie-monitoring-services-hld.md
index 30e89b97349..eb032635aab 100644
--- a/doc/pcie-mon/pcie-monitoring-services-hld.md
+++ b/doc/pcie-mon/pcie-monitoring-services-hld.md
@@ -20,7 +20,7 @@ New PCIe Monitor service is designed to use the PcieUtil utility to check the cu

 PCIe device monitoring will be done in two separate services, `pcie-check.service` which is a systemd service, will check the PCIe device during the boot time and `pcied` which is a daemon in PMON container will monitor during the runtime.

-First, pcie-check.service will be added to check the pcie device enumeration status, trigger a maximum of 15 retries of a pci device rescan if there is any missing device and save the result status of pcie device check into the STATE_DB to indicate any device missing to the parties that are interested in the device enumeration, for example, kernel_bde driver, platform drivers, etc.
+First, pcie-check.service will be added to check the pcie device enumeration status, trigger a retry of a pci device rescan if there is any missing device and save the result status of pcie device check into the STATE_DB to indicate any device missing to the parties that are interested in the device enumeration, for example, kernel_bde driver, platform drivers, etc.
 Second, pcied in PMON will perform the periodic pcie device check during the run time.

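The runtime side, where pcied checks every minute and updates the state db each time the status is checked, could be sketched roughly as below. This is an assumption-laden illustration rather than the daemon's code: `run_pcied`, `check_fn`, `update_fn`, and `max_cycles` are hypothetical names, and `max_cycles` exists only so the loop can be bounded when experimenting.

```python
import time


def run_pcied(check_fn, update_fn, interval_s=60, max_cycles=None,
              sleep_fn=time.sleep):
    """Periodically run check_fn and pass each result to update_fn.

    check_fn   -- hypothetical callable returning True when the check passes
    update_fn  -- hypothetical callable receiving "PASSED" or "FAILED"
    max_cycles -- hypothetical bound for experiments; None loops forever
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        status = "PASSED" if check_fn() else "FAILED"
        update_fn(status)  # updated on every check, as the HLD text states
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            sleep_fn(interval_s)  # the "Delay 1 min" step in the flow chart
    return cycles
```

An earlier revision of the HLD updated the state db only when the status changed; that variant would keep the previous status and call `update_fn` only when it differs.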
From 4e93b8b30609ffc698ac801cd12740c35d2ea284 Mon Sep 17 00:00:00 2001
From: sujinmkang
Date: Mon, 13 Jul 2020 15:34:49 -0700
Subject: [PATCH 9/9] review comment

---
 doc/pcie-mon/pcie-monitoring-services-hld.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/pcie-mon/pcie-monitoring-services-hld.md b/doc/pcie-mon/pcie-monitoring-services-hld.md
index eb032635aab..90d3a978b7c 100644
--- a/doc/pcie-mon/pcie-monitoring-services-hld.md
+++ b/doc/pcie-mon/pcie-monitoring-services-hld.md
@@ -42,7 +42,7 @@ For the convenience of implementation and reduce the time consuming, pcie-check.

 PcieUtil needs to get the expected PCIe device information to check the PCIe device status periodically, which is different for each platform/hardware sku.

-Each vendor needs to generate the PCIe device configuration file named pcie.yaml and place the file under device///plugins.
+Each vendor needs to generate the PCIe device configuration file named pcie.yaml and place the file under `device///plugins`.

 Example) Location: `device/celestica/x86_64-cel_seastone-r0/plugins/pcie.yaml`

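To make the role of the configuration file concrete, here is a rough sketch of comparing expected entries (shaped like the `pcie.yaml` example in the HLD) against device names found under `/sys/bus/pci/devices`. The function names are hypothetical, the PCI domain is assumed to be `0000` as in the HLD's example path, and the sketch checks presence only; it does not compare the `id` field or read any other sysfs attribute.

```python
import os


def sysfs_name(entry, domain="0000"):
    """Build the sysfs directory name for one pcie.yaml-style entry."""
    return "%s:%s:%s.%s" % (domain, entry["bus"], entry["dev"], entry["fn"])


def find_missing(expected, present):
    """Return the expected entries whose sysfs name is not in present."""
    present = set(present)
    return [e for e in expected if sysfs_name(e) not in present]


def list_enumerated(root="/sys/bus/pci/devices"):
    """List device names the kernel exposes under sysfs (empty if absent)."""
    return os.listdir(root) if os.path.isdir(root) else []
```

With the two Broadcom entries from the example file as `expected`, `find_missing(expected, list_enumerated())` would return whichever of `0000:01:00.0` and `0000:01:00.1` the kernel has not enumerated.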