[services] Restart SwSS service upon unexpected critical process exit #2845

Merged
merged 11 commits into from
May 1, 2019

@jleveque jleveque commented Apr 30, 2019

- What I did
Restart SwSS service (and also restart dependent services) if any critical processes running in the swss container exit abnormally.

- How I did it

  • Add a supervisor-proc-exit-listener event listener plugin for Supervisor in the SwSS Docker container. The plugin loads a list of critical processes and monitors them for unexpected exits.
  • Configure swss.service to always auto-restart the service if it stops, with a delay of 30 seconds. Also set a rate limit of 3 restarts within 20 minutes (1200 seconds). If this number of restarts is exceeded, systemd will stop attempting to restart the service and place it in a 'failed' state. To restart the service after it enters this state, one must first run systemctl reset-failed (we should probably also call this command in config load_minigraph before restarting services).
  • Add systemd dependencies to ensure dependent services also get restarted along with SwSS
  • Also add "WantedBy=swss.service" option to unit files of services which need to be started with SwSS (currently teamd, snmp, dhcp_relay and radv). The "Requires=swss.service" option causes the dependent services to stop and restart along with SwSS (when calling systemctl stop swss.service and systemctl restart swss.service). However this will not cause them to start with SwSS (when calling systemctl start swss.service). This functionality is enabled with the addition of the "WantedBy=" option.
  • Also change the way the DHCP relay Docker container waits for interfaces to be ready before starting the relay agent process. Rather than using ip commands, now check STATE_DB for interface entries with "state" == "ok"
  • supervisor-proc-exit-listener script resides in files/scripts/ so that the same script can be installed in multiple Docker containers. To add this solution to another container, one simply needs to do the following:
    1. Add the script to the container's "_FILES" variable in the container's Makefile, and ensure it gets copied into the container in the container's Dockerfile
    2. Add a /etc/supervisor/critical_processes file to the container specifying all critical processes, one per line
    3. Add the event listener as a process to the container's supervisor config file
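
The listener described above can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: it assumes the standard Supervisor event listener protocol (READY / RESULT over stdin/stdout), assumes Supervisor is the listener's parent process, and assumes an exit counts as unexpected when the event payload carries expected:0. The function names are hypothetical.

```python
import os
import signal
import sys

# Path taken from the PR description.
CRITICAL_PROCESSES_FILE = "/etc/supervisor/critical_processes"


def load_critical_processes(lines):
    """Return the set of process names, one per line, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def parse_tokens(text):
    """Parse Supervisor's space-separated 'key:value' tokens into a dict."""
    return dict(token.split(":", 1) for token in text.split())


def is_unexpected_critical_exit(headers, payload, critical):
    """True when a critical process reports an exit Supervisor did not expect."""
    if headers.get("eventname") != "PROCESS_STATE_EXITED":
        return False
    # Assumption: 'expected:0' marks an abnormal exit.
    return payload.get("processname") in critical and payload.get("expected") == "0"


def run_listener(stdin, stdout, critical, terminate_supervisor):
    """Serve events until stdin closes (False) or a critical process exits (True)."""
    while True:
        stdout.write("READY\n")  # tell Supervisor we can accept an event
        stdout.flush()
        header_line = stdin.readline()
        if not header_line:
            return False
        headers = parse_tokens(header_line)
        # 'len' is a byte count; this sketch assumes an ASCII payload.
        payload = parse_tokens(stdin.read(int(headers["len"])))
        if is_unexpected_critical_exit(headers, payload, critical):
            terminate_supervisor()
            return True
        stdout.write("RESULT 2\nOK")  # acknowledge the event
        stdout.flush()


def main():
    with open(CRITICAL_PROCESSES_FILE) as f:
        critical = load_critical_processes(f)
    # Assumption: Supervisor is this listener's parent, so signal the parent PID.
    run_listener(sys.stdin, sys.stdout, critical,
                 lambda: os.kill(os.getppid(), signal.SIGTERM))
```

Passing the streams and the terminate callback as parameters keeps the loop testable outside a container; inside one, main() would be the entry point named in the supervisor config file.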

- How it Works

  • If a critical process running in the swss container crashes or exits abnormally, supervisor-proc-exit-listener sends a SIGTERM signal to Supervisor, causing it to exit as well
  • Since Supervisor runs as PID 1 within the Docker container, the container stops when the Supervisor process exits
  • When the SwSS Docker container stops, systemd will consider the swss.service to have stopped unexpectedly. Systemd will wait 30 seconds, then stop dependent services, restart SwSS, and lastly restart dependent services (currently syncd, teamd, snmp, dhcp_relay, radv and telemetry), unless it has reached the threshold for restarting, in which case it will move the service into a 'failed' state and not attempt to restart it.
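
To make the restart threshold concrete, here is a small model of the behavior described above: each crash is followed by a restart 30 seconds later, and at most 3 restarts are allowed within a 1200-second window. This is a simplified sketch and not systemd's exact accounting (systemd counts every start attempt, including the initial one); the function name and the fixed-window assumption are illustrative only.

```python
RESTART_DELAY_SEC = 30           # delay before each restart, per the PR description
START_LIMIT_INTERVAL_SEC = 1200  # 20-minute window
START_LIMIT_BURST = 3            # restarts allowed within the window


def simulate_restarts(crash_times,
                      interval=START_LIMIT_INTERVAL_SEC,
                      burst=START_LIMIT_BURST,
                      delay=RESTART_DELAY_SEC):
    """Return (restart_times, failed) for crash timestamps given in seconds.

    Assumption: the window opens at the first restart attempt and is reset
    once an attempt falls more than `interval` seconds after it opened.
    """
    restart_times = []
    window_start = None
    count = 0
    for crash in crash_times:
        attempt = crash + delay
        if window_start is None or attempt - window_start > interval:
            window_start, count = attempt, 0  # open a fresh window
        if count >= burst:
            # Limit reached: the service enters the 'failed' state and stays
            # there until 'systemctl reset-failed' is run.
            return restart_times, True
        count += 1
        restart_times.append(attempt)
    return restart_times, False
```

For example, four crashes at 0, 100, 200 and 300 seconds produce three restarts and then a 'failed' state, while crashes at 0, 100 and 2000 seconds are all restarted because the third falls into a new window.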

- How to verify it

Send a signal to one of the critical processes to cause it to appear to exit abnormally (e.g., pkill -11 orchagent). Ensure the swss, syncd, teamd, snmp, dhcp_relay, radv and telemetry services get restarted per the above details.
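
One way to automate this check is to compare each unit's ActiveEnterTimestampMonotonic property before and after the crash. The helper below is a hypothetical sketch, not part of this PR; it assumes the unit names match the service names listed above and that `systemctl show` is available on the device.

```python
import subprocess

SERVICES = ["swss", "syncd", "teamd", "snmp", "dhcp_relay", "radv", "telemetry"]


def active_enter_time(service, run=subprocess.check_output):
    """Return the monotonic time (microseconds) at which the unit last became active."""
    out = run(["systemctl", "show", service,
               "--property=ActiveEnterTimestampMonotonic"])
    text = out.decode() if isinstance(out, bytes) else out
    # Output has the form 'ActiveEnterTimestampMonotonic=<number>'.
    return int(text.strip().split("=", 1)[1])


def snapshot(services=SERVICES, run=subprocess.check_output):
    """Map each service name to its last activation timestamp."""
    return {s: active_enter_time(s, run) for s in services}


def not_restarted(before, after):
    """Services whose activation timestamp did not advance between snapshots."""
    return sorted(s for s in before if after.get(s, 0) <= before[s])
```

Take a snapshot, run pkill -11 orchagent, wait longer than the 30-second restart delay, take a second snapshot, and expect not_restarted(before, after) to return an empty list.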

NOTE: My updates to systemd dependencies in this PR also fix #2752

@jleveque jleveque requested a review from lguohan April 30, 2019 20:43
@jleveque jleveque self-assigned this Apr 30, 2019
@lguohan lguohan merged commit 6eca27e into sonic-net:master May 1, 2019
@jleveque jleveque deleted the restart_swss branch May 1, 2019 17:58
vrfmgrd
nbrmgrd
vxlanmgrd
intfsyncd

intfsyncd is no longer present.

Thanks! I meant to delete that line, but forgot. It won't cause any issues, but I'll open a new PR to remove it soon.

PR here: #2850

MichelMoriniaux pushed a commit to criteo-forks/sonic-buildimage that referenced this pull request May 28, 2019
…sonic-net#2845)

* [service] Restart SwSS Docker container if orchagent exits unexpectedly

* Configure systemd to stop restarting swss if it attempts to restart more than 3 times in 20 minutes

* Move supervisor-proc-exit-listener script

* [docker-dhcp-relay] Enhance wait_for_intf.sh.j2 to utilize STATEDB

* Ensure dependent services stop/start/restart with SwSS

* Change 'StartLimitInterval' to 'StartLimitIntervalSec', as Stretch installs systemd 232 (>= v230)

* Also update journald.conf options

* Remove 'PartOf' option from unit files

* Add '$(SUPERVISOR_PROC_EXIT_LISTENER_SCRIPT)' to new shared docker-orchagent makefile

* Make supervisor-proc-exit-listener script read from 'critical_processes' file inside container

* Update critical_processes file for swss container
seiferteric pushed a commit to project-arlo/sonic-buildimage that referenced this pull request Oct 14, 2019
 message from community commit below:
[services] Restart SwSS service upon unexpected critical process exit (sonic-net#2845)

Change-Id: Ifd2383a4a3f6edfdf4d1ceffbd60e879673d7647
lguohan pushed a commit that referenced this pull request Nov 9, 2019
…cal process in syncd container exits unexpectedly (#3534)

Add the same mechanism I developed for the SwSS service in #2845 to the syncd service. However, in order to cause the SwSS service to also exit and restart in this situation, I developed a docker-wait-any program which the SwSS service uses to wait for either the swss or syncd containers to exit.
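
The "wait for either container to exit" idea can be sketched as below. This is an assumption-laden illustration and not the docker-wait-any program itself: it shells out to `docker wait`, which blocks until a container stops and prints its exit code, and the function names are hypothetical.

```python
import queue
import subprocess
import threading


def docker_wait(container):
    """Block until the container stops and return its exit code."""
    out = subprocess.check_output(["docker", "wait", container])
    return int(out.decode().strip())


def wait_any(containers, wait_one=docker_wait):
    """Return (name, exit_code) for the first container to exit."""
    results = queue.Queue()
    for name in containers:
        # Daemon threads let the process exit while other waits are still pending.
        threading.Thread(
            target=lambda n=name: results.put((n, wait_one(n))),
            daemon=True,
        ).start()
    return results.get()
```

Calling wait_any(["swss", "syncd"]) would return as soon as either container stops, which is the condition under which the service should exit so that systemd can restart it. The sketch does not handle a failing `docker wait` call; a real program would need to.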
zhenggen-xu pushed a commit to zhenggen-xu/sonic-buildimage that referenced this pull request Jan 10, 2020
…cal process in syncd container exits unexpectedly (sonic-net#3534)

mssonicbld added a commit that referenced this pull request Jul 11, 2023
…lly (#15785)

#### Why I did it
src/sonic-swss
```
* 776af62c - (HEAD -> master, origin/master, origin/HEAD) [CodeQL]: Use dependencies with relevant versions in azp template. (#2845) (4 hours ago) [Nazarii Hnydyn]
```
#### How I did it
#### How to verify it
#### Description for the changelog
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…lly (sonic-net#15785)

mssonicbld added a commit that referenced this pull request Sep 25, 2023
…lly (#16642)

#### Why I did it
src/sonic-swss
```
* 0584d35b - (HEAD -> 202305, origin/202305) Revert "Support type7 encoded CAK key for macsec in config_db (#2892)" (3 minutes ago) [stormliang]
* 7097cf2b - Revert "[teamd]: Clean teamd process if LAG creation fails (#2888)" (3 days ago) [stormliang]
* a0eb0d07 - Support type7 encoded CAK key for macsec in config_db (#2892) (4 days ago) [judyjoseph]
* c7e5f10e - [teamd]: Clean teamd process if LAG creation fails (#2888) (4 days ago) [Lawrence Lee]
* f30b6107 - [CodeQL]: Use dependencies with relevant versions in azp template. (#2845) (4 days ago) [Nazarii Hnydyn]
```
#### How I did it
#### How to verify it
#### Description for the changelog
yxieca pushed a commit that referenced this pull request Oct 5, 2023
…lly (#16532)

src/sonic-swss

* de7186c6 - (HEAD -> 202205, origin/202205) [202205][CodeQL]: Use dependencies with relevant versions in azp template. (#2905) (13 days ago) [Nazarii Hnydyn]
* 106dd9ed - [CodeQL]: Use dependencies with relevant versions in azp template. (#2845) (3 weeks ago) [Nazarii Hnydyn]
Development

Successfully merging this pull request may close these issues.

dhcp_relay service stopped with "systemctl stop swss" but not restarted with "systemctl restart swss"