
[config] Call 'systemctl reset-failed' before 'systemctl restart' when restarting services #607

Merged
jleveque merged 3 commits into sonic-net:master from reset_failed_before_restart on Aug 20, 2019
Conversation

jleveque
Contributor

This ensures that the process will get restarted, even if systemd has previously placed the service in the "failed" state, as systemctl reset-failed <service> will take the service out of the "failed" state.

Also, no longer print the command names as they are run; instead print a message stating that we are restarting each service.

The commands which currently call this function are config load_minigraph and config reload.

This should address sonic-net/sonic-buildimage#3244
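
For reference, a minimal sketch of the per-service flow described above (the run_command helper here is a simplified stand-in for the one in config/main.py, and the function name is illustrative, not the actual code):

import subprocess
import click

def run_command(command):
    # Simplified stand-in for the run_command() helper in config/main.py
    subprocess.check_call(command, shell=True)

def restart_services(services):
    for service in services:
        click.echo("Restarting {} ...".format(service))
        # Clear any previous "failed" state (e.g. start-limit-hit) so the
        # restart is not refused by systemd
        run_command("systemctl reset-failed {}".format(service))
        run_command("systemctl restart {}".format(service))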

Contributor

@nikos-github left a comment

config/main.py Outdated
# We first run "systemctl reset-failed" to ensure that we
# can restart the service, even if it has entered a failed state
click.echo("Restarting {} ...".format(service))
run_command("systemctl reset-failed {}".format(service))
Contributor

I assume the error code returned by reset-failed when the service is not in a failed state doesn't trigger the exception. Looks good to me otherwise.

Contributor Author

@jleveque Aug 14, 2019

Calling systemctl reset-failed on a service not in a failed state will still return 0. There should be no issues.
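
A quick way to verify that claim on a device (a standalone sketch, not part of the PR; ssh.service is just an example of a unit that is not in the failed state):

import subprocess

# Expect return code 0 even though the unit is not currently failed
result = subprocess.run(["systemctl", "reset-failed", "ssh.service"])
print(result.returncode)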

Contributor

Great. Thank you for fixing this.

@jleveque
Contributor Author

Retest this please

@jleveque
Contributor Author

jleveque commented Aug 14, 2019

I'm now contemplating changing this as follows:

  • Instead of calling systemctl reset-failed <service> on each service in the list, just call systemctl reset-failed once to clear the failed state of ALL failed services. This will also ensure all dependent services are removed from the failed state.

Thoughts, anyone?

Edit: I made the change in commit #ab28455

@avi-milner

@jleveque I have tested this and it seems to fix the problem: performed multiple config load_minigraph runs without any issue seen.

@avi-milner left a comment

I have tested this and it seems to fix the problem: performed multiple config load_minigraph runs without any issue seen.

config/main.py Outdated
# We first run "systemctl reset-failed" to remove the "failed"
# status from all services before we attempt to restart them
click.echo("Resetting all failed services ...")
run_command("systemctl reset-failed")
Contributor

The concern I have is that this could possibly reset more services than the ones on the list we are meant to handle. Not sure if this is correct or desirable.


So maybe change the approach to reset the counter per service?

Contributor

Yes, I believe that approach would be better.

@jleveque
Contributor Author

Retest this please

@avi-milner

Retest this please
Hi, the reviewer requests that you use your first commit, which resets the counter per service and not globally.
I didn't see that you reverted the 2nd commit.

@jleveque
Contributor Author

@avi-milner: Are you asking that I revert back to the original solution? What about dependent services? What if they (syncd, teamd, etc.) are in the failed state? I don't believe they will get restarted.

@nikos-github
Contributor

nikos-github commented Aug 15, 2019

@avi-milner: Are you asking that I revert back to the original solution? What about dependent services? What if they (syncd, teamd, etc.) are in the failed state? I don't believe they will get restarted.

Are syncd and teamd systemd-controlled? In either case, if there are dependent services, then we need to have a list of them and reset-failed those as well, if we feel we should. A blind reset-failed for all services inside a SONiC device is not the right thing to do.

@jleveque
Contributor Author

@nikos-github: Yes. All Docker containers are controlled by their own service. Multiple services are dependent upon SwSS (syncd, teamd, dhcp_relay, radv, snmp), so when SwSS restarts it will also restart its dependent services.
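
For illustration, the reverse dependencies systemd knows about can be listed like this (a sketch; the unit name swss is an assumption):

import subprocess

# List the units that systemd considers dependents of swss, i.e. candidates
# to be restarted along with it
output = subprocess.check_output(
    ["systemctl", "list-dependencies", "--reverse", "swss"],
    universal_newlines=True)
print(output)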

@avi-milner: What do you see wrong with removing all services from the failed list upon reloading config? Reloading config is pretty much like wiping the slate clean. Why not also reset the failed status of all services at that time?

@avi-milner

avi-milner commented Aug 15, 2019 via email

@jleveque
Contributor Author

I see. So you're good with this change?

@avi-milner

avi-milner commented Aug 15, 2019 via email

@nikos-github
Contributor

nikos-github commented Aug 15, 2019

@jleveque As I mentioned earlier, this is not the right approach. Resetting all failed services in the SONiC device through the use of systemctl reset-failed is not correct. The services need to be explicitly specified. Not all services in the switch are under SONiC's control, and therefore no such assumption should be made.

@jleveque
Contributor Author

Sorry for the confusion earlier. I seem to have gotten your responses mixed up.

@nikos-github: So you believe resetting all failed services upon reloading the configuration is not an assumption we can make? Like I said earlier, reloading configuration pretty much "wipes the slate clean," at which point we would want to try to restart all services, even those which may have failed (possibly due to bad configuration). Can you provide an example of a service which you would not want to remove from the failed list when reloading configuration?

@nikos-github
Contributor

nikos-github commented Aug 15, 2019

@jleveque This really depends on how the user has configured their Linux system. We should try to restart all failed SONiC services, or at least the ones controlled by, or dependent on, config reload, not all failed services in the system in general. When the SONiC configuration is reloaded, it should confine itself to SONiC services. systemctl reset-failed will reset ALL failed services in the system, whether they are SONiC-related or not.

@jleveque
Contributor Author

@nikos-github: Please review my latest iteration. I now obtain a list of failed services and only reset the failed state of each one if it is a service we will be attempting to restart (or a dependency thereof), by explicitly checking against a list.
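
Roughly, that approach looks like the sketch below (the helper name _reset_failed_services comes from the diff later in this thread; the service list and parsing details here are assumptions, not the merged code):

import subprocess
import click

# Services (and known dependents) whose failed state we are willing to reset;
# this particular list is illustrative only
SERVICES_TO_RESET = ["swss", "syncd", "teamd", "dhcp_relay", "radv", "snmp"]

def _reset_failed_services():
    # Ask systemd which units are currently in the "failed" state
    output = subprocess.check_output(
        ["systemctl", "list-units", "--state=failed", "--no-legend"],
        universal_newlines=True)
    for line in output.splitlines():
        # The unit name is the first token ending in ".service"
        unit = next((t for t in line.split() if t.endswith(".service")), None)
        if not unit:
            continue
        name = unit[:-len(".service")]
        if name in SERVICES_TO_RESET:
            click.echo("Resetting failed state of service {} ...".format(name))
            subprocess.check_call(["systemctl", "reset-failed", unit])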

@jleveque
Contributor Author

Retest this please

@jleveque
Contributor Author

@avi-milner: Can you test the current iteration of this PR to ensure it still fixes your problem?

@jleveque
Contributor Author

Retest this please

@jleveque
Contributor Author

Retest this please

@jleveque
Contributor Author

Retest this please

@avi-milner

@avi-milner: Can you test the current iteration of this PR to ensure it still fixes your problem?

It works fine for me.

@jleveque jleveque merged commit abd1dbc into sonic-net:master Aug 20, 2019
@jleveque jleveque deleted the reset_failed_before_restart branch August 20, 2019 18:00
@avi-milner left a comment

Hi @jleveque,
It seems that the fix is not working as we expected when we call config reload / config load / config load_minigraph.
With this code you only reset the failed counter for services that are already in the failed state; this still causes the config load scenarios to fail after running them 3 times within a time window of 20 minutes.

Can you please fix it to always reset the failed counter for the config load commands?
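
A minimal sketch of the requested behavior, under the assumption that systemctl reset-failed also clears the per-unit start-rate counter: call it unconditionally for each service right before restarting it, rather than only for units already marked failed. The helper names here are hypothetical.

import subprocess
import click

def _restart_service(service):
    # Unconditionally clear the failed state / start-rate counter for this
    # service before restarting it (hypothetical helper, not the merged code)
    click.echo("Resetting failed state of service {} ...".format(service))
    subprocess.check_call(["systemctl", "reset-failed", service])
    click.echo("Restarting {} ...".format(service))
    subprocess.check_call(["systemctl", "restart", service])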

@@ -454,6 +494,9 @@ def load_minigraph():
    if os.path.isfile(db_migrator) and os.access(db_migrator, os.X_OK):
        run_command(db_migrator + ' -o set_version')

    # We first run "systemctl reset-failed" to remove the "failed"
    # status from all services before we attempt to restart them
    _reset_failed_services()


This would only remove the failed status from services that are already in the failed state.

@@ -398,6 +435,9 @@ def reload(filename, yes, load_sysinfo):
    if os.path.isfile(db_migrator) and os.access(db_migrator, os.X_OK):
        run_command(db_migrator + ' -o migrate')

    # We first run "systemctl reset-failed" to remove the "failed"
    # status from all services before we attempt to restart them
    _reset_failed_services()


This would only remove the failed status from services that are already in the failed state.
