Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait till CHASIS_APP_DB PING is successful, host_name and asic_name are valid in CONIFG_DB before starting chassis-db-cleanup #17962

Merged
merged 3 commits into from
Apr 18, 2024

Conversation

saksarav-nokia
Copy link
Contributor

@saksarav-nokia saksarav-nokia commented Jan 31, 2024

Why I did it

This PR fixes the issue reported in Issu #17945
We noticed that chassis db clean up is skipped sometimes when the CHASSIS_APP_DB PING fails. Also if host_name and asic_name are not written to CONIG_DB, it could pass the empty strings to CHASSIS_APP_DB EVAL commands.
The service hostname-config.service is restarted whenever the config-reload or load-minigraph is done and this services renames the file /etc/hosts to updates it with the new file. This interferes with swss@.service and when swss.sh script CHASSIS_APP_DPP when the /etc/hosts file is renamed, the error "Unable to connect to redis: Cannot assign requested address" is seen and the CHASSIS_APP_DB EVAL command fails. This causes the chassis db entries not getting cleaned up and causes orchagent crash in remote LC's.

Work item tracking
  • Microsoft ADO (number only):

How I did it

Wait till CHASS_APP_DB PING is successful before checking for entries in CHASSIS_APP_DB table. Also wait till host_name and asic_name are valis in CONFIG_DB.
Modified swss@.service to start after hostname-config.service

How to verify it

Ran a script with 200 times config reload & load-minigraph and verified that chassis db cleanup is done every time and the orchagent crash is not seen .

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • [x ] 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@saksarav-nokia
Copy link
Contributor Author

@judyjoseph for viz

@saksarav-nokia
Copy link
Contributor Author

@judyjoseph @arlakshm @abdosi , please review this PR

@saksarav-nokia
Copy link
Contributor Author

We ran the complete oc with this fix and the error "Unable to connect to redis: Cannot assign requested address" is not seen. Also we didn't see the orchagent crash

@judyjoseph
Copy link
Contributor

/AzurePipelines run

Copy link

You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using /azp run [pipelines] command. You can specify multiple pipelines using a comma separated list.

@judyjoseph
Copy link
Contributor

/azp run Azure.sonic-buildimage

1 similar comment
@judyjoseph
Copy link
Contributor

/azp run Azure.sonic-buildimage

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

judyjoseph
judyjoseph previously approved these changes Mar 5, 2024
Copy link
Contributor

@judyjoseph judyjoseph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good to me

@judyjoseph
Copy link
Contributor

@arlakshm @abdosi for review as well

@judyjoseph judyjoseph self-requested a review March 5, 2024 21:30
@judyjoseph
Copy link
Contributor

@saksarav-nokia, Trying to find an alternative solution here as this change to add hostname-config.service dependency will affect all platforms.

I checked this script, can we add a specific check in this script to proceed with changes in /etc/hosts file only if HOSTNAME changes ?:

.

That should help our case and we need not add this hostname-config.service dependency

@saksarav-nokia
Copy link
Contributor Author

@judyjoseph , I think that will also fix the issue. I will test it out and update the PR.

…re valid in CONIFG_DB before starting chassis-db-cleanup

Signed-off-by: saksarav <sakthivadivu.saravanaraj@nokia.com>
Signed-off-by: saksarav <sakthivadivu.saravanaraj@nokia.com>
Signed-off-by: saksarav <sakthivadivu.saravanaraj@nokia.com>
@saksarav-nokia
Copy link
Contributor Author

@judyjoseph , Addressed your comments and verified the changes and ensured the issue is not seen with current changes. Please review it.

@judyjoseph
Copy link
Contributor

/azp run Azure.sonic-buildimage

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@judyjoseph judyjoseph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@judyjoseph
Copy link
Contributor

@rlhui @lguohan please help merge this PR

@gechiang
Copy link
Collaborator

MSFT ADO: 27704026

@rlhui rlhui merged commit d87ff46 into sonic-net:master Apr 18, 2024
21 checks passed
@gechiang gechiang added the Included in Chassis for 202205 Branch Indicate PR is already in MSFT repo 202205 branch label Apr 22, 2024
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Apr 23, 2024
…re valid in CONIFG_DB before starting chassis-db-cleanup (sonic-net#17962)

This PR fixes the issue reported in Issu sonic-net#17945
We noticed that chassis db clean up is skipped sometimes when the CHASSIS_APP_DB PING fails. Also if host_name and asic_name are not written to CONIG_DB, it could pass the empty strings to CHASSIS_APP_DB EVAL commands.
The service hostname-config.service is restarted whenever the config-reload or load-minigraph is done and this services renames the file /etc/hosts to updates it with the new file. This interferes with swss@.service and when swss.sh script CHASSIS_APP_DPP when the /etc/hosts file is renamed, the error "Unable to connect to redis: Cannot assign requested address" is seen and the CHASSIS_APP_DB EVAL command fails. This causes the chassis db entries not getting cleaned up and causes orchagent crash in remote LC's.

---------

Signed-off-by: saksarav <sakthivadivu.saravanaraj@nokia.com>
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202305: #18756

mssonicbld pushed a commit that referenced this pull request Apr 23, 2024
…re valid in CONIFG_DB before starting chassis-db-cleanup (#17962)

This PR fixes the issue reported in Issu #17945
We noticed that chassis db clean up is skipped sometimes when the CHASSIS_APP_DB PING fails. Also if host_name and asic_name are not written to CONIG_DB, it could pass the empty strings to CHASSIS_APP_DB EVAL commands.
The service hostname-config.service is restarted whenever the config-reload or load-minigraph is done and this services renames the file /etc/hosts to updates it with the new file. This interferes with swss@.service and when swss.sh script CHASSIS_APP_DPP when the /etc/hosts file is renamed, the error "Unable to connect to redis: Cannot assign requested address" is seen and the CHASSIS_APP_DB EVAL command fails. This causes the chassis db entries not getting cleaned up and causes orchagent crash in remote LC's.

---------

Signed-off-by: saksarav <sakthivadivu.saravanaraj@nokia.com>
mlok-nokia pushed a commit to mlok-nokia/sonic-buildimage that referenced this pull request Jun 5, 2024
…re valid in CONIFG_DB before starting chassis-db-cleanup (sonic-net#17962)

This PR fixes the issue reported in Issu sonic-net#17945
We noticed that chassis db clean up is skipped sometimes when the CHASSIS_APP_DB PING fails. Also if host_name and asic_name are not written to CONIG_DB, it could pass the empty strings to CHASSIS_APP_DB EVAL commands.
The service hostname-config.service is restarted whenever the config-reload or load-minigraph is done and this services renames the file /etc/hosts to updates it with the new file. This interferes with swss@.service and when swss.sh script CHASSIS_APP_DPP when the /etc/hosts file is renamed, the error "Unable to connect to redis: Cannot assign requested address" is seen and the CHASSIS_APP_DB EVAL command fails. This causes the chassis db entries not getting cleaned up and causes orchagent crash in remote LC's.

---------

Signed-off-by: saksarav <sakthivadivu.saravanaraj@nokia.com>
@gechiang
Copy link
Collaborator

@yxieca , who can review/approve for 202311 for this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

7 participants