Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BufferOrch] Don't call SAI API for BUFFER_POOL/PROFILE handling in case the op is DEL and the SAI OID is NULL #1786

Merged
merged 2 commits into from
Jun 29, 2021

Conversation

stephenxs
Copy link
Collaborator

@stephenxs stephenxs commented Jun 16, 2021

What I did
Don't call SAI API for BUFFER_POOL/PROFILE handling in case the op is DEL and the SAI OID is NULL in order to avoid orchagent from exiting.
We need it only in 202106 or above. In 202012 the orchagent won't exit in such case.

Why I did it
Handle rare cases which cause SAI error eventually makes orchagent to exit.

How I verified it
Manually test.

Details if related
There is an optimization on orch: the SET event will be replaced by a DEL event if it is still pending in m_toSync when the DEL event is coming.
This is reasonable but can cause the SAI OID to be NULL in rare case:

  • The application creates an object and then destroy it after a very short period
  • The create notification is eliminated and replaced by destroying the notification
  • The SAI object hasn't been created and is NULL when it is removed

This causes SAI error which eventually makes orchagent exit.
We need to avoid it in 202106 and above
(In 202012, the orchagent won't exit so it's ok to ignore the error)
The solution is not to call SAI removing interface in case SAI OID is NULL

It can happen if a user configures something which causes the accumulative headroom to exceed the limit.
In this case, the buffer profile was created and then removed in a short time.

… DEL and the SAI OID is NULL

There is an optimization on orch: the SET event will be replaced by a DEL event
if it is still pending in m_toSync when the DEL event is coming.
This is reasonable but can cause the SAI OID be NULL in rare case:
 - The application creates an object and then destroy it after a very short period
 - The create notification is eliminated and replaced by destroy notification
 - The SAI object hasn't been created and is NULL when it is removed

This causes SAI error which eventually makes orchagent exit.
We need to avoid it in 202106 and above
(In 202012, the orchagent won't exit so it's ok to ignore the error)
The solution is not to call SAI removing interface in case SAI OID is NULL

It can happen if a user configures something which causes the accumulative headroom exceeds the limit.
In this case, the buffer profile was created and then removed in a short time.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs stephenxs closed this Jun 16, 2021
@stephenxs stephenxs reopened this Jun 16, 2021
@stephenxs
Copy link
Collaborator Author

stephenxs commented Jun 17, 2021

Looks like the LGTM failed due to some temporary network issue and wasn't recovered after retriggered.

[2021-06-16 23:08:24] [build-stdout] Get:51 http://storage.googleapis.com/lgtm-apt-mirror/ubuntu eoan/main amd64 dh-exec amd64 0.23.2 [25.0 kB]
[2021-06-16 23:08:24] [build-stdout] Fetched 12.9 MB in 1min 10s (184 kB/s)
[2021-06-16 23:08:24] [build-stderr] E: Failed to fetch http://storage.googleapis.com/lgtm-apt-mirror/ubuntu/pool/main/libs/libsodium/libsodium-dev_1.0.18-1_amd64.deb  Connection failed [IP: 173.194.198.128 80]
[2021-06-16 23:08:24] [build-stderr] E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
[2021-06-16 23:08:24] [build-stderr] Error: 'lgtm-install-packages' terminated with exit code: 100

Copy link
Contributor

@neethajohn neethajohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we test this?

orchagent/bufferorch.cpp Outdated Show resolved Hide resolved
orchagent/bufferorch.cpp Show resolved Hide resolved
orchagent/bufferorch.cpp Show resolved Hide resolved
@prsunny
Copy link
Collaborator

prsunny commented Jun 17, 2021

@stephenxs , please do not close and re-open the PR multiple times for retriggering the checks. It can be done via /azp run. I think once the code changes/review is completed, this can be taken care of.

@stephenxs
Copy link
Collaborator Author

@stephenxs , please do not close and re-open the PR multiple times for retriggering the checks. It can be done via /azp run. I think once the code changes/review is completed, this can be taken care of.

Thanks Prince. Looks like we don’t have the privilege to trigger azp rerun. Anyway I’ll avoid closing/reopening the PRs

@stephenxs
Copy link
Collaborator Author

stephenxs commented Jun 17, 2021

How can we test this?

We have a regression test (test_exceeding_headroom) for buffer profile. We don’t have a chance to verify the buffer pool because we seldom remove a buffer pool. I updated it because they shared the same logic.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@neethajohn
Copy link
Contributor

How can we test this?

We have a regression test (test_exceeding_headroom) for buffer profile. We don’t have a chance to verify the buffer pool because we seldom remove a buffer pool. I updated it because they shared the same logic.

@stephenxs , does this test already cover this scenario for buffer profile?

@neethajohn
Copy link
Contributor

How can we test this?

We have a regression test (test_exceeding_headroom) for buffer profile. We don’t have a chance to verify the buffer pool because we seldom remove a buffer pool. I updated it because they shared the same logic.

@stephenxs , does this test already cover this scenario for buffer profile?

Discussed offline, it is covered

@liat-grozovik
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@liat-grozovik liat-grozovik merged commit 3226163 into sonic-net:master Jun 29, 2021
@stephenxs stephenxs deleted the avoid-remove-null-sai-oid branch June 29, 2021 06:20
liat-grozovik pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 29, 2021
Advance submodule head for sonic-swss

3226163 [BufferOrch] Don't call SAI API for BUFFER_POOL/PROFILE handling in case the op is DEL and the SAI OID is NULL (sonic-net/sonic-swss#1786)
6c88e47 [Dynamic Buffer Calc][Mellanox] Bug fixes and enhancements for the lua plugins for buffer pool calculation and headroom checking (sonic-net/sonic-swss#1781)
e86b900 [MPLS] sonic-swss changes for MPLS (sonic-net/sonic-swss#1686)
4c8e2b5 [Dynamic Buffer Calc] Avoid creating lossy PG for admin down ports during initialization (sonic-net/sonic-swss#1776)
3602124 [VS test stability] Skip flaky test for DPB (sonic-net/sonic-swss#1807)
c37cc1c Support for in-band-mgmt via management VRF (sonic-net/sonic-swss#1726)
1e3a532 Fix config prompt question issue (sonic-net/sonic-swss#1799)

Signed-off-by: Stephen Sun <stephens@nvidia.com>
carl-nokia pushed a commit to carl-nokia/sonic-buildimage that referenced this pull request Aug 7, 2021
Advance submodule head for sonic-swss

3226163 [BufferOrch] Don't call SAI API for BUFFER_POOL/PROFILE handling in case the op is DEL and the SAI OID is NULL (sonic-net/sonic-swss#1786)
6c88e47 [Dynamic Buffer Calc][Mellanox] Bug fixes and enhancements for the lua plugins for buffer pool calculation and headroom checking (sonic-net/sonic-swss#1781)
e86b900 [MPLS] sonic-swss changes for MPLS (sonic-net/sonic-swss#1686)
4c8e2b5 [Dynamic Buffer Calc] Avoid creating lossy PG for admin down ports during initialization (sonic-net/sonic-swss#1776)
3602124 [VS test stability] Skip flaky test for DPB (sonic-net/sonic-swss#1807)
c37cc1c Support for in-band-mgmt via management VRF (sonic-net/sonic-swss#1726)
1e3a532 Fix config prompt question issue (sonic-net/sonic-swss#1799)

Signed-off-by: Stephen Sun <stephens@nvidia.com>
shi-su pushed a commit to shi-su/sonic-swss that referenced this pull request Aug 17, 2021
…ase the op is DEL and the SAI OID is NULL (sonic-net#1786)

- What I did
Don't call SAI API for BUFFER_POOL/PROFILE handling in case the op is DEL and the SAI OID is NULL in order to avoid orchagent from exiting.
We need it only in 202106 or above. In 202012 the orchagent won't exit in such case.

- Why I did it
Handle rare cases which cause SAI error eventually makes orchagent to exit.

- How I verified it
Manually test.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
shi-su pushed a commit to shi-su/sonic-swss that referenced this pull request Aug 17, 2021
…ase the op is DEL and the SAI OID is NULL (sonic-net#1786)

- What I did
Don't call SAI API for BUFFER_POOL/PROFILE handling in case the op is DEL and the SAI OID is NULL in order to avoid orchagent from exiting.
We need it only in 202106 or above. In 202012 the orchagent won't exit in such case.

- Why I did it
Handle rare cases which cause SAI error eventually makes orchagent to exit.

- How I verified it
Manually test.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
shi-su added a commit that referenced this pull request Aug 18, 2021
What I did
Backport SAI failure handling related commits into the 202012 branch. The following is a list of backported commits:

941875a Deactivate mirror session only when session status is true in updateLagMember (#1666)
be12482 Ignore ALREADY_EXIST error in FDB creation (#1815)
c9c1aa2 Add failure handling for SAI get operations (#1768)
47b4276 [BufferOrch] Don't call SAI API for BUFFER_POOL/PROFILE handling in case the op is DEL and the SAI OID is NULL (#1786)
db9238f Add failure notification for orchagent (#1665)
fc8e43f [synchronous mode] Add failure notification for SAI failures in synchronous mode (#1596)

Why I did it
202012 image needs to include failure handling mechanism for enough notification in the presence of SAI failures.
raphaelt-nvidia pushed a commit to raphaelt-nvidia/sonic-swss that referenced this pull request Oct 5, 2021
…ase the op is DEL and the SAI OID is NULL (sonic-net#1786)

- What I did
Don't call SAI API for BUFFER_POOL/PROFILE handling in case the op is DEL and the SAI OID is NULL in order to avoid orchagent from exiting.
We need it only in 202106 or above. In 202012 the orchagent won't exit in such case.

- Why I did it
Handle rare cases which cause SAI error eventually makes orchagent to exit.

- How I verified it
Manually test.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants