Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[202012] Fix race condition caused by strand wrap method #104 #110

Merged
merged 1 commit into from
Aug 12, 2022

Conversation

zjswhhh
Copy link
Contributor

@zjswhhh zjswhhh commented Aug 11, 2022

3f7a6f2 Fix race condition caused by strand wrap method (#104)

Description of PR

Summary:
Fixes # (issue)
This PR is to fix race condition introduced by wrap method of boost::asio::io_service::strand:

In the following case:

async_op_1(..., s.wrap(a));
async_op_2(..., s.wrap(b));

the completion of the first async operation will perform s.dispatch(a), and the second will perform s.dispatch(b), but the order in which those are performed is unspecified. That is, you cannot state whether one happens-before the other. Therefore, none of the above conditions are met and no ordering guarantee is made.

sign-off: Jing Zhang zhangjing@microsoft.com

Type of change

  • Bug fix
  • New feature
  • Doc/Design
  • Unit test

Approach

What is the motivation for this PR?

To fix the bug that linkgmrd cleans mux metrics table after setting mux state to orchagent.

How did you do it?

  1. Use boost::asio::post() to replace wrap.
  2. Add unit tests to cover this race condition scenario:
    • Remove override of setMuxState, postMetricsEvent in FakeDbinterface so the real implementation can be executed.
    • Override setMuxState, postMetricsEvent in FakeMuxPort as LinkManagerStateMachine tests use fake mux ports, so existing tests won't be broken.
    • Initiate thread pool in MuxManager test to better mimic, and run 1000 times setMuxState & postMetricsEvent to reproduce.

How did you verify/test it?

Unit tests. Verified:

  1. With wrap, issue was reproduced 14/1000 times.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MuxManagerTest.DbInterfaceRaceConditionCheck
test/MuxManagerTest.cpp:1014: Failure
Value of: mDbInterfacePtr->mDbInterfaceRaceConditionCheckFailure
  Actual: true
Expected: false
...
  1. With boost::asio::post(), issue was not reproducible.

Any platform specific information?

Documentation

@zjswhhh zjswhhh force-pushed the cherry_pick_202012 branch 2 times, most recently from 37cc1b2 to 59e1cc5 Compare August 11, 2022 19:18
@lguohan
Copy link
Contributor

lguohan commented Aug 11, 2022

lgtm failed, is that expected?

@zjswhhh
Copy link
Contributor Author

zjswhhh commented Aug 11, 2022

lgtm failed, is that expected?

Not really, it doesn't seem to be caused by PR changes.

@zjswhhh
Copy link
Contributor Author

zjswhhh commented Aug 11, 2022

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@zjswhhh zjswhhh marked this pull request as draft August 11, 2022 23:10
Description of PR
Summary:
Fixes # (issue)
This PR is to fix race condition introduced by wrap method of boost::asio::io_service::strand:

In the following case:

async_op_1(..., s.wrap(a));
async_op_2(..., s.wrap(b));

the completion of the first async operation will perform s.dispatch(a), and the second will perform s.dispatch(b), but the order in which those are performed is unspecified. That is, you cannot state whether one happens-before the other. Therefore, none of the above conditions are met and no ordering guarantee is made.

sign-off: Jing Zhang zhangjing@microsoft.com

Type of change
 Bug fix
 New feature
 Doc/Design
 Unit test
Approach
What is the motivation for this PR?
To fix the bug that linkgmrd cleans mux metrics table after setting mux state to orchagent.

How did you do it?
Use boost::asio::post() to replace wrap.
Add unit tests to cover this race condition scenario:
Remove override of setMuxState, postMetricsEvent in FakeDbinterface so the real implementation can be executed.
Override setMuxState, postMetricsEvent in FakeMuxPort as LinkManagerStateMachine tests use fake mux ports, so existing tests won't be broken.
Initiate thread pool in MuxManager test to better mimic, and run 1000 times setMuxState & postMetricsEvent to reproduce.
How did you verify/test it?
Unit tests. Verified:

With wrap, issue was reproduced 14/1000 times.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MuxManagerTest.DbInterfaceRaceConditionCheck
test/MuxManagerTest.cpp:1014: Failure
Value of: mDbInterfacePtr->mDbInterfaceRaceConditionCheckFailure
  Actual: true
Expected: false
...
With boost::asio::post(), issue was not reproducible.
@zjswhhh zjswhhh marked this pull request as ready for review August 12, 2022 17:56
@lguohan lguohan merged commit 86ddd95 into sonic-net:202012 Aug 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants