Fix race condition caused by strand wrap method #104

Conversation
Could you elaborate on why posting the mux metrics run after setting the mux state is problematic?

When posting a "SwitchStart" event, linkmgrd cleans the whole mux metrics table. If orchagent reacts to the "set mux state" write before that cleanup runs, the metrics recorded for the switchover are wiped out by the late cleanup. Nikola observed this issue a lot in the pilot cluster.
@zjswhhh this change cannot be cherry-picked cleanly to 202205, can you check?
What is the motivation for this PR?

Fix the race condition of the default route notification. This is similar to #104. If linkmgrd receives multiple default route notifications, the mux port posts the default route handlers wrapped by the strand. But Boost.Asio doesn't guarantee the execution order of those wrapped handlers, so the final state machine default route state could be any intermediate default route state. For example, for default route notifications like:

[2024-06-20 08:28:57.872911] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: na
[2024-06-20 08:28:57.872954] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: ok

the final state machine default route state is "ok" if the handler for "ok" executes after the handler for "na", and "na" if the handler for "ok" executes before the handler for "na".

Signed-off-by: Longxiang Lyu lolv@microsoft.com

Work item tracking
Microsoft ADO (number only): 28471183

How did you do it?
Post the default route handlers directly through the strand instead of using strand::wrap, so the handlers execute in the same order they were posted.

How did you verify/test it?
Without this PR, the UT fails.
Description of PR

Summary:
Fixes # (issue)

This PR fixes a race condition introduced by the wrap method of boost::asio::io_service::strand. In the following case:

async_op_1(..., s.wrap(a));
async_op_2(..., s.wrap(b));

the completion of the first async operation performs s.dispatch(a), and the second performs s.dispatch(b), but the order in which those dispatches run is unspecified. That is, you cannot state whether one happens-before the other. Therefore, none of the strand's ordering conditions are met and no ordering guarantee is made.

sign-off: Jing Zhang zhangjing@microsoft.com
Type of change
- Bug fix
- Unit test

Approach

What is the motivation for this PR?
To fix the bug that linkmgrd cleans the mux metrics table after setting the mux state to orchagent.

How did you do it?
- Use boost::asio::post() to replace wrap.
- Add unit tests to cover this race condition scenario:
  - Remove the overrides of setMuxState and postMetricsEvent in FakeDbInterface so the real implementations are executed.
  - Override setMuxState and postMetricsEvent in FakeMuxPort, since LinkManagerStateMachine tests use fake mux ports, so existing tests won't be broken.
  - Initiate a thread pool in the MuxManager test to better mimic production, and run setMuxState & postMetricsEvent 1000 times to reproduce the race.

How did you verify/test it?
Unit tests. Verified:

With wrap, the issue was reproduced 14/1000 times:

[ FAILED ] 1 test, listed below:
[ FAILED ] MuxManagerTest.DbInterfaceRaceConditionCheck
test/MuxManagerTest.cpp:1014: Failure
Value of: mDbInterfacePtr->mDbInterfaceRaceConditionCheckFailure
  Actual: true
Expected: false
...

With boost::asio::post(), the issue was not reproducible.

Any platform specific information?

Documentation