Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition caused by strand wrap method #104

Merged
merged 5 commits into from
Aug 9, 2022

Conversation

zjswhhh
Copy link
Contributor

@zjswhhh zjswhhh commented Aug 8, 2022

Description of PR

Summary:
Fixes # (issue)
This PR is to fix race condition introduced by wrap method of boost::asio::io_service::strand:

In the following case:

async_op_1(..., s.wrap(a));
async_op_2(..., s.wrap(b));

the completion of the first async operation will perform s.dispatch(a), and the second will perform s.dispatch(b), but the order in which those are performed is unspecified. That is, you cannot state whether one happens-before the other. Therefore, none of the above conditions are met and no ordering guarantee is made.

sign-off: Jing Zhang zhangjing@microsoft.com

Type of change

  • Bug fix
  • New feature
  • Doc/Design
  • Unit test

Approach

What is the motivation for this PR?

To fix the bug that linkgmrd cleans mux metrics table after setting mux state to orchagent.

How did you do it?

  1. Use boost::asio::post() to replace wrap.
  2. Add unit tests to cover this race condition scenario:
    • Remove override of setMuxState, postMetricsEvent in FakeDbinterface so the real implementation can be executed.
    • Override setMuxState, postMetricsEvent in FakeMuxPort as LinkManagerStateMachine tests use fake mux ports, so existing tests won't be broken.
    • Initiate thread pool in MuxManager test to better mimic, and run 1000 times setMuxState & postMetricsEvent to reproduce.

How did you verify/test it?

Unit tests. Verified:

  1. With wrap, issue was reproduced 14/1000 times.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MuxManagerTest.DbInterfaceRaceConditionCheck
test/MuxManagerTest.cpp:1014: Failure
Value of: mDbInterfacePtr->mDbInterfaceRaceConditionCheckFailure
  Actual: true
Expected: false
...
  1. With boost::asio::post(), issue was not reproducible.

Any platform specific information?

Documentation

@lolyu
Copy link
Contributor

lolyu commented Aug 8, 2022

Could you elaborate on why posting mux metrics run after setting mux state is problematic?

@zjswhhh
Copy link
Contributor Author

zjswhhh commented Aug 8, 2022

Could you elaborate on why posting mux metrics run after setting mux state is problematic?

When posting a "SwitchStart" event, linkmgrd will clean the whole mux metrics. And if orchagent reacts on "set mux state" before that, the orch_switch_<active/standby>_start timestamp will be removed.

Nikola observed this issue a lot in pilot cluster:

24495 PORT        EVENT                          TIME
24496 ----------  -----------------------------  ---------------------------
 
24497 Ethernet52  linkmgrd_switch_standby_start  2022-Jul-25 09:53:29.214514
 
24498 Ethernet52  orch_switch_standby_end        2022-Jul-25 09:53:29.238199
 
24499 Ethernet52  xcvrd_switch_standby_start     2022-Jul-25 09:53:29.251985
 
24500 Ethernet52  xcvrd_switch_standby_end       2022-Jul-25 09:53:29.256608
 
24501 Ethernet52  linkmgrd_switch_standby_end    2022-Jul-25 09:53:29.259217

@zjswhhh zjswhhh requested review from yxieca and lolyu August 8, 2022 17:47
@zjswhhh zjswhhh merged commit 3f7a6f2 into sonic-net:master Aug 9, 2022
@zjswhhh zjswhhh deleted the boost_asio_strand branch August 9, 2022 17:42
@yxieca
Copy link
Contributor

yxieca commented Aug 11, 2022

@zjswhhh this change cannot be cherry picked cleanly to 202205, can you check?

yxieca pushed a commit that referenced this pull request Aug 11, 2022
Description of PR
Summary:
Fixes # (issue)
This PR is to fix race condition introduced by wrap method of boost::asio::io_service::strand:

In the following case:

async_op_1(..., s.wrap(a));
async_op_2(..., s.wrap(b));

the completion of the first async operation will perform s.dispatch(a), and the second will perform s.dispatch(b), but the order in which those are performed is unspecified. That is, you cannot state whether one happens-before the other. Therefore, none of the above conditions are met and no ordering guarantee is made.

sign-off: Jing Zhang zhangjing@microsoft.com

Type of change
 Bug fix
 New feature
 Doc/Design
 Unit test
Approach
What is the motivation for this PR?
To fix the bug that linkgmrd cleans mux metrics table after setting mux state to orchagent.

How did you do it?
Use boost::asio::post() to replace wrap.
Add unit tests to cover this race condition scenario:
Remove override of setMuxState, postMetricsEvent in FakeDbinterface so the real implementation can be executed.
Override setMuxState, postMetricsEvent in FakeMuxPort as LinkManagerStateMachine tests use fake mux ports, so existing tests won't be broken.
Initiate thread pool in MuxManager test to better mimic, and run 1000 times setMuxState & postMetricsEvent to reproduce.
How did you verify/test it?
Unit tests. Verified:

With wrap, issue was reproduced 14/1000 times.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MuxManagerTest.DbInterfaceRaceConditionCheck
test/MuxManagerTest.cpp:1014: Failure
Value of: mDbInterfacePtr->mDbInterfaceRaceConditionCheckFailure
  Actual: true
Expected: false
...
With boost::asio::post(), issue was not reproducible.
zjswhhh added a commit to zjswhhh/sonic-linkmgrd that referenced this pull request Aug 11, 2022
Description of PR
Summary:
Fixes # (issue)
This PR is to fix race condition introduced by wrap method of boost::asio::io_service::strand:

In the following case:

async_op_1(..., s.wrap(a));
async_op_2(..., s.wrap(b));

the completion of the first async operation will perform s.dispatch(a), and the second will perform s.dispatch(b), but the order in which those are performed is unspecified. That is, you cannot state whether one happens-before the other. Therefore, none of the above conditions are met and no ordering guarantee is made.

sign-off: Jing Zhang zhangjing@microsoft.com

Type of change
 Bug fix
 New feature
 Doc/Design
 Unit test
Approach
What is the motivation for this PR?
To fix the bug that linkgmrd cleans mux metrics table after setting mux state to orchagent.

How did you do it?
Use boost::asio::post() to replace wrap.
Add unit tests to cover this race condition scenario:
Remove override of setMuxState, postMetricsEvent in FakeDbinterface so the real implementation can be executed.
Override setMuxState, postMetricsEvent in FakeMuxPort as LinkManagerStateMachine tests use fake mux ports, so existing tests won't be broken.
Initiate thread pool in MuxManager test to better mimic, and run 1000 times setMuxState & postMetricsEvent to reproduce.
How did you verify/test it?
Unit tests. Verified:

With wrap, issue was reproduced 14/1000 times.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MuxManagerTest.DbInterfaceRaceConditionCheck
test/MuxManagerTest.cpp:1014: Failure
Value of: mDbInterfacePtr->mDbInterfaceRaceConditionCheckFailure
  Actual: true
Expected: false
...
With boost::asio::post(), issue was not reproducible.
zjswhhh added a commit to zjswhhh/sonic-linkmgrd that referenced this pull request Aug 11, 2022
Description of PR
Summary:
Fixes # (issue)
This PR is to fix race condition introduced by wrap method of boost::asio::io_service::strand:

In the following case:

async_op_1(..., s.wrap(a));
async_op_2(..., s.wrap(b));

the completion of the first async operation will perform s.dispatch(a), and the second will perform s.dispatch(b), but the order in which those are performed is unspecified. That is, you cannot state whether one happens-before the other. Therefore, none of the above conditions are met and no ordering guarantee is made.

sign-off: Jing Zhang zhangjing@microsoft.com

Type of change
 Bug fix
 New feature
 Doc/Design
 Unit test
Approach
What is the motivation for this PR?
To fix the bug that linkgmrd cleans mux metrics table after setting mux state to orchagent.

How did you do it?
Use boost::asio::post() to replace wrap.
Add unit tests to cover this race condition scenario:
Remove override of setMuxState, postMetricsEvent in FakeDbinterface so the real implementation can be executed.
Override setMuxState, postMetricsEvent in FakeMuxPort as LinkManagerStateMachine tests use fake mux ports, so existing tests won't be broken.
Initiate thread pool in MuxManager test to better mimic, and run 1000 times setMuxState & postMetricsEvent to reproduce.
How did you verify/test it?
Unit tests. Verified:

With wrap, issue was reproduced 14/1000 times.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MuxManagerTest.DbInterfaceRaceConditionCheck
test/MuxManagerTest.cpp:1014: Failure
Value of: mDbInterfacePtr->mDbInterfaceRaceConditionCheckFailure
  Actual: true
Expected: false
...
With boost::asio::post(), issue was not reproducible.
zjswhhh added a commit to zjswhhh/sonic-linkmgrd that referenced this pull request Aug 11, 2022
Description of PR
Summary:
Fixes # (issue)
This PR is to fix race condition introduced by wrap method of boost::asio::io_service::strand:

In the following case:

async_op_1(..., s.wrap(a));
async_op_2(..., s.wrap(b));

the completion of the first async operation will perform s.dispatch(a), and the second will perform s.dispatch(b), but the order in which those are performed is unspecified. That is, you cannot state whether one happens-before the other. Therefore, none of the above conditions are met and no ordering guarantee is made.

sign-off: Jing Zhang zhangjing@microsoft.com

Type of change
 Bug fix
 New feature
 Doc/Design
 Unit test
Approach
What is the motivation for this PR?
To fix the bug that linkgmrd cleans mux metrics table after setting mux state to orchagent.

How did you do it?
Use boost::asio::post() to replace wrap.
Add unit tests to cover this race condition scenario:
Remove override of setMuxState, postMetricsEvent in FakeDbinterface so the real implementation can be executed.
Override setMuxState, postMetricsEvent in FakeMuxPort as LinkManagerStateMachine tests use fake mux ports, so existing tests won't be broken.
Initiate thread pool in MuxManager test to better mimic, and run 1000 times setMuxState & postMetricsEvent to reproduce.
How did you verify/test it?
Unit tests. Verified:

With wrap, issue was reproduced 14/1000 times.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MuxManagerTest.DbInterfaceRaceConditionCheck
test/MuxManagerTest.cpp:1014: Failure
Value of: mDbInterfacePtr->mDbInterfaceRaceConditionCheckFailure
  Actual: true
Expected: false
...
With boost::asio::post(), issue was not reproducible.
zjswhhh added a commit to zjswhhh/sonic-linkmgrd that referenced this pull request Aug 11, 2022
Description of PR
Summary:
Fixes # (issue)
This PR is to fix race condition introduced by wrap method of boost::asio::io_service::strand:

In the following case:

async_op_1(..., s.wrap(a));
async_op_2(..., s.wrap(b));

the completion of the first async operation will perform s.dispatch(a), and the second will perform s.dispatch(b), but the order in which those are performed is unspecified. That is, you cannot state whether one happens-before the other. Therefore, none of the above conditions are met and no ordering guarantee is made.

sign-off: Jing Zhang zhangjing@microsoft.com

Type of change
 Bug fix
 New feature
 Doc/Design
 Unit test
Approach
What is the motivation for this PR?
To fix the bug that linkgmrd cleans mux metrics table after setting mux state to orchagent.

How did you do it?
Use boost::asio::post() to replace wrap.
Add unit tests to cover this race condition scenario:
Remove override of setMuxState, postMetricsEvent in FakeDbinterface so the real implementation can be executed.
Override setMuxState, postMetricsEvent in FakeMuxPort as LinkManagerStateMachine tests use fake mux ports, so existing tests won't be broken.
Initiate thread pool in MuxManager test to better mimic, and run 1000 times setMuxState & postMetricsEvent to reproduce.
How did you verify/test it?
Unit tests. Verified:

With wrap, issue was reproduced 14/1000 times.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MuxManagerTest.DbInterfaceRaceConditionCheck
test/MuxManagerTest.cpp:1014: Failure
Value of: mDbInterfacePtr->mDbInterfaceRaceConditionCheckFailure
  Actual: true
Expected: false
...
With boost::asio::post(), issue was not reproducible.
zjswhhh added a commit to zjswhhh/sonic-linkmgrd that referenced this pull request Aug 12, 2022
Description of PR
Summary:
Fixes # (issue)
This PR is to fix race condition introduced by wrap method of boost::asio::io_service::strand:

In the following case:

async_op_1(..., s.wrap(a));
async_op_2(..., s.wrap(b));

the completion of the first async operation will perform s.dispatch(a), and the second will perform s.dispatch(b), but the order in which those are performed is unspecified. That is, you cannot state whether one happens-before the other. Therefore, none of the above conditions are met and no ordering guarantee is made.

sign-off: Jing Zhang zhangjing@microsoft.com

Type of change
 Bug fix
 New feature
 Doc/Design
 Unit test
Approach
What is the motivation for this PR?
To fix the bug that linkgmrd cleans mux metrics table after setting mux state to orchagent.

How did you do it?
Use boost::asio::post() to replace wrap.
Add unit tests to cover this race condition scenario:
Remove override of setMuxState, postMetricsEvent in FakeDbinterface so the real implementation can be executed.
Override setMuxState, postMetricsEvent in FakeMuxPort as LinkManagerStateMachine tests use fake mux ports, so existing tests won't be broken.
Initiate thread pool in MuxManager test to better mimic, and run 1000 times setMuxState & postMetricsEvent to reproduce.
How did you verify/test it?
Unit tests. Verified:

With wrap, issue was reproduced 14/1000 times.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MuxManagerTest.DbInterfaceRaceConditionCheck
test/MuxManagerTest.cpp:1014: Failure
Value of: mDbInterfacePtr->mDbInterfaceRaceConditionCheckFailure
  Actual: true
Expected: false
...
With boost::asio::post(), issue was not reproducible.
lguohan pushed a commit that referenced this pull request Aug 12, 2022
Description of PR
Summary:
Fixes # (issue)
This PR is to fix race condition introduced by wrap method of boost::asio::io_service::strand:

In the following case:

async_op_1(..., s.wrap(a));
async_op_2(..., s.wrap(b));

the completion of the first async operation will perform s.dispatch(a), and the second will perform s.dispatch(b), but the order in which those are performed is unspecified. That is, you cannot state whether one happens-before the other. Therefore, none of the above conditions are met and no ordering guarantee is made.

sign-off: Jing Zhang zhangjing@microsoft.com

Type of change
 Bug fix
 New feature
 Doc/Design
 Unit test
Approach
What is the motivation for this PR?
To fix the bug that linkgmrd cleans mux metrics table after setting mux state to orchagent.

How did you do it?
Use boost::asio::post() to replace wrap.
Add unit tests to cover this race condition scenario:
Remove override of setMuxState, postMetricsEvent in FakeDbinterface so the real implementation can be executed.
Override setMuxState, postMetricsEvent in FakeMuxPort as LinkManagerStateMachine tests use fake mux ports, so existing tests won't be broken.
Initiate thread pool in MuxManager test to better mimic, and run 1000 times setMuxState & postMetricsEvent to reproduce.
How did you verify/test it?
Unit tests. Verified:

With wrap, issue was reproduced 14/1000 times.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MuxManagerTest.DbInterfaceRaceConditionCheck
test/MuxManagerTest.cpp:1014: Failure
Value of: mDbInterfacePtr->mDbInterfaceRaceConditionCheckFailure
  Actual: true
Expected: false
...
With boost::asio::post(), issue was not reproducible.
yxieca pushed a commit that referenced this pull request Jun 21, 2024
What is the motivation for this PR?
Fix the race condition of the default route notification.

This is similar to #104

If there are multiple default route notifications received by linkmgrd, the mux port posts the default route handlers wrapped by strand. But boost asio doesn't guarantee the execution order of the default route handlers, so the final state machine default route could be any intermediate default route state.

For example, for default route notifications like:

[2024-06-20 08:28:57.872911] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: na
[2024-06-20 08:28:57.872954] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: ok
The final state machine default route state could be "ok" if the handler for "ok" is executed after the handler for "na".
The final state machine default route state could be "na" if the handler for "ok" is executed before the handler for "na".

Signed-off-by: Longxiang Lyu lolv@microsoft.com

Work item tracking
Microsoft ADO (number only): 28471183
How did you do it?
post the default route handlers directly through strand instead of using strand::wrap, so the handlers are executed in the same order as the handlers' post order.

How did you verify/test it?
without this PR, UT fail:

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
mssonicbld pushed a commit to mssonicbld/sonic-linkmgrd that referenced this pull request Jun 21, 2024
)

What is the motivation for this PR?
Fix the race condition of the default route notification.

This is similar to sonic-net#104

If there are multiple default route notifications received by linkmgrd, the mux port posts the default route handlers wrapped by strand. But boost asio doesn't guarantee the execution order of the default route handlers, so the final state machine default route could be any intermediate default route state.

For example, for default route notifications like:

[2024-06-20 08:28:57.872911] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: na
[2024-06-20 08:28:57.872954] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: ok
The final state machine default route state could be "ok" if the handler for "ok" is executed after the handler for "na".
The final state machine default route state could be "na" if the handler for "ok" is executed before the handler for "na".

Signed-off-by: Longxiang Lyu lolv@microsoft.com

Work item tracking
Microsoft ADO (number only): 28471183
How did you do it?
post the default route handlers directly through strand instead of using strand::wrap, so the handlers are executed in the same order as the handlers' post order.

How did you verify/test it?
without this PR, UT fail:

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
mssonicbld pushed a commit that referenced this pull request Jun 21, 2024
What is the motivation for this PR?
Fix the race condition of the default route notification.

This is similar to #104

If there are multiple default route notifications received by linkmgrd, the mux port posts the default route handlers wrapped by strand. But boost asio doesn't guarantee the execution order of the default route handlers, so the final state machine default route could be any intermediate default route state.

For example, for default route notifications like:

[2024-06-20 08:28:57.872911] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: na
[2024-06-20 08:28:57.872954] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: ok
The final state machine default route state could be "ok" if the handler for "ok" is executed after the handler for "na".
The final state machine default route state could be "na" if the handler for "ok" is executed before the handler for "na".

Signed-off-by: Longxiang Lyu lolv@microsoft.com

Work item tracking
Microsoft ADO (number only): 28471183
How did you do it?
post the default route handlers directly through strand instead of using strand::wrap, so the handlers are executed in the same order as the handlers' post order.

How did you verify/test it?
without this PR, UT fail:

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
lolyu added a commit to lolyu/sonic-linkmgrd that referenced this pull request Jun 24, 2024
)

What is the motivation for this PR?
Fix the race condition of the default route notification.

This is similar to sonic-net#104

If there are multiple default route notifications received by linkmgrd, the mux port posts the default route handlers wrapped by strand. But boost asio doesn't guarantee the execution order of the default route handlers, so the final state machine default route could be any intermediate default route state.

For example, for default route notifications like:

[2024-06-20 08:28:57.872911] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: na
[2024-06-20 08:28:57.872954] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: ok
The final state machine default route state could be "ok" if the handler for "ok" is executed after the handler for "na".
The final state machine default route state could be "na" if the handler for "ok" is executed before the handler for "na".

Signed-off-by: Longxiang Lyu lolv@microsoft.com

Work item tracking
Microsoft ADO (number only): 28471183
How did you do it?
post the default route handlers directly through strand instead of using strand::wrap, so the handlers are executed in the same order as the handlers' post order.

How did you verify/test it?
without this PR, UT fail:

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
mssonicbld pushed a commit to mssonicbld/sonic-linkmgrd that referenced this pull request Aug 6, 2024
)

What is the motivation for this PR?
Fix the race condition of the default route notification.

This is similar to sonic-net#104

If there are multiple default route notifications received by linkmgrd, the mux port posts the default route handlers wrapped by strand. But boost asio doesn't guarantee the execution order of the default route handlers, so the final state machine default route could be any intermediate default route state.

For example, for default route notifications like:

[2024-06-20 08:28:57.872911] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: na
[2024-06-20 08:28:57.872954] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: ok
The final state machine default route state could be "ok" if the handler for "ok" is executed after the handler for "na".
The final state machine default route state could be "na" if the handler for "ok" is executed before the handler for "na".

Signed-off-by: Longxiang Lyu lolv@microsoft.com

Work item tracking
Microsoft ADO (number only): 28471183
How did you do it?
post the default route handlers directly through strand instead of using strand::wrap, so the handlers are executed in the same order as the handlers' post order.

How did you verify/test it?
without this PR, UT fail:

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
mssonicbld pushed a commit that referenced this pull request Aug 6, 2024
What is the motivation for this PR?
Fix the race condition of the default route notification.

This is similar to #104

If there are multiple default route notifications received by linkmgrd, the mux port posts the default route handlers wrapped by strand. But boost asio doesn't guarantee the execution order of the default route handlers, so the final state machine default route could be any intermediate default route state.

For example, for default route notifications like:

[2024-06-20 08:28:57.872911] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: na
[2024-06-20 08:28:57.872954] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: ok
The final state machine default route state could be "ok" if the handler for "ok" is executed after the handler for "na".
The final state machine default route state could be "na" if the handler for "ok" is executed before the handler for "na".

Signed-off-by: Longxiang Lyu lolv@microsoft.com

Work item tracking
Microsoft ADO (number only): 28471183
How did you do it?
post the default route handlers directly through strand instead of using strand::wrap, so the handlers are executed in the same order as the handlers' post order.

How did you verify/test it?
without this PR, UT fail:

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
mssonicbld pushed a commit to mssonicbld/sonic-linkmgrd that referenced this pull request Aug 21, 2024
)

What is the motivation for this PR?
Fix the race condition of the default route notification.

This is similar to sonic-net#104

If there are multiple default route notifications received by linkmgrd, the mux port posts the default route handlers wrapped by strand. But boost asio doesn't guarantee the execution order of the default route handlers, so the final state machine default route could be any intermediate default route state.

For example, for default route notifications like:

[2024-06-20 08:28:57.872911] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: na
[2024-06-20 08:28:57.872954] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: ok
The final state machine default route state could be "ok" if the handler for "ok" is executed after the handler for "na".
The final state machine default route state could be "na" if the handler for "ok" is executed before the handler for "na".

Signed-off-by: Longxiang Lyu lolv@microsoft.com

Work item tracking
Microsoft ADO (number only): 28471183
How did you do it?
post the default route handlers directly through strand instead of using strand::wrap, so the handlers are executed in the same order as the handlers' post order.

How did you verify/test it?
without this PR, UT fail:

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
mssonicbld pushed a commit that referenced this pull request Aug 21, 2024
What is the motivation for this PR?
Fix the race condition of the default route notification.

This is similar to #104

If there are multiple default route notifications received by linkmgrd, the mux port posts the default route handlers wrapped by strand. But boost asio doesn't guarantee the execution order of the default route handlers, so the final state machine default route could be any intermediate default route state.

For example, for default route notifications like:

[2024-06-20 08:28:57.872911] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: na
[2024-06-20 08:28:57.872954] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: ok
The final state machine default route state could be "ok" if the handler for "ok" is executed after the handler for "na".
The final state machine default route state could be "na" if the handler for "ok" is executed before the handler for "na".

Signed-off-by: Longxiang Lyu lolv@microsoft.com

Work item tracking
Microsoft ADO (number only): 28471183
How did you do it?
post the default route handlers directly through strand instead of using strand::wrap, so the handlers are executed in the same order as the handlers' post order.

How did you verify/test it?
without this PR, UT fail:

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants