Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[teamd] prevent re-entrance of port priv change handler #2723

Merged
merged 1 commit into from
Apr 1, 2019

Conversation

yxieca
Copy link
Contributor

@yxieca yxieca commented Apr 1, 2019

- What I did

This change addresses an issue introduced by #2699

When adding a lag member dynamically after system boots up, teamd
port priv change handler could re-entrant itself and causing adding
operation to fail.

While handling PORT_CHANGE event, teamd_per_port.c port priv change
handler was called, it will then call runner_lacp to add port to lag,
the later causes IFINFO_CHANGE to be notified and calls the priv change
handler again, this re-entrance would cause runner_lacp port_added to
be called again and messes up with the previous adding sequence. Then
fails the lag member adding operation.

Prevent per port priv change handler re-entrance solves the problem.

Signed-off-by: Ying Xie ying.xie@microsoft.com

- How to verify it

  1. Passed incremental config nightly test over the weekend.

  2. Manually tested add a member port to a lag.

  3. 897 iterations of continuous warm reboot: 338 with [teamd] retry creating team_port after interface info changed #2699 only. And 559 with [teamd] retry creating team_port after interface info changed #2699 plus this change (test still running):

yinxi@acs-trusty8:/warmboot/cont_wb$ grep Rebooting wb-test-20190328-0631.log | grep 20190329 | wc -l
559
yinxi@acs-trusty8:/warmboot/cont_wb$ grep Rebooting wb-test-20190328-0631.log | grep -v 20190329 | wc -l
338

When adding a lag member dynamically after system boots up, teamd
port priv change handler could re-entrant itself and causing adding
operation to fail.

While handling PORT_CHANGE event, teamd_per_port.c port priv change
handler was called, it will then call runner_lacp to add port to lag,
the later causes IFINFO_CHANGE to be notified and calls the priv change
handler again, this re-entrance would cause runner_lacp port_added to
be called again and messes up with the previous adding sequence. Then
fails the lag member adding operation.

Prevent per port priv change handler re-entrance solves the problem.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@jipanyang
Copy link
Collaborator

I feel that we need to drill down the root cause of that device name missing issue. Patching with all the workaround doesn't seem to be a long term solution.

@yxieca
Copy link
Contributor Author

yxieca commented Apr 1, 2019

@jipanyang I agree with you. That might take longer time. If we identify the root cause and proper fix. I don't mind revert this change and #2699.

Frankly speaking, I thought about this. I think the other way to fix the issue is to delay creating LAG devices after warm reboot. But I don't know a good indication of 'readiness' to do so.

@yxieca yxieca merged commit fd3f611 into sonic-net:master Apr 1, 2019
@yxieca yxieca deleted the teamd branch April 1, 2019 23:51
yxieca added a commit that referenced this pull request Apr 1, 2019
When adding a lag member dynamically after system boots up, teamd
port priv change handler could re-entrant itself and causing adding
operation to fail.

While handling PORT_CHANGE event, teamd_per_port.c port priv change
handler was called, it will then call runner_lacp to add port to lag,
the later causes IFINFO_CHANGE to be notified and calls the priv change
handler again, this re-entrance would cause runner_lacp port_added to
be called again and messes up with the previous adding sequence. Then
fails the lag member adding operation.

Prevent per port priv change handler re-entrance solves the problem.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
tiantianlv pushed a commit to SONIC-DEV/sonic-buildimage that referenced this pull request Apr 10, 2019
When adding a lag member dynamically after system boots up, teamd
port priv change handler could re-entrant itself and causing adding
operation to fail.

While handling PORT_CHANGE event, teamd_per_port.c port priv change
handler was called, it will then call runner_lacp to add port to lag,
the later causes IFINFO_CHANGE to be notified and calls the priv change
handler again, this re-entrance would cause runner_lacp port_added to
be called again and messes up with the previous adding sequence. Then
fails the lag member adding operation.

Prevent per port priv change handler re-entrance solves the problem.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
tiantianlv pushed a commit to SONIC-DEV/sonic-buildimage that referenced this pull request Apr 10, 2019
When adding a lag member dynamically after system boots up, teamd
port priv change handler could re-entrant itself and causing adding
operation to fail.

While handling PORT_CHANGE event, teamd_per_port.c port priv change
handler was called, it will then call runner_lacp to add port to lag,
the later causes IFINFO_CHANGE to be notified and calls the priv change
handler again, this re-entrance would cause runner_lacp port_added to
be called again and messes up with the previous adding sequence. Then
fails the lag member adding operation.

Prevent per port priv change handler re-entrance solves the problem.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
tiantianlv pushed a commit to SONIC-DEV/sonic-buildimage that referenced this pull request Apr 10, 2019
When adding a lag member dynamically after system boots up, teamd
port priv change handler could re-entrant itself and causing adding
operation to fail.

While handling PORT_CHANGE event, teamd_per_port.c port priv change
handler was called, it will then call runner_lacp to add port to lag,
the later causes IFINFO_CHANGE to be notified and calls the priv change
handler again, this re-entrance would cause runner_lacp port_added to
be called again and messes up with the previous adding sequence. Then
fails the lag member adding operation.

Prevent per port priv change handler re-entrance solves the problem.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
yxieca pushed a commit that referenced this pull request Apr 20, 2023
…lly (#14750)

src/sonic-swss

* 96407be - (HEAD -> 202205, origin/202205) sonic-swss: Sync and program remote lag member status. (#2730) (11 hours ago) [Sambath Kumar Balasubramanian]
* 6051d33 - Fix race condition in sub-interface handling (#2723) (11 hours ago) [Stephen Sun]
* 7b95ec5 - [muxorch] handling multiple mux nexthops for route (#2656) (11 hours ago) [Nikola Dancejic]
mihirpat1 pushed a commit to mihirpat1/sonic-buildimage that referenced this pull request Jun 14, 2023
* Fix race condition in sub-interface handling

A sub-interface's becomes admin up in the following scenarios:
1. A port becomes admin up in STATE_DB.LAG_TABLE when there are sub-interfaces created based on it
2. A sub-interfaces is created when the parent port's admin_status is up in APPL_DB.LAG_TABLE

However, it's possible that
- the sub-interfaces have not been created while STATE_DB.LAG_TABLE is updated
- the LAGs, as the parent ports, are still admin-down in APPL_DB.LAG_TABLE when sub-interfaces are created.
In this scenario, sub-interfaces will keep admin-down for ever.

Fix: In (2), check LAG's admin status in STATE_DB instead of APPL_DB
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants