Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

teamd: Add support for custom retry counts for LACP sessions #13453

Merged
merged 14 commits into from
Jun 9, 2023

Conversation

saiarcot895
Copy link
Contributor

@saiarcot895 saiarcot895 commented Jan 20, 2023

Signed-off-by: Saikrishna Arcot sarcot@microsoft.com

ADO (Microsoft only): 22127474

Why I did it

This is to add support for specifying custom retry counts for LACP sessions. This is to make warmboot easier on low-storage and low-memory platforms, by allowing more than 90 seconds of downtime.

How I did it

How to verify it

Tested manually with these cases:

  • Verify that changing the retry count using teamdctl PortChannel101 state item set runner.retry_count 5 takes effect
  • Verify that the retry count change actually affects when the LAG goes down by forcefully killing teamd on one side (i.e. setting the retry count to 5 causes the LAG to go down after 150 seconds)
  • Verify that the retry count gets reset to 3 after the LAG goes down for whatever reason
  • Verify that the retry count gets reset to 3 after some period of time (30 seconds * retry count)

Test cases are in sonic-net/sonic-mgmt#7961 and sonic-net/sonic-mgmt#8152.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Tested and verified to be working on 202205 image.

Description for the changelog

Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@yxieca
Copy link
Contributor

yxieca commented Jan 26, 2023

@saiarcot895 in the patch file, there seems to be a mix use of tabs and whitespaces, which caused indentation issues even just see from this PR. Can you unify the spaces?

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@saiarcot895
Copy link
Contributor Author

@saiarcot895 in the patch file, there seems to be a mix use of tabs and whitespaces, which caused indentation issues even just see from this PR. Can you unify the spaces?

I converted some lines that appeared to be using spaces to tabs, hopefully I got all of them.

+ lacp_port->tdport->ifname,
+ lacp_port->partner_retry_count,
+ lacpdu->v2.actor_retry_count);
+ lacp_port->partner_retry_count = lacpdu->v2.actor_retry_count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we allways expect the actor and partner to have same retry_count ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The actor fields in the lacpdu packet that we received refer to the partner fields in lacp_port. In other words, in the LACP PDU that we received, the actor information refers to our peer's information (aka our partner).

Also fix a bug with `next_retry_count_change_time` getting calculated
repeatedly when a non-standard retry count is sent.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
yxieca
yxieca previously approved these changes Feb 22, 2023
@saiarcot895
Copy link
Contributor Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Also fix a logic error when filling in the retry count fields in the
LACP PDU packet.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
First, reply with the most recently received LACP PDU version. This is
so that if one side is sending both the old and new versions, then the
peer side will send both versions to keep both listeners happy. This
also allows for using the new version packets even when the retry count
is 3 on both sides. This can only be triggered remotely; there's no
local command to force a particular version to be used.

Second, reset the 60-second timer preventing the retry count from being
reset back to 3 every time a new version PDU is received. This keeps
that setting persistent for longer with the downside of a delayed
response to the retry count being reset to 3.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
After getting a LACPDU that uses a diferent LACPDU version, immediately
send a LACPDU back.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
+ lacp->cfg.retry_count = LACP_CFG_DFLT_RETRY_COUNT;
+ } else if (tmp < 3) {
+ teamd_log_err("\"retry_count\" value is out of its limits.");
+ return -EINVAL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably generate the error message here and continue with retry count of LACP_CFG_DFLT_RETRY_COUNT, and proceed with LACP_CFG_DFLT_RETRY_COUNT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Date: 2023-01-18 14:26:36 -0800

After setting the retry count to some custom value, if a normal LACP
packet comes in without a custom retry count, don't reset it back to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this description correct? Should this be don't reset the retry count back to old/default retry count?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

if (err)
return err;
lacp_port->lacp->cfg.retry_count = LACP_CFG_DFLT_RETRY_COUNT;
+ lacp_port->last_received_lacpdu_version = 0x01;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the last received version set to 0x01?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in the case where the LAG has gone down (changed to either expired or defaulted state); at that point, it's better to assume that the peer doesn't understand 0xf1, in case the peer changed, or the peer OS version changed to some older version; otherwise, the LAG may never come up.

Comment on lines +67 to +70
+ if (clock_gettime(CLOCK_MONOTONIC, &monotonic_time)) {
+ err = errno;
+ teamd_log_err("%s: unable to get current time: %s", lacp_port->tdport->ifname, strerror(err));
+ return -err;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this if block doing anything? I don't see monotonic_time that is collected here being used anywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clock_gettime returns -1 for failure and 0 for success; that means if it gets into the if-block, then clock_gettime failed for some reason.

+ if (clock_gettime(CLOCK_MONOTONIC, &monotonic_time)) {
+ err = errno;
+ teamd_log_err("%s: unable to get current time: %s", lacp_port->tdport->ifname, strerror(err));
+ return -err;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens to retry count if getting time fails? Will/should this be fatal for teamd?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, with the current code, this would mean the LACPDU packet wouldn't even get processed, which would mean that the 90-second (or however many seconds based on the retry count) timer won't get refreshed, and the LAG would go down.

The timers that teamd uses also use CLOCK_MONOTONIC, so I'm inclined to think that if that call fails 3 times, something is messed up with the system.

}

+ if (lacp_port->last_received_lacpdu_version != lacpdu->version_number) {
+ teamd_log_dbg(lacp_port->ctx, "%s: LACPDU version changed from %u to %u",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation here looks not consistent w/ rest of the indentation in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

+ lacpdu->version_number);
+ lacp_port->last_received_lacpdu_version = lacpdu->version_number;
+ // Force-send a LACPDU packet acknowledging change in version
+ err = lacpdu_send(lacp_port);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same packet w/ changed version is sent here. Don't you also want to change retry count in the sent packet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new retry count is in this packet (it would've been updated in the previous if-else block, the lacp_port->partner_retry_count field).

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
saiarcot895 added a commit to sonic-net/sonic-utilities that referenced this pull request Jun 2, 2023
* Add CLI configuration options for teamd retry count feature

Add a SONiC CLI to more easily configure the retry count for port
channels. This effectively acts like a wrapper around the underlying
teamdctl command.

Also add a python script that'll be installed into
/usr/local/bin/teamd_increase_retry_count.py that will detect if the
peer device likely supports this teamd feature (based on LLDP neighbor
info) and increases the teamd retry count to 5 in preparation for warm
upgrade. This script requires sudo to run.

This is tied to sonic-net/sonic-buildimage#13453.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add test for error case from teamd when it's not running

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix up test cases

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add some error handling if teamdctl doesn't exist

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add probe functionality and sending current LACPDU packet functionality

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Check to see if the retry count feature is enabled before doing a get or set

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add option to only send probe packets or only change retry count

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Call the teamd retry count script if doing a warm-reboot

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix pycheck errors, and disable scapy's IPv6 and verbose mode

Scapy's IPv6 support appears to have caused some issues with older
versions of scapy, which may be present on older SONiC images.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Make teamd retry count support optional

Don't fail warm reboot if teamd retry count support doesn't happen to be
present.

Also use fastfast-reboot for Mellanox devices.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Address review comments, and restructure code to increase code coverage

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Address some review comments

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Replace tabs with spaces

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Verify that expected keys are present in the data returned from teamdctl

Also update a failure message in the warm-reboot script if the retry
count script fails.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix TimeoutExpired undefined error

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add ability to mock subprocess calls (at a limited level)

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Return an actual subprocess object, and add a test for checking timeout

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Change variable syntax

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix set being accessed with an index

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add option to warm-reboot script to control if teamd retry count is required or not

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Move the teamd retry count check to before orchagent

This is so that in the case of the teamd retry count check failing,
there's fewer changes that happen on the system (it'll fail faster).

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Move retry count script start to be prior to point-of-no-return

This doesn't need to be after the point-of-no-return, since this will
detach and be sending LACPDUs on its own.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Set executable bit

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Address PR comments

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Change to case-insensitive string contains check

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Make sure the global abort variable is used

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

---------

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Copy link
Contributor

@judyjoseph judyjoseph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I see that the cutom retry is sent based on the RX of LACP PDU. Is there a way user can set this explicitly to override the default retry of 3 ?

@saiarcot895
Copy link
Contributor Author

The user can specify any custom retry count from the CLI (there's a sonic-utilities PR that was merged in with support for this). However, after some period of time, or some state changes (such as the link going down or the LACP session going down), it will reset back to 3 to comply with the current LACP standard.

@yxieca yxieca merged commit d466994 into sonic-net:master Jun 9, 2023
@saiarcot895 saiarcot895 deleted the teamd-retry-count branch June 27, 2023 01:29
pdhruv-marvell pushed a commit to pdhruv-marvell/sonic-utilities that referenced this pull request Aug 23, 2023
…t#2642)

* Add CLI configuration options for teamd retry count feature

Add a SONiC CLI to more easily configure the retry count for port
channels. This effectively acts like a wrapper around the underlying
teamdctl command.

Also add a python script that'll be installed into
/usr/local/bin/teamd_increase_retry_count.py that will detect if the
peer device likely supports this teamd feature (based on LLDP neighbor
info) and increases the teamd retry count to 5 in preparation for warm
upgrade. This script requires sudo to run.

This is tied to sonic-net/sonic-buildimage#13453.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add test for error case from teamd when it's not running

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix up test cases

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add some error handling if teamdctl doesn't exist

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add probe functionality and sending current LACPDU packet functionality

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Check to see if the retry count feature is enabled before doing a get or set

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add option to only send probe packets or only change retry count

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Call the teamd retry count script if doing a warm-reboot

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix pycheck errors, and disable scapy's IPv6 and verbose mode

Scapy's IPv6 support appears to have caused some issues with older
versions of scapy, which may be present on older SONiC images.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Make teamd retry count support optional

Don't fail warm reboot if teamd retry count support doesn't happen to be
present.

Also use fastfast-reboot for Mellanox devices.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Address review comments, and restructure code to increase code coverage

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Address some review comments

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Replace tabs with spaces

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Verify that expected keys are present in the data returned from teamdctl

Also update a failure message in the warm-reboot script if the retry
count script fails.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix TimeoutExpired undefined error

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add ability to mock subprocess calls (at a limited level)

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Return an actual subprocess object, and add a test for checking timeout

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Change variable syntax

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix set being accessed with an index

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add option to warm-reboot script to control if teamd retry count is required or not

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Move the teamd retry count check to before orchagent

This is so that in the case of the teamd retry count check failing,
there's fewer changes that happen on the system (it'll fail faster).

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Move retry count script start to be prior to point-of-no-return

This doesn't need to be after the point-of-no-return, since this will
detach and be sending LACPDUs on its own.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Set executable bit

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Address PR comments

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Change to case-insensitive string contains check

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Make sure the global abort variable is used

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

---------

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…et#13453)

Why I did it
This is to add support for specifying custom retry counts for LACP sessions. This is to make warmboot easier on low-storage and low-memory platforms, by allowing more than 90 seconds of downtime.

How I did it
How to verify it
Tested manually with these cases:

Verify that changing the retry count using teamdctl PortChannel101 state item set runner.retry_count 5 takes effect
Verify that the retry count change actually affects when the LAG goes down by forcefully killing teamd on one side (i.e. setting the retry count to 5 causes the LAG to go down after 150 seconds)
Verify that the retry count gets reset to 3 after the LAG goes down for whatever reason
Verify that the retry count gets reset to 3 after some period of time (30 seconds * retry count)
Test cases are in sonic-net/sonic-mgmt#7961 and sonic-net/sonic-mgmt#8152.


Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants