Add port flap count and last flap timestamp to APPL_DB #3052

Merged
merged 6 commits into from
Mar 12, 2024

Conversation

prgeor
Contributor

@prgeor prgeor commented Feb 17, 2024

What I did
Add port flap count and last flap timestamp to APPL_DB

Why I did it
To be able to get the number of times the port has flapped over its lifetime
To record the last time the port went UP/DOWN

How I verified it

  1. Verified the flap count increments when the port operational status changes
  2. Verified the timestamp is recorded when the port last went UP/DOWN
  3. Verified the flap count and timestamp are restored on warm-reboot
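
The behaviour verified above can be sketched as follows. This is a minimal sketch, not the PR's implementation: `FakePortTable`, `PortState`, `formatTime`, and `onOperStatusChange` are hypothetical names, the field names `last_up_time` and `last_down_time` and the timestamp format are assumptions (only `flap_count` appears in this conversation), and a `std::map` stands in for the APPL_DB port table.

```cpp
#include <cstdint>
#include <ctime>
#include <map>
#include <string>

// Hypothetical stand-in for the APPL_DB port table: key -> (field -> value).
using FakePortTable = std::map<std::string, std::map<std::string, std::string>>;

// Hypothetical, simplified per-port state.
struct PortState
{
    std::string alias;
    bool operUp = false;
    uint64_t flapCount = 0;
};

// Format a UTC timestamp as text; the format is an assumption for this sketch.
inline std::string formatTime(std::time_t t)
{
    char buf[32] = {0};
    const std::tm *utc = std::gmtime(&t);
    std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", utc);
    return buf;
}

// Called on every operational status notification for a port.
// Returns true if the status actually changed.
inline bool onOperStatusChange(FakePortTable &table, PortState &port, bool up, std::time_t now)
{
    if (port.operUp == up)
    {
        return false; // no transition, nothing to record
    }
    port.operUp = up;
    ++port.flapCount;

    // Persist the new count and the time of this transition.
    auto &row = table[port.alias];
    row["flap_count"] = std::to_string(port.flapCount);
    row[up ? "last_up_time" : "last_down_time"] = formatTime(now);
    return true;
}
```

In this sketch a repeated notification with the same status is ignored, so only real UP/DOWN transitions increment the counter.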

Details if related


Collaborator

@dgsudharsan left a comment

Please add UT. You can add gtest to simulate port oper change event

@@ -5307,9 +5339,14 @@ bool PortsOrch::initializePort(Port &port)
{
operStatus = fvValue(i);
}

if (fvField(i) == "flap_count")
Collaborator

Is this to handle reconciliation during warm/fast reboot?

Contributor Author

@prgeor force-pushed the portflap branch 4 times, most recently from 6226cdb to d3abdc2 (February 24, 2024 08:14)
Contributor Author

prgeor commented Feb 24, 2024

Please add UT. You can add gtest to simulate port oper change event

@dgsudharsan please check the test added

@dgsudharsan
Collaborator

Please add UT. You can add gtest to simulate port oper change event

@dgsudharsan please check the test added

Can you please check why the test is failing in the pipeline?

@prgeor force-pushed the portflap branch 3 times, most recently from 1192b5f to fe41dd4 (February 26, 2024 18:07)
orchagent/portsorch.cpp (outdated, resolved)
}
catch (const std::exception &e)
{
SWSS_LOG_ERROR("Failed to get port (%s) flap_count: %s", port.m_alias.c_str(), e.what());
Collaborator

This will log a spurious error during warm boot (WB) from older versions that don't have this field in the DB. Please change it to DEBUG, as this is not a real error scenario.

Contributor Author

@prsunny if the value is not there in the DB, flapCount default value = "0" (see line 5317 above)
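
The default-value behaviour described in this reply can be sketched as below. This is a minimal sketch under assumptions: `FieldValues`, `RestoreResult`, and `restoreFlapCount` are hypothetical names, and the sketch returns a flag where the diff above logs through `SWSS_LOG_ERROR`.

```cpp
#include <cstdint>
#include <exception>
#include <string>
#include <utility>
#include <vector>

// Hypothetical field/value list, similar in shape to what a DB read returns.
using FieldValues = std::vector<std::pair<std::string, std::string>>;

struct RestoreResult
{
    uint64_t flapCount = 0;
    bool parseFailed = false; // true only if a value was present but malformed
};

inline RestoreResult restoreFlapCount(const FieldValues &fvs)
{
    std::string flapCount = "0"; // default when the field is absent (e.g. older image)
    for (const auto &fv : fvs)
    {
        if (fv.first == "flap_count")
        {
            flapCount = fv.second;
        }
    }

    RestoreResult result;
    try
    {
        result.flapCount = std::stoull(flapCount);
    }
    catch (const std::exception &)
    {
        // Reached for a malformed or out-of-range value, not for a missing field.
        result.parseFailed = true;
    }
    return result;
}
```

With the `"0"` default, a table that lacks `flap_count` parses cleanly and never reaches the `catch` branch; only a present but unparsable value does.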

orchagent/portsorch.cpp (resolved)
@prgeor force-pushed the portflap branch 2 times, most recently from e28c83d to b9fa264 (March 2, 2024 00:42)
Signed-off-by: Prince George <prgeor@microsoft.com>
dgsudharsan previously approved these changes Mar 4, 2024
@prsunny
Collaborator

prsunny commented Mar 5, 2024

@prgeor, it seems this is causing WB failures. Can you please check?

    # Only WARM_RESTART_TABLE|orchagent state=reconciled operation may exist after port oper status change.
  assert orchStateCount == 1

E assert 2 == 1
E -2
E +1

@prgeor
Contributor Author

prgeor commented Mar 6, 2024

@prgeor, it seems this is causing WB failures. Can you please check?

    # Only WARM_RESTART_TABLE|orchagent state=reconciled operation may exist after port oper status change.
  assert orchStateCount == 1

E assert 2 == 1
E -2
E +1

@prgeor all tests now passed

@nazariig
Collaborator

nazariig commented Mar 6, 2024

@prgeor why does this have to use APPL_DB? Why not STATE_DB?

@@ -2102,4 +2170,4 @@ namespace portsorch_test
ASSERT_FALSE(bridgePortCalledBeforeLagMember); // bridge port created on lag before lag member was created
}

}
}
Collaborator

@prgeor please add a new line

@@ -147,6 +147,8 @@ class PortsOrch : public Orch, public Subject

bool setHostIntfsOperStatus(const Port& port, bool up) const;
void updateDbPortOperStatus(const Port& port, sai_port_oper_status_t status) const;
void updateDbPortFlapCount(Port& port, sai_port_oper_status_t pstatus);

Collaborator

@prgeor please remove extra line

Contributor Author

@nazariig removed

orchagent/portsorch.cpp (outdated, resolved)
@prgeor
Contributor Author

prgeor commented Mar 7, 2024

@prgeor why does this have to use APPL_DB? Why not STATE_DB?

@nazariig in order to reconcile the flap count after warm-reboot.
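
The reconciliation argument can be illustrated with a small round trip. This is a sketch under assumptions, not the orchagent code: `FakeAppDb` is a hypothetical in-memory stand-in for a database that survives the process restart, and `FlapTracker` is a hypothetical class whose in-memory counter is lost whenever the object is destroyed.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical store that outlives the "process" (here: the FlapTracker object).
using FakeAppDb = std::map<std::string, std::string>;

class FlapTracker
{
public:
    // On start-up, reconcile the in-memory counter from the persisted value, if any.
    explicit FlapTracker(FakeAppDb &db) : m_db(db)
    {
        auto it = m_db.find("flap_count");
        if (it != m_db.end())
        {
            m_flapCount = std::stoull(it->second);
        }
    }

    // Record one flap and persist the new total.
    void recordFlap()
    {
        ++m_flapCount;
        m_db["flap_count"] = std::to_string(m_flapCount);
    }

    uint64_t flapCount() const { return m_flapCount; }

private:
    FakeAppDb &m_db;
    uint64_t m_flapCount = 0;
};
```

Destroying one `FlapTracker` and constructing another over the same `FakeAppDb` models the restart: the second instance starts from the persisted count instead of zero.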

orchagent/port.h Outdated
@@ -164,6 +164,7 @@ class Port
uint32_t m_nat_zone_id = 0;
uint32_t m_vnid = VNID_NONE;
uint32_t m_fdb_count = 0;
uint32_t m_flap_count = 0;
Collaborator

@prgeor why not uint64? Do we expect an overflow? If so, should it be reflected in the log? Is any reaction from SWSS needed?

Contributor Author

@nazariig done
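
For context on the overflow question: unsigned counters in C++ wrap around to zero rather than saturating. The sketch below shows the difference between a 32-bit and a 64-bit counter; the helper `incrementWraps` and the constant `kMaxU32` are hypothetical names introduced only for illustration.

```cpp
#include <cstdint>
#include <limits>

// Returns true if incrementing a counter of unsigned type T from `value` wraps to zero.
template <typename T>
constexpr bool incrementWraps(T value)
{
    return static_cast<T>(value + 1) == 0;
}

// Largest value a 32-bit counter can hold, widened to 64 bits.
constexpr uint64_t kMaxU32 = std::numeric_limits<uint32_t>::max();
```

A 32-bit counter wraps after 4,294,967,295 increments, while a 64-bit counter holding the same value simply keeps counting.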

}
catch (const std::exception &e)
{
SWSS_LOG_ERROR("Failed to get port (%s) flap_count: %s", port.m_alias.c_str(), e.what());
Collaborator

@prgeor if we hit the exception, is it ok to still put the value to the DB?

Contributor Author

@nazariig fixed

{
SWSS_LOG_ERROR("Failed to get port (%s) flap_count: %s", port.m_alias.c_str(), e.what());
}
m_portTable->hset(port.m_alias, "flap_count", flapCount);
Collaborator

@prgeor is this a typo? flapCount -> port.m_flap_count

Contributor Author

@prgeor Mar 11, 2024

@nazariig the intention is to write the uint64_t value to the DB as a string
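
The conversion described here can be sketched as follows. Assumptions: `FakeTable`, its `hset` method, and `writeFlapCount` are hypothetical stand-ins that only mimic the key/field/value shape of the `m_portTable->hset(...)` call in the diff above. Values are stored as strings, so the 64-bit counter is converted with `std::to_string` first.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical table with a key/field/value setter.
struct FakeTable
{
    std::map<std::string, std::map<std::string, std::string>> rows;

    void hset(const std::string &key, const std::string &field, const std::string &value)
    {
        rows[key][field] = value;
    }
};

// Convert the numeric counter to its decimal string form and store it.
inline void writeFlapCount(FakeTable &table, const std::string &alias, uint64_t flapCount)
{
    table.hset(alias, "flap_count", std::to_string(flapCount));
}
```

Reading the value back with `std::stoull` recovers the original number, including the largest 64-bit value.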

@prsunny prsunny merged commit 0c62091 into sonic-net:master Mar 12, 2024
14 checks passed
yejianquan pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Apr 3, 2024
…2290)

Approach
What is the motivation for this PR?
swss introduced flap count and last flap time fields. The test fails because these two new fields mismatch, but they do not affect the correctness of the configlet test. sonic-net/sonic-swss#3052

How did you do it?
Add two fields to skip val.

How did you verify/test it?
E2E test using the sonic-swss updated image.

02:31:34 helpers.log_msg L0060 INFO | /var/src/sonic-mgmt-int/tests/configlet/util/common.py:359:02:31:34 patch_rm: compared dump state-db mismatch_cnt=0 msg=
02:31:34 helpers.log_msg L0060 INFO | /var/src/sonic-mgmt-int/tests/configlet/util/common.py:385:02:31:34 patch_rm: generic_patch_rm_t0: Succeeded to compare
02:31:34 helpers.log_msg L0060 INFO | /var/src/sonic-mgmt-int/tests/configlet/util/base_test.py:294:02:31:34 patch_rm: Test run is good!
PASSED [100%]
-------------------------------------------------------------------------------------------- live log teardown ---------------------------------------------------------------------------------------------

co-authorized by: jianquanye@microsoft.com
cscarpitta pushed a commit to cscarpitta/sonic-swss that referenced this pull request Apr 5, 2024
superchild pushed a commit to superchild/sonic-swss that referenced this pull request Apr 24, 2024
* Fixes mock test failure

* Fixes mock test run failure

fixes pipeline run failure

FAIL: p4orch_tests_usan
=======================

../../../orchagent/vrforch.cpp:113:41: runtime error: member call on
null pointer of type 'struct RouteOrch'
../../../orchagent/vrforch.cpp:113:41: runtime error: member access
within null pointer of type 'struct RouteOrch'
FAIL p4orch_tests_usan (exit status: 139)

* Fixed orchagent crash in VM with the Qos BUFFER_QUEUE|system-port|Queue-id-range config (sonic-net#3050)

* Fixed orchagent crash in VM with the Qos BUFFER_QUEUE|system-port|Queue-id-range config

* [intfsorch] Enable ipv6 proxy ndp along with proxy arp (sonic-net#3045)

* [intfsorch] Enable ipv6 proxy ndp along with proxy arp

setting SAI_VLAN_ATTR_UNKNOWN_MULTICAST_FLOOD_CONTROL_TYPE to
SAI_VLAN_FLOOD_CONTROL_TYPE_NONE when proxy arp is enabled. This fixes a
bug where ipv6 NS packets were flooding ports with duplicate packets. We
now set multicast flood type to none.

* Fix multi VLAN neighbor learning (sonic-net#3049)

What I did

When adding a new neighbor, check if the neighbor IP has already been learned on a different VLAN. If it has, remove the old neighbor entry before adding the new one.

Why I did it
On Gemini devices, if a neighbor IP moves from an active port in one VLAN to a second VLAN, then back to the first VLAN (with 3 different MAC addresses), orchagent will crash. Even though the MAC address of the last move is different from the first MAC address, orchagent believes the last MAC address to already be programmed in the hardware and tries to set an attribute of the entry which doesn't exist.

* [asan] Disable the "maybe-uninitialized" warning when compiled with ASAN enabled.

* Set HOST_TX_READY_NOTIFY attribute only after query capabilities(sonic-net#3070)

*Set HOST_TX_READY_NOTIFY attribute only after query capabilities

* [EVPN] Skip EVPN routes with invalid VNI or router mac field (sonic-net#3073)

* Skip EVPN routes with invalid VNI or router mac field

* Add port flap count and last flap timestamp to APPL_DB (sonic-net#3052)

* Add port flap count and last flap timestamp

* Add basic fabric link monitoring counters and states handling. (sonic-net#2988)

* Add basic fabric link monitoring counters and states handling.

* [Mellanox] Fix inconsistence in the shared headroom pool initialization (sonic-net#3057)

* Fix inconsistence in the shared headroom pool initialization

* Why I did it

During initialization, if SHP is enabled:

  1. the buffer pool sizes and xoff are initialized to 0, which means SHP is disabled
  2. but the buffer profiles already indicate SHP
  3. later on, the buffer pool sizes are updated with xoff being non-zero

If orchagent starts handling buffer configuration between steps 2 and 3, the buffer pools and profiles are inconsistent, which fails the Mellanox SAI sanity check.
To avoid this, SHP is indicated as enabled by setting very small buffer pool and SHP sizes.

* [acl] Add IN_PORTS qualifier for L3 table (sonic-net#3078)

* Apply IN_PORTS qualifiier for L3 table

Why I did it
IN_PORTS qualifier was allowed for L3 table in 202012 release and below. Changes in sonic-net#1982 removed that support leading to regression in some of our testcases. The following error was observed
ERR swss#orchagent: :- validateAclRuleMatch: Match SAI_ACL_ENTRY_ATTR_FIELD_IN_PORTS in rule RULE_1 is not supported by table DATAACL

* [bulker] add support for neighbor bulking (sonic-net#2768)

Adding support for sai_neighbor_api_t bulking in bulker.h

* [buffermgrd] Move switch-statement outside of if-statement in BufferMgr::doTask (sonic-net#3055)

* [buffermgr] Moved switch statement outside of if-statmement in Buffermgr::doTask

The switch statement which would normally erase buffer events was moved
to be inside the if-statement which would only enter if the event is a
SET event. This was introduced in commit e5329c39.

This would cause an infinite loop, since non-set events would never be
erased.

The switch statement has now been moved to occur outside the if,
allowing for non-set commands to be processed.

* [portsorch] process only updated APP_DB fields when port is already   created (sonic-net#3025)

* [portsorch] process only updated APP_DB fields when port is already created

What I did

Fixing an issue when setting some port attribute in APPL_DB triggers serdes parameters to be re-programmed with port toggling. Made portsorch to handle only those attributes that were pushed to APPL_DB, so that serdes programming happens only by xcvrd's request to do so.

* [Copp]Refactor coppmgr tests (sonic-net#3093)

What I did
Refactoring coppmgr mock tests

Why I did it
After migration to bookworm, coppmgr tests started failing due to the use of sudo commands.

* Revert "[acl] Add IN_PORTS qualifier for L3 table (sonic-net#3078)" (sonic-net#3092)

This reverts commit 9d4a3ad.
*Revert "[acl] Add IN_PORTS qualifier for L3 table"

* [orchagent] TWAMP Light orchagent implementation (sonic-net#2927)

* [orchagent] TWAMP Light orchagent implementation. (sonic-net#2927)
* What I did
Implemented the TWAMP Light feature according to the SONiC TWAMP Light HLD(sonic-net/SONiC#1320).

* Clang format change. (sonic-net#3080)

What I did
This PR has no real code change. It is purely clang formatting. It only applies to the P4Orch codes.
Commands that I run:
find orchagent/p4orch -name '*.h' -o -name '*.cpp' | xargs clang-format -i -style="{BasedOnStyle: Microsoft, DerivePointerAlignment: false}"

find orchagent -name response_publisher -o -name return_code.h | xargs clang-format -i -style="{BasedOnStyle: Microsoft, DerivePointerAlignment: false}"

* T2-VOQ-VS: Fix iBGP bringup issue  (sonic-net#3053)

* Fix iBGP bringup issue T2-vswitch
* On T2-VOQ chassis Emulation with multi-asic linecards, iBGP sessions dont come up. Related Issue: sonic-net/sonic-buildimage#18129

* [Fdbsyncd] Adding extern_learn flag with fdb entry so Kernel doesn't age out (sonic-net#2985)

* Adding extern_learn flag with fdb entry so that Kernel doesn't age out the MAC

* [Fdbsyncd] Adding extern_learn flag with fdb entry so Kernel doesn't age out

What I did
extern_learn flag is added while programming the fdb entry into the Kernel. This will make sure that kernel doesn't age out the fdb entry. (#15004)

How I did it
A flag extern_learn will be passed while programing the fdb entry. (#15004)

How to verify it
Tested MAC add/del to the Kernel from the local FDB entry. (#15004)

Signed-off-by: kishore.kunal@broadcom.com

---------

Signed-off-by: kishore.kunal@broadcom.com
Co-authored-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>

* Fix oper FEC retrieval after warmboot (sonic-net#3100)

Updating oper FEC status in state_db after warm-reboot as part of refresh port status call

* [EVPN]Fix fpmsyncd crash when EVPN type5 is received with bgp fib suppression enabled (sonic-net#3101)

* [EVPN]Fix fpmsyncd crash when EVPN type5 is received with bgp fib suppression enabled

* [portsorch] Handle TRANSCEIVER_INFO table on warm boot (sonic-net#3087)

* Add existing data from TRANSCEIVER_INFO table

* Introduce a new role for DPU-NPU Interconnect

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Co-authored-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>

* [p4orch] Clang format change. (sonic-net#3096)

What I did
[p4orch]  This PR has no real code change. It is purely clang formatting. 
It does the same as sonic-net#3080.

* [dash] fix ENI admin state update (sonic-net#3081)

* [dash] fix ENI admin state update

* Add force option for fabric port unisolate command (sonic-net#3089)

What I did
Add force option to the unisolate link command, so users can make the links not isolate if they want.
depends on sonic-net/sonic-buildimage#18447

* [twamporch] Explicitly initialize local variable (sonic-net#3115)

What I did
Explicitly initialized local variable.

Why I did it
We met below error message in sonic-buildimage armhf build (sonic-net/sonic-buildimage#18334)

* Add bookworm build to the PR checkers (sonic-net#3114)

What I did
Add a Bookworm build to the PR checkers. Also fix some Bookworm build errors that crept in.

Why I did it
Buildimage now builds swss for Bookworm, so the build needs to succeed.

* [ACL] Remove flex counter when updating ACL rule (sonic-net#3118)

What I did
This PR is to fix sonic-net/sonic-buildimage#18719

When ACL rule is created for the first time, a flex counter is created and registered. When the same ACL rule is being updated, the FlexCounter created before is not removed, and another FlexCounter is created and registered.

Why I did it
Fix the issue that FlexCounter is duplicated when updating existing ACL rule.

---------

Signed-off-by: kishore.kunal@broadcom.com
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Co-authored-by: saksarav-nokia <sakthivadivu.saravanaraj@nokia.com>
Co-authored-by: Nikola Dancejic <26731235+Ndancejic@users.noreply.github.com>
Co-authored-by: Lawrence Lee <lawlee@microsoft.com>
Co-authored-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>
Co-authored-by: noaOrMlnx <58519608+noaOrMlnx@users.noreply.github.com>
Co-authored-by: Lior Avramov <73036155+liorghub@users.noreply.github.com>
Co-authored-by: Prince George <45705344+prgeor@users.noreply.github.com>
Co-authored-by: jfeng-arista <98421150+jfeng-arista@users.noreply.github.com>
Co-authored-by: Stephen Sun <5379172+stephenxs@users.noreply.github.com>
Co-authored-by: Neetha John <nejo@microsoft.com>
Co-authored-by: Amir <mazora@marvell.com>
Co-authored-by: Stepan Blyshchak <38952541+stepanblyschak@users.noreply.github.com>
Co-authored-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>
Co-authored-by: xiaodong hu <32903206+huseratgithub@users.noreply.github.com>
Co-authored-by: mint570 <70396898+mint570@users.noreply.github.com>
Co-authored-by: Deepak Singhal <115033986+deepak-singhal0408@users.noreply.github.com>
Co-authored-by: KISHORE KUNAL <64033340+kishorekunal01@users.noreply.github.com>
Co-authored-by: Vivek <vivekreddykarri98@gmail.com>
Co-authored-by: Yakiv Huryk <62013282+Yakiv-Huryk@users.noreply.github.com>
Co-authored-by: Saikrishna Arcot <sarcot@microsoft.com>
Co-authored-by: bingwang-ms <66248323+bingwang-ms@users.noreply.github.com>
mssonicbld pushed a commit to mssonicbld/sonic-swss that referenced this pull request Jul 31, 2024
@mssonicbld
Collaborator

Cherry-pick PR to 202311: #3248

mssonicbld pushed a commit that referenced this pull request Jul 31, 2024
* Add port flap count and last flap timestamp
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Aug 5, 2024
…nic-net#12290)

mssonicbld pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Aug 5, 2024
…2290)

@justin-wong-ce

The recent backport to 202311 caused a regression on 202311.

The backported commit had an issue that was previously fixed with #3111 on master, which was not backported. Making a backport request to backport #3111 to 202311.
