Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syncd APPLY_VIEW failure causes Orchagent crash after warm reboot #6069

Closed
vaibhavhd opened this issue Dec 1, 2020 · 17 comments
Closed

Syncd APPLY_VIEW failure causes Orchagent crash after warm reboot #6069

vaibhavhd opened this issue Dec 1, 2020 · 17 comments
Assignees
Labels
Master Branch Quality P0 Priority of the issue

Comments

@vaibhavhd
Copy link
Contributor

vaibhavhd commented Dec 1, 2020

oacrash.zip

Description

Orchagent crashes while in APPLY_VIEW:

  1. DUT undergoes a warm-boot
  2. Orchagent state restore finishes. OA notifies syncd to APPLY_VIEW.
  3. In APPLY_VIEW, SAI APISAI_API_HOSTIF:brcm_sai_remove_hostif_trap fails with remove hostif trap failed with error -17
  4. syncd finds that ASIC is in inconsistent state due to the status being SAI_STATUS_OBJECT_IN_USE
  5. syncd sends sendShutdownRequest to orchagent.

Steps to reproduce the issue:

  1. Install latest SONiC. (If in latest already, simply warm-reboot).
  2. Warm reboot.
  3. Soon after warm-reboot, the BGP neighbors become not established. From syslog errors are seen.

Describe the results you received:

Nov 30 16:27:47.150401 str-7260cx3-acs-7 WARNING syncd#syncd: :- processFlexCounterEvent: port VID oid:0x18000000000d33, was not found (probably port was removed/splitted) and will remove from counters now
Nov 30 16:27:47.150636 str-7260cx3-acs-7 NOTICE syncd#syncd: :- removeBufferPool: Trying to remove nonexisting buffer pool 0x18000000000d33 from flex counter BUFFER_POOL_WATERMARK_STAT_COUNTER
Nov 30 16:27:47.794572 str-7260cx3-acs-7 NOTICE swss#orchagent: :- postBake: Start MirrorOrch post-baking
Nov 30 16:27:47.794572 str-7260cx3-acs-7 NOTICE swss#orchagent: :- setWarmStartState: orchagent warm start state changed to restored
Nov 30 16:27:47.794572 str-7260cx3-acs-7 NOTICE swss#orchagent: :- warmRestoreAndSyncUp: Orchagent state restore done
Nov 30 16:27:47.794572 str-7260cx3-acs-7 NOTICE swss#orchagent: :- syncd_apply_view: Notify syncd APPLY_VIEW
Nov 30 16:27:47.794572 str-7260cx3-acs-7 NOTICE swss#orchagent: :- notifySyncd: sending syncd: APPLY_VIEW

Nov 30 16:27:48.804081 str-7260cx3-acs-7 NOTICE syncd#syncd: :- asicGetWithOptimizedRemoveOperations: moved 12 REMOVE operations upper in stack from total 56 operations
Nov 30 16:27:48.804081 str-7260cx3-acs-7 NOTICE syncd#syncd: :- asicGetWithOptimizedRemoveOperations: optimizing asic remove operations took 0.000022 sec
Nov 30 16:27:48.805116 str-7260cx3-acs-7 WARNING syncd#syncd: [0] SAI_API_HOSTIF:_brcm_sai_set_cpu_queue_shaper:11856 Set CPU Queue 0 shaping: cir 0, cbs 0
Nov 30 16:27:48.805207 str-7260cx3-acs-7 ERR syncd#syncd: [0] SAI_API_HOSTIF:_brcm_sai_remove_hostif_trap:6682 trap id 16384 in use
Nov 30 16:27:48.805207 str-7260cx3-acs-7 ERR syncd#syncd: [0] SAI_API_HOSTIF:brcm_sai_remove_hostif_trap:1581 remove hostif trap failed with error -17.
Nov 30 16:27:48.805207 str-7260cx3-acs-7 ERR syncd#syncd: [0] SAI_API_HOSTIF:brcm_sai_remove_hostif_trap:1581 remove hostif trap failed with error -17.
Nov 30 16:27:48.805218 str-7260cx3-acs-7 ERR syncd#syncd: :- asic_handle_generic: remove SAI_OBJECT_TYPE_HOSTIF_TRAP RID: oid:0x2200000002 VID oid:0x22000000000bf2 failed: SAI_STATUS_OBJECT_IN_USE
Nov 30 16:27:48.805230 str-7260cx3-acs-7 ERR syncd#syncd: :- asic_process_event: failed to execute api: remove, key: SAI_OBJECT_TYPE_HOSTIF_TRAP:oid:0x22000000000bf2, status: SAI_STATUS_OBJECT_IN_USE

Nov 30 16:27:48.805328 str-7260cx3-acs-7 ERR syncd#syncd: :- executeOperationsOnAsic: Error while executing asic operations, ASIC is in inconsistent state: :- asic_process_event: failed to execute api: remove, key: SAI_OBJECT_TYPE_HOSTIF_TRAP:oid:0x22000000000bf2, status: SAI_STATUS_OBJECT_IN_USE
Nov 30 16:27:48.805328 str-7260cx3-acs-7 ERR syncd#syncd: :- executeOperationsOnAsic: Error while executing asic operations, ASIC is in inconsistent state: :- asic_process_event: failed to execute api: remove, key: SAI_OBJECT_TYPE_HOSTIF_TRAP:oid:0x22000000000bf2, status: SAI_STATUS_OBJECT_IN_USE
Nov 30 16:27:48.916476 str-7260cx3-acs-7 ERR syncd#syncd: :- run: Runtime error: :- asic_process_event: failed to execute api: remove, key: SAI_OBJECT_TYPE_HOSTIF_TRAP:oid:0x22000000000bf2, status: SAI_STATUS_OBJECT_IN_USE
Nov 30 16:27:48.916476 str-7260cx3-acs-7 ERR syncd#syncd: :- run: Runtime error: :- asic_process_event: failed to execute api: remove, key: SAI_OBJECT_TYPE_HOSTIF_TRAP:oid:0x22000000000bf2, status: SAI_STATUS_OBJECT_IN_USE

Nov 30 16:27:48.916476 str-7260cx3-acs-7 NOTICE syncd#syncd: :- sendShutdownRequest: sending switch_shutdown_request notification to OA for switch: oid:0x21000000000000
Nov 30 16:27:48.916867 str-7260cx3-acs-7 ERR swss#orchagent: :- syncd_apply_view: Failed to notify syncd APPLY_VIEW -1
Nov 30 16:27:48.917083 str-7260cx3-acs-7 NOTICE swss#orchagent: :- uninitialize: begin
Nov 30 16:27:48.917083 str-7260cx3-acs-7 NOTICE swss#orchagent: :- uninitialize: begin
Nov 30 16:27:48.917083 str-7260cx3-acs-7 NOTICE swss#orchagent: :- ~RedisChannel: join ntf thread begin
Nov 30 16:27:48.917165 str-7260cx3-acs-7 NOTICE syncd#syncd: :- sendShutdownRequestAfterException: notification send successfull
Nov 30 16:27:48.917181 str-7260cx3-acs-7 ERR swss#orchagent: :- on_switch_shutdown_request: Syncd stopped

Nov 30 16:27:49.930274 str-7260cx3-acs-7 INFO swss#/supervisor-proc-exit-listener: Process orchagent exited unxepectedly. Terminating supervisor...

Describe the results you expected:
No Orchagent crash and BGP neighbors remain established.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**

```
(paste your output here)
```

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```
@kcudnik
Copy link
Contributor

kcudnik commented Dec 1, 2020

@vaibhavhd I will need full recordings both on cold boot and warm boot. there are some operations on ASIC:
moved 12 REMOVE operations upper in stack from total 56 operations
so the state has been modified, and i will need those recordings to find out what's happening.
also i will need full syslog from that warm boot

@qiluo-msft It could be related to OA, but i will need logs to confirm that

@kcudnik
Copy link
Contributor

kcudnik commented Dec 1, 2020

from attached logs, seems like issues is when removing trap, this can happen if orchagent after warm boot didn't generated traps, or traps were not matched on comparison logic, or trap is actually in use on SAI_HOSTIF_TABLE_ENTRY_ATTR_TRAP_ID.
But i need logs and recordings to verify this

@kcudnik
Copy link
Contributor

kcudnik commented Dec 1, 2020

also use of SAI_HOSTIF_TABLE_ENTRY_ATTR_TRAP_ID was recently added in OA (7b76d2e (Sudharsan Dhamal Gopalarathnam 2020-11-11 11:47:11 -0800)

@lguohan
Copy link
Collaborator

lguohan commented Dec 1, 2020

@shi-su , does this affect virtual switch warm reboot for master branch?

@vaibhavhd
Copy link
Contributor Author

@lguohan
Copy link
Collaborator

lguohan commented Dec 1, 2020

@vaibhavhd , I think you can do analysis on the failure. during warm reboot, the expectation is that no change operation will be generated. if we are generating any change event, some thing is wrong in the first place. maybe you can do more analysis and figure out the root cause of this issue instead of relying kamil for the analysis.

@lguohan
Copy link
Collaborator

lguohan commented Dec 1, 2020

it looks like the root cause is step 3, the remove hostif trap failed, and it should not. This is a broadcom SAI issue. @gechiang , can you track this SAI issue? I believer there is already a broadcom SAI issue opened.

@kcudnik
Copy link
Contributor

kcudnik commented Dec 1, 2020

from the attached log

asicGetWithOptimizedRemoveOperations: moved 12 REMOVE operations upper in stack from total 56 operations

there are 56 operations that were generated to be performed on the ASIC, so there is definitely something was changed, it could be related to the sonic-swss commit (7b76d2e ) that changed hostif trap removal

@vaibhavhd
Copy link
Contributor Author

Sure Guohan. Also, I just tried on VS testbed. The issue is not seen on the latest image for virtual switch.

@kcudnik
Copy link
Contributor

kcudnik commented Dec 1, 2020

it looks like the root cause is step 3, the remove hostif trap failed, and it should not. This is a broadcom SAI issue. @gechiang , can you track this SAI issue? I believer there is already a broadcom SAI issue opened.

I looked briefly to brcm source code, and they track references for hostif_trap_group, and in this case they returned SAI_STATUS_OBJECT_IN_USE is when trap is installed, but i don't know that that internally means, and internally installed is set to FALSE in very convoluted way, im not able to tell when this is happening, and i could suggest that removal of hostif_trap is in use on the attribute i mentioned SAI_HOSTIF_TABLE_ENTRY_ATTR_TRAP_ID. To confirm this i will analyse logs provided

There is specific logic for hostif_trap_group (for default one) in comparison logic, but not for hostif_trap, this is generic case like for any other object. Error could be by comparison logic issue, but i'm 99% sure that is not the case, also taking into account that we had many issues in brcm SAI implementation in the past related to warm boot.

I will dive into logs and see if i can spot anything interesting and let you know

@kcudnik
Copy link
Contributor

kcudnik commented Dec 1, 2020

Sure Guohan. Also, I just tried on VS testbed. The issue is not seen on the latest image for virtual switch.

If it's not seen on VS, then it's most likely brcm issue, since in VS case references are counted both on syncd/VS side and on sairedis side, so if there would be some issue with reference count (object in use error) then it would show up on VS as well.

Also do you get that trap error on brcm hardware consistently ? time after time ?

@vaibhavhd
Copy link
Contributor Author

Sure Guohan. Also, I just tried on VS testbed. The issue is not seen on the latest image for virtual switch.

If it's not seen on VS, then it's most likely brcm issue, since in VS case references are counted both on syncd/VS side and on sairedis side, so if there would be some issue with reference count (object in use error) then it would show up on VS as well.

Also do you get that trap error on brcm hardware consistently ? time after time ?

The trap error is consistent on every warm-reboot. I am doing binary search, and found that 503 did not have this error. I will post more information if I find the change which caused this.

$ sudo zgrep -a "remove hostif trap failed with error" ./syslog*
./syslog:Dec  1 19:06:31.916219 str2-7260cx3-acs-9 ERR syncd#syncd: [0] SAI_API_HOSTIF:brcm_sai_remove_hostif_trap:1581 remove hostif trap failed with error -17.
./syslog.1:Dec  1 18:06:43.349429 str2-7260cx3-acs-9 ERR syncd#syncd: [0] SAI_API_HOSTIF:brcm_sai_remove_hostif_trap:1581 remove hostif trap failed with error -17.
./syslog.2.gz:Dec  1 17:35:38.562958 str2-7260cx3-acs-9 ERR syncd#syncd: [0] SAI_API_HOSTIF:brcm_sai_remove_hostif_trap:1581 remove hostif trap failed with error -17.

@vaibhavhd
Copy link
Contributor Author

Sorry, I spoke too soon regarding 503 being good. Image 503 also has the same issue, it is just that Bgp neighbors are maintained for longer period of time than the latest. In 507, the neighborship was impacted within 90s of warm reboot.

But, in 503, the neighborship goes down after around 4 mins. Upon checking the logs, the error "remove hostif trap failed with error" is seen too.

@kcudnik
Copy link
Contributor

kcudnik commented Dec 1, 2020

i grepped logs and SAI_HOSTIF_TABLE_ENTRY_ATTR_TRAP_ID is out of the question, since it's not present in recordings, so that's ruled out

i made syslog analysis attached, and 36 are just by setting fdb learning mode to HW (that's normal, since before warm boot OA disables learning mode, to not loose any fdb learning notification during warm boot), and other operations:

15588 Nov 30 16:27:48.804053 str-7260cx3-acs-7 NOTICE syncd#syncd: :- executeOperationsOnAsic: operations on SAI_OBJECT_TYPE_HOSTIF_TRAP: 11
15589 Nov 30 16:27:48.804053 str-7260cx3-acs-7 NOTICE syncd#syncd: :- executeOperationsOnAsic: operations on SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP: 5
15590 Nov 30 16:27:48.804069 str-7260cx3-acs-7 NOTICE syncd#syncd: :- executeOperationsOnAsic: operations on SAI_OBJECT_TYPE_POLICER: 3

all 11 hostif_traps are removed, 4 hosif trap groups are removed and 1 trap group is updated (probably default trap group), and 3 policers are removed, none of those objects are requested to be created, after warm boot
order of api execution is also correct, so there is no issue with reference count on sairedis side.

2484 Nov 30 16:27:48.649985 str-7260cx3-acs-7 WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP on current view 5 is differnt than on temporary view: 1
2485 Nov 30 16:27:48.649985 str-7260cx3-acs-7 WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_POLICER on current view 3 is differnt than on temporary view: 0
2486 Nov 30 16:27:48.650753 str-7260cx3-acs-7 WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_HOSTIF_TRAP on current view 12 is differnt than on temporary view: 1

from sairedis recording, OA didn't recreated traps, trap groups and policers after warm boot, that's why syncd attempted to remove them, this indicates that there is OA bug related to traps/policer/trap groups as well as a bug in brcm SAI that fails to remove trap which is not used

as a side note, we have some numbers how long it takes to create 10k routes on this particular platform and cpu:

 91 Nov 30 16:22:11.258453 str-7260cx3-acs-7 NOTICE syncd#syncd: :- inc: 10000 (calls 10000) Syncd::processBulkEntry::processEntry(route_entry) CREATE op took: 2102 ms
 92 Nov 30 16:22:13.912334 str-7260cx3-acs-7 NOTICE syncd#syncd: :- inc: 11684 (calls 22) Syncd::syncUpdateRedisBulkQuadEvent op took: 601 ms

so it takes 2 seconds for 10k routes to create, but keep in mind that this is 1 by 1 not bulk, and database write takes 600ms for 10k

side note 2:

Nov 30 16:26:48.596003 str-7260cx3-acs-7 NOTICE syncd#syncd: :- removeAllSwitches: removing switch RID oid:0xb971112100000000 took 7.155058 sec

switch remove for warm boot takes 7 seconds, this probably internally dumps some vendor database to disk/switch hw etc, but quite a time, i think we should not worry about this

@prsunny
Copy link
Contributor

prsunny commented Dec 2, 2020

@dgsudharsan for visibility. This is an issue with removing IP2ME trap which is being tracked with Brcm

@gechiang
Copy link
Collaborator

gechiang commented Dec 2, 2020

The I2ME trap BRCM issue is tracked with CS00011466224

@vaibhavhd
Copy link
Contributor Author

Tested on latest of Master. This issue is not seen anymore and the fix is provided by new SAI patch - #6244

Tested:

SONiC Software Version: SONiC.HEAD.163-ee8c3d34
Distribution: Debian 10.7
Kernel: 4.19.0-9-2-amd64
Build commit: ee8c3d34
Build date: Mon Dec 21 07:35:11 UTC 2020

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Master Branch Quality P0 Priority of the issue
Projects
None yet
Development

No branches or pull requests

6 participants