Syncd APPLY_VIEW failure causes Orchagent crash after warm reboot #6069
@vaibhavhd I will need full recordings for both cold boot and warm boot. There are some operations on the ASIC. @qiluo-msft It could be related to OA, but I will need logs to confirm that. |
From the attached logs, it seems the issue occurs when removing a trap. This can happen if orchagent did not regenerate traps after warm boot, if traps were not matched by the comparison logic, or if the trap is actually in use via SAI_HOSTIF_TABLE_ENTRY_ATTR_TRAP_ID. |
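The first two scenarios above come down to a view diff after warm boot. The sketch below is a hypothetical Python model of that idea, not the actual syncd comparison logic; all function and object names are illustrative:

```python
# Toy model of warm-boot view comparison: objects present in the old
# (pre-reboot) ASIC view but never recreated in the new view are scheduled
# for removal. Illustrative only -- not the real syncd implementation.

def plan_trap_operations(old_view, new_view):
    """Return (removes, creates) for hostif trap identifiers."""
    removes = sorted(old_view - new_view)   # only in old view -> remove
    creates = sorted(new_view - old_view)   # only in new view -> create
    return removes, creates

# If OA recreates no traps after warm boot, every old trap gets removed:
old = {"trap_arp", "trap_lldp", "trap_ip2me"}
new = set()  # nothing recreated

removes, creates = plan_trap_operations(old, new)
print(removes)  # all three old traps scheduled for removal
print(creates)  # []
```

In this model, a removal that should have been a no-op only appears when the new view is missing objects that OA was expected to recreate, which is exactly the symptom discussed later in the thread.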
Also, use of SAI_HOSTIF_TABLE_ENTRY_ATTR_TRAP_ID was recently added in OA (7b76d2e, Sudharsan Dhamal Gopalarathnam, 2020-11-11 11:47:11 -0800). |
@shi-su, does this affect virtual switch warm reboot on the master branch? |
@kcudnik I have attached logs at https://github.com/Azure/sonic-buildimage/files/5623970/oacrash.zip |
@vaibhavhd, I think you can do analysis on the failure. During warm reboot, the expectation is that no change operation will be generated. If we are generating any change event, something is wrong in the first place. Maybe you can do more analysis and figure out the root cause of this issue instead of relying on Kamil for the analysis. |
It looks like the root cause is step 3: the remove hostif trap failed, and it should not. This is a Broadcom SAI issue. @gechiang, can you track this SAI issue? I believe there is already a Broadcom SAI issue opened. |
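For context, the failure path that follows the bad removal can be sketched as below. SAI_STATUS_OBJECT_IN_USE is -17 in the SAI status codes, which matches the error -17 in the log. This is a hedged Python model, not the real syncd C++ code; `apply_remove` and `notify_shutdown` are hypothetical names:

```python
# Illustrative model of syncd's apply-view failure handling: any non-success
# status from a remove during APPLY_VIEW leaves the ASIC out of sync with the
# intended view, so syncd requests an orchagent shutdown.

SAI_STATUS_SUCCESS = 0
SAI_STATUS_OBJECT_IN_USE = -17  # matches the -17 seen in the log

def apply_remove(remove_fn, oid, notify_shutdown):
    status = remove_fn(oid)
    if status != SAI_STATUS_SUCCESS:
        # ASIC is now in an inconsistent state relative to the applied view
        notify_shutdown(f"remove failed on {oid} with status {status}")
        return False
    return True

events = []
ok = apply_remove(lambda oid: SAI_STATUS_OBJECT_IN_USE,
                  "hostif_trap:0x1", events.append)
print(ok)      # False
print(events)  # one shutdown request recorded
```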
From the attached log, there are 56 operations that were generated to be performed on the ASIC, so something was definitely changed. It could be related to the sonic-swss commit (7b76d2e) that changed hostif trap removal. |
Sure, Guohan. Also, I just tried on a VS testbed. The issue is not seen on the latest image for virtual switch. |
I looked briefly at the Broadcom source code. They track references for hostif_trap_group, and in this case SAI_STATUS_OBJECT_IN_USE is returned when the trap is "installed", but I don't know what that means internally; "installed" is set to FALSE in a very convoluted way, and I'm not able to tell when that happens. I could suggest that the hostif_trap being removed is in use via the attribute I mentioned, SAI_HOSTIF_TABLE_ENTRY_ATTR_TRAP_ID. To confirm this I will analyse the logs provided.

There is specific logic in the comparison logic for the default hostif_trap_group, but not for hostif_trap; that is a generic case, like for any other object. The error could be a comparison-logic issue, but I'm 99% sure that is not the case, also taking into account that we have had many issues in the Broadcom SAI implementation related to warm boot in the past. I will dive into the logs, see if I can spot anything interesting, and let you know. |
If it's not seen on VS, then it's most likely a Broadcom issue, since in the VS case references are counted both on the syncd/VS side and on the sairedis side, so if there were some issue with the reference count (object-in-use error), it would show up on VS as well. Also, do you get that trap error on Broadcom hardware consistently, time after time? |
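The reference counting being discussed can be modelled with a minimal counter: an object that still has live references cannot be removed and yields SAI_STATUS_OBJECT_IN_USE (-17). This is a sketch under assumed names, not vendor SAI or sairedis code:

```python
# Minimal reference-count model of why removing an in-use object fails:
# a trap group that still has member traps cannot be removed until every
# member releases its reference. Illustrative only.

SAI_STATUS_SUCCESS = 0
SAI_STATUS_OBJECT_IN_USE = -17

class RefCounter:
    def __init__(self):
        self.refs = {}  # object id -> reference count

    def create(self, oid):
        self.refs[oid] = 0

    def reference(self, oid):
        self.refs[oid] += 1

    def release(self, oid):
        self.refs[oid] -= 1

    def remove(self, oid):
        if self.refs[oid] > 0:
            return SAI_STATUS_OBJECT_IN_USE  # still referenced elsewhere
        del self.refs[oid]
        return SAI_STATUS_SUCCESS

rc = RefCounter()
rc.create("trap_group")
rc.reference("trap_group")      # a trap still points at the group
print(rc.remove("trap_group"))  # -17: object in use
rc.release("trap_group")
print(rc.remove("trap_group"))  # 0: removal succeeds
```

If both the VS layer and sairedis kept counters like this, a genuine bookkeeping bug would produce the same object-in-use failure on VS too, which is the argument being made above.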
The trap error is consistent on every warm reboot. I am doing a binary search, and found that image 503 did not have this error. I will post more information if I find the change which caused this. |
Sorry, I spoke too soon regarding 503 being good. Image 503 also has the same issue; it is just that BGP neighbors are maintained for a longer period of time than on the latest image. In 507, the neighborship was impacted within 90 s of warm reboot, but in 503 the neighborship goes down after around 4 minutes. Upon checking the logs, the error "remove hostif trap failed with error" is seen there too. |
I grepped the logs, and SAI_HOSTIF_TABLE_ENTRY_ATTR_TRAP_ID is out of the question, since it's not present in the recordings, so that's ruled out. I made the attached syslog analysis: 36 of the operations are just setting the FDB learning mode back to HW (that's normal, since before warm boot OA disables learning mode, so as not to lose any FDB learning notifications during warm boot). The other operations: all 11 hostif_traps are removed, 4 hostif trap groups are removed, 1 trap group is updated (probably the default trap group), and 3 policers are removed. None of those objects are requested to be created after warm boot.

From the sairedis recording, OA did not recreate the traps, trap groups, and policers after warm boot; that's why syncd attempted to remove them. This indicates there is an OA bug related to traps/policers/trap groups, as well as a bug in Broadcom SAI that fails to remove a trap which is not in use.

As a side note, we have some numbers on how long it takes to create 10k routes on this particular platform and CPU: it takes 2 seconds to create 10k routes, but keep in mind this is one by one, not bulk, and the database write takes 600 ms for 10k.

Side note 2: switch remove for warm boot takes 7 seconds; this probably internally dumps some vendor database to disk/switch HW, etc. Quite a long time, but I think we should not worry about it. |
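The timing figures quoted above work out to the following per-object costs; this is plain arithmetic on the numbers in the comment, nothing more:

```python
# Back-of-the-envelope arithmetic on the quoted figures:
# 10k routes created one-by-one in 2 s, and a 600 ms database write for 10k.
routes = 10_000
create_seconds = 2.0
db_write_seconds = 0.6

create_rate = routes / create_seconds          # routes per second
per_route_us = create_seconds / routes * 1e6   # microseconds per route
per_db_write_us = db_write_seconds / routes * 1e6

print(create_rate)      # 5000.0 routes/s
print(per_route_us)     # ~200 us per route (1-by-1, not bulk)
print(per_db_write_us)  # ~60 us per DB write
```

So per route the create path costs roughly 200 microseconds, with bulk operations expected to amortize this further.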
@dgsudharsan for visibility. This is an issue with removing the IP2ME trap, which is being tracked with Broadcom. |
The IP2ME trap Broadcom issue is tracked as CS00011466224. |
Tested on the latest master. This issue is not seen anymore; the fix is provided by the new SAI patch #6244. Tested: |
Attachment: oacrash.zip
Description
Orchagent crashes while in APPLY_VIEW:
- During APPLY_VIEW, SAI API SAI_API_HOSTIF: brcm_sai_remove_hostif_trap fails with "remove hostif trap failed with error -17".
- syncd finds that "ASIC is in inconsistent state" due to the status being SAI_STATUS_OBJECT_IN_USE.
- syncd sends sendShutdownRequest to orchagent.
Steps to reproduce the issue:
Describe the results you received:
Describe the results you expected:
No Orchagent crash and BGP neighbors remain established.
Additional information you deem important (e.g. issue happens only occasionally):