[fast-reboot] LAGs go down when SIGINT is sent to teamd in fast-reboot #649

Closed
stepanblyschak opened this issue Sep 17, 2019 · 1 comment

Comments

@stepanblyschak
Contributor

Description

Steps to reproduce the issue

  1. show log -f
  2. sudo fast-reboot -v
  3. Observe the following in the logs:
Sep 17 12:42:54.172241 r-boxer-sw01 NOTICE admin: Stopping bgp ...
Sep 17 12:42:54.492741 r-boxer-sw01 INFO bgp#supervisord: fpmsyncd Connection lost, reconnecting...
Sep 17 12:42:54.492741 r-boxer-sw01 INFO bgp#supervisord: fpmsyncd Waiting for fpm-client connection...
Sep 17 12:42:54.950124 r-boxer-sw01 NOTICE syncd#syncd: :- saiUpdateSupportedQueueCounters: QUEUE_STAT_COUNTER: counter SAI_QUEUE_STAT_BYTES is not supported on queue oid:0x12700000e0015, rv: SAI_STATUS_ATTR_NOT_SUPPORTED_0
Sep 17 12:42:54.950124 r-boxer-sw01 NOTICE syncd#syncd: :- saiUpdateSupportedQueueCounters: QUEUE_STAT_COUNTER: counter SAI_QUEUE_STAT_DROPPED_PACKETS is not supported on queue oid:0x12700000e0015, rv: SAI_STATUS_ATTR_NOT_SUPPORTED_0
Sep 17 12:42:54.951646 r-boxer-sw01 NOTICE syncd#syncd: :- saiUpdateSupportedQueueCounters: QUEUE_STAT_COUNTER: counter SAI_QUEUE_STAT_DROPPED_BYTES is not supported on queue oid:0x12700000e0015, rv: SAI_STATUS_ATTR_NOT_SUPPORTED_0
Sep 17 12:42:55.026821 r-boxer-sw01 NOTICE admin: Stopped  bgp ...
Sep 17 12:42:55.952610 r-boxer-sw01 NOTICE syncd#syncd: :- saiUpdateSupportedQueueCounters: QUEUE_WATERMARK_STAT_COUNTER: counter SAI_QUEUE_STAT_SHARED_WATERMARK_BYTES is not supported on queue oid:0x12700000e0015, rv: SAI_STATUS_ATTR_NOT_SUPPORTED_0
Sep 17 12:42:55.952610 r-boxer-sw01 NOTICE syncd#syncd: :- setQueueCounterList: QUEUE_WATERMARK_STAT_COUNTER: queue oid:0x12700000e0015 does not has supported counters
Sep 17 12:42:56.018265 r-boxer-sw01 INFO dockerd[525]: time="2019-09-17T12:42:56.018093417Z" level=info msg="shim reaped" id=3b9a41d17ab49bcf255a6e355c9fcb1d4c0aa6a6c07e859233ae4b22e0d3c94a
Sep 17 12:42:56.028599 r-boxer-sw01 INFO dockerd[525]: time="2019-09-17T12:42:56.028392681Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 17 12:42:56.109064 r-boxer-sw01 INFO lldp.sh[6717]: 137
Sep 17 12:42:56.208734 r-boxer-sw01 INFO lldp.sh[18890]: lldp
Sep 17 12:42:56.217171 r-boxer-sw01 INFO systemd[1]: Stopped LLDP container.
Sep 17 12:42:56.295917 r-boxer-sw01 INFO bgp#supervisord 2019-09-17 12:42:54,491 INFO exited: zebra (terminated by SIGKILL; not expected)
Sep 17 12:42:56.295917 r-boxer-sw01 INFO bgp#supervisord 2019-09-17 12:42:54,926 INFO exited: bgpd (terminated by SIGKILL; not expected)
Sep 17 12:42:56.528306 r-boxer-sw01 INFO kernel: [  697.339712] sx_netdev: sx_netdev_stop: called
Sep 17 12:42:56.528344 r-boxer-sw01 INFO kernel: [  697.339720] sx_netdev: sx_netdev_stop: exit
Sep 17 12:42:56.541209 r-boxer-sw01 INFO kernel: [  697.357048] PortChannel0003: Port device Ethernet120 removed
Sep 17 12:42:56.541242 r-boxer-sw01 INFO kernel: [  697.357714] sx_netdev: sx_netdev_stop: called
Sep 17 12:42:56.541247 r-boxer-sw01 INFO kernel: [  697.357722] sx_netdev: sx_netdev_stop: exit
Sep 17 12:42:56.562515 r-boxer-sw01 INFO kernel: [  697.378628] PortChannel0001: Port device Ethernet112 removed
Sep 17 12:42:56.567387 r-boxer-sw01 INFO kernel: [  697.380772] sx_netdev: sx_netdev_stop: called
Sep 17 12:42:56.567418 r-boxer-sw01 INFO kernel: [  697.380776] sx_netdev: sx_netdev_stop: exit
Sep 17 12:42:56.583361 r-boxer-sw01 INFO kernel: [  697.400078] PortChannel0004: Port device Ethernet124 removed
Sep 17 12:42:56.589663 r-boxer-sw01 INFO kernel: [  697.400405] sx_netdev: sx_netdev_stop: called
Sep 17 12:42:56.589696 r-boxer-sw01 INFO kernel: [  697.400409] sx_netdev: sx_netdev_stop: exit
Sep 17 12:42:56.599531 r-boxer-sw01 INFO kernel: [  697.416067] PortChannel0002: Port device Ethernet116 removed
Sep 17 12:42:56.601357 r-boxer-sw01 NOTICE swss#orchagent: :- removeLagMember: Remove member Ethernet120 from LAG PortChannel0003 lid:200000000059f lmid:1b000000000630
Sep 17 12:42:56.603441 r-boxer-sw01 INFO syncd#supervisord: syncd Sep 17 12:42:56 NOTICE  SAI_LAG: mlnx_sai_lag.c[1313]- mlnx_remove_lag_member: Removing LAG member (10d00,2,0)
Sep 17 12:42:56.605346 r-boxer-sw01 NOTICE swss#orchagent: :- removeLagMember: Remove member Ethernet112 from LAG PortChannel0001 lid:2000000000595 lmid:1b00000000062d
Sep 17 12:42:56.605991 r-boxer-sw01 DEBUG kernel: [  697.421755] sx_core: sx_core_dispatch_event: dispatching the event
Sep 17 12:42:56.606014 r-boxer-sw01 INFO kernel: [  697.421761] sx_netdev: sx_netdev_event: Got LAG_OPER_STATE_UPDATE event. dev = ffffbd67494dc000, dev_id = 1, lag_id = 2, oper_state = 0
Sep 17 12:42:56.606018 r-boxer-sw01 WARNING kernel: [  697.421764] sx_netdev_set_lag_oper_state: Called for lag_id 2  status DOWN
Sep 17 12:42:56.608450 r-boxer-sw01 NOTICE swss#orchagent: :- removeLagMember: Remove member Ethernet124 from LAG PortChannel0004 lid:20000000005a4 lmid:1b00000000062e
Sep 17 12:42:56.609927 r-boxer-sw01 WARNING swss#orchagent: :- doLagMemberTask: Member Ethernet124 not found in LAG PortChannel0004 lid:20000000005a4 lmid:0,
Sep 17 12:42:56.612668 r-boxer-sw01 NOTICE swss#orchagent: :- setRouterIntfsMtu: Set router interface PortChannel0003 MTU to 9100
Sep 17 12:42:56.614502 r-boxer-sw01 NOTICE swss#orchagent: :- removeLagMember: Remove member Ethernet116 from LAG PortChannel0002 lid:200000000059a lmid:1b00000000062f
Sep 17 12:42:56.618191 r-boxer-sw01 NOTICE swss#orchagent: :- setRouterIntfsMtu: Set router interface PortChannel0001 MTU to 9100
Sep 17 12:42:56.629353 r-boxer-sw01 ERR ntpd[16324]: routing socket reports: No buffer space available
Sep 17 12:42:57.397371 r-boxer-sw01 NOTICE syncd#syncd: :- threadFunction:  span < 0 = -298126355 at 1568724177386873
Sep 17 12:42:57.397371 r-boxer-sw01 NOTICE syncd#syncd: :- threadFunction:  new span  = 784626
Sep 17 12:42:57.609191 r-boxer-sw01 NOTICE swss#orchagent: :- setRouterIntfsMtu: Set router interface PortChannel0004 MTU to 9100
Sep 17 12:42:57.620070 r-boxer-sw01 NOTICE swss#orchagent: :- setRouterIntfsMtu: Set router interface PortChannel0002 MTU to 9100
Sep 17 12:42:57.621860 r-boxer-sw01 NOTICE swss#orchagent: :- setRouterIntfsMtu: Set router interface PortChannel0001 MTU to 9100
Sep 17 12:42:57.621860 r-boxer-sw01 NOTICE swss#orchagent: :- setRouterIntfsMtu: Set router interface PortChannel0003 MTU to 9100
Sep 17 12:42:57.623102 r-boxer-sw01 NOTICE swss#orchagent: :- setRouterIntfsMtu: Set router interface PortChannel0002 MTU to 9100
Sep 17 12:42:57.623102 r-boxer-sw01 NOTICE swss#orchagent: :- setRouterIntfsMtu: Set router interface PortChannel0004 MTU to 9100
Sep 17 12:42:57.624649 r-boxer-sw01 NOTICE swss#orchagent: :- setRouterIntfsMtu: Set router interface PortChannel0001 MTU to 9100
Sep 17 12:42:57.624649 r-boxer-sw01 NOTICE swss#orchagent: :- setRouterIntfsMtu: Set router interface PortChannel0004 MTU to 9100
Sep 17 12:42:57.629908 r-boxer-sw01 NOTICE swss#orchagent: :- setRouterIntfsMtu: Set router interface PortChannel0004 MTU to 9100
Sep 17 12:42:57.629908 r-boxer-sw01 ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel0004
Sep 17 12:42:57.636164 r-boxer-sw01 ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel0004

Traffic stops working immediately after LAGs go down.
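
To make the timing easier to check across runs, here is a rough Python sketch that measures how long after the first `Stopping bgp` line each LAG member is removed by the kernel. It is not part of SONiC. The regular expressions are assumptions based only on the log lines quoted above, `lag_removal_delays` is a hypothetical helper name, and `Stopping bgp` is used as the reference point only because it is the earliest shutdown marker visible in this excerpt. The year defaults to 2019 because syslog timestamps omit it.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the quoted syslog, e.g. "Sep 17 12:42:56.541209"
TS = r"(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\.\d+)"

# "... NOTICE admin: Stopping bgp ..."
STOPPING_BGP_RE = re.compile(r"^" + TS + r"\s+\S+\s+NOTICE admin: Stopping bgp")

# "... INFO kernel: [  697.357048] PortChannel0003: Port device Ethernet120 removed"
REMOVED_RE = re.compile(
    r"^" + TS + r"\s+\S+\s+INFO kernel: \[\s*[\d.]+\]\s+"
    r"(?P<lag>PortChannel\d+): Port device (?P<port>\S+) removed"
)


def parse_ts(text, year):
    # Syslog omits the year, so prepend one (assumption) before parsing.
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S.%f")


def lag_removal_delays(lines, year=2019):
    """Return (lag, port, seconds after the first 'Stopping bgp') tuples."""
    start = None
    events = []
    for line in lines:
        m = STOPPING_BGP_RE.match(line)
        if m and start is None:
            start = parse_ts(m.group("ts"), year)
            continue
        m = REMOVED_RE.match(line)
        if m and start is not None:
            delta = (parse_ts(m.group("ts"), year) - start).total_seconds()
            events.append((m.group("lag"), m.group("port"), delta))
    return events


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python lag_delays.py < syslog.txt
    for lag, port, delta in lag_removal_delays(sys.stdin):
        print(f"{lag} lost {port} {delta:.3f}s after 'Stopping bgp'")
```

Lines that match neither pattern are ignored, so the full `show log` output can be piped in unchanged.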

Describe the results you received
LAGs go down when SIGINT is sent to teamd. Traffic stops working.

Describe the results you expected
LAGs should not go down this early in the fast-reboot sequence.

Additional information you deem important (e.g. issue happens only occasionally)

Output of show version

SONiC Software Version: SONiC.HEAD.79-75104bb3
Distribution: Debian 9.11
Kernel: 4.9.0-9-2-amd64
Build commit: 75104bb3
Build date: Mon Sep 16 09:20:42 UTC 2019
Built by: johnar@jenkins-worker-3

Platform: x86_64-mlnx_msn2010-r0
HwSKU: ACS-MSN2010
ASIC: mellanox
Serial Number: MT1749X10061
Uptime: 12:45:36 up 2 min,  1 user,  load average: 6.52, 3.15, 1.21

Docker images:
REPOSITORY                 TAG                 IMAGE ID            SIZE
docker-syncd-mlnx          HEAD.79-75104bb3    d12be7902ea4        369MB
docker-syncd-mlnx          latest              d12be7902ea4        369MB
docker-lldp-sv2            HEAD.79-75104bb3    9a631e93be43        298MB
docker-lldp-sv2            latest              9a631e93be43        298MB
docker-router-advertiser   HEAD.79-75104bb3    fd2c06f88d27        281MB
docker-router-advertiser   latest              fd2c06f88d27        281MB
docker-dhcp-relay          HEAD.79-75104bb3    9f7e0a6bfdfd        289MB
docker-dhcp-relay          latest              9f7e0a6bfdfd        289MB
docker-database            HEAD.79-75104bb3    12dd426bb82e        281MB
docker-database            latest              12dd426bb82e        281MB
docker-sflow               HEAD.79-75104bb3    6c36be7dce2f        303MB
docker-sflow               latest              6c36be7dce2f        303MB
docker-teamd               HEAD.79-75104bb3    2ceb9665b7f6        302MB
docker-teamd               latest              2ceb9665b7f6        302MB
docker-snmp-sv2            HEAD.79-75104bb3    f538aadee337        334MB
docker-snmp-sv2            latest              f538aadee337        334MB
docker-orchagent           HEAD.79-75104bb3    5b4f589c0ca5        321MB
docker-orchagent           latest              5b4f589c0ca5        321MB
docker-fpm-frr             HEAD.79-75104bb3    95bbf9c7b251        319MB
docker-fpm-frr             latest              95bbf9c7b251        319MB
docker-sonic-telemetry     HEAD.79-75104bb3    85598c8154c7        304MB
docker-sonic-telemetry     latest              85598c8154c7        304MB
docker-platform-monitor    HEAD.79-75104bb3    a7866f173cd5        560MB
docker-platform-monitor    latest              a7866f173cd5        560MB

@stepanblyschak
Contributor Author

Fixed by the revert in #650.
Still need to investigate why it caused a long disruption.

stepanblyschak pushed a commit to stepanblyschak/sonic-utilities that referenced this issue Apr 28, 2022
Update the meta code to support DNAT Pool changes (sonic-net#616)
[syncd] Fix notification on shutdown request (sonic-net#637)
Advance the submodule head of SAI (sonic-net#641)
Add new line in sai_meta_log_syncd fprintf call (sonic-net#649)
Fix Warmboot Issue when upgraded Image SAI return Switch Internal
OID not accounted in previous image. (sonic-net#654)