Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Workaround] EvpnRemoteVnip2pOrch warmboot check failure #2626

Merged
merged 3 commits into from
Feb 13, 2023
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions orchagent/vxlanorch.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2347,6 +2347,14 @@ bool EvpnRemoteVnip2pOrch::addOperation(const Request& request)
return true;
}

EvpnNvoOrch* evpn_orch = gDirectory.get<EvpnNvoOrch*>();
jcaiMR marked this conversation as resolved.
Show resolved Hide resolved
auto vtep_ptr = evpn_orch->getEVPNVtep();
if (!vtep_ptr) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow convention. { -> new line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will update in latest diff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

SWSS_LOG_WARN("Remote VNI add: Source VTEP not found. remote=%s vid=%d",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this log say during warm boot ? Does it point to anything valid ?
ie what is the remote ip and what is the vid ?
If it is some spurious data then we need to investigate that as this can be seen on any of the other tables in other OAs.

Quite clearly the null check cannot be the fix especially in a non EVPN scenario.
We would need to see why is this being called when there is no explanation for the presence of an entry in the VXLAN_REMOTE_VNI table.

Copy link
Contributor Author

@jcaiMR jcaiMR Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcaiMR @prsunny Do we know why EvpnRemoteVnip2pOrch::addOperation is called in non EVPN scenario. This will generally be called when VXLAN_REMOTE_VNI table is present. Is it populated in non evpn scenario?

Yes, the issue happened in non-evpn scenario, seems in function OrchDaemon::init(), we directly create evpn_remote_vni_orch and push into m_orchList.

 if (vxlan_tunnel_orch->isDipTunnelsSupported())
    {
        EvpnRemoteVnip2pOrch* evpn_remote_vni_orch = new EvpnRemoteVnip2pOrch(m_applDb, APP_VXLAN_REMOTE_VNI_TABLE_NAME);
        gDirectory.set(evpn_remote_vni_orch);
        m_orchList.push_back(evpn_remote_vni_orch);
    }
    else
    {
        EvpnRemoteVnip2mpOrch* evpn_remote_vni_orch = new EvpnRemoteVnip2mpOrch(m_applDb, APP_VXLAN_REMOTE_VNI_TABLE_NAME);
        gDirectory.set(evpn_remote_vni_orch);
        m_orchList.push_back(evpn_remote_vni_orch);
    }

And warmboot check function OrchDaemon::warmRestartCheck( ) checked all orch agents in m_orchList, and finally call addOperation() for each orch agent. I didn't find any check logic for VXLAN_REMOTE_VNI table before walking orch agent list, Do I missed something ?
attached some trace during warmboot:

Thread 1 "orchagent" hit Breakpoint 1, EvpnRemoteVnip2mpOrch::addOperation (this=0x55ecbd1e7af0, request=...) at vxlanorch.cpp:2476
2476    vxlanorch.cpp: No such file or directory.
(gdb) bt
#0  EvpnRemoteVnip2mpOrch::addOperation (this=0x55ecbd1e7af0, request=...) at vxlanorch.cpp:2476
#1  0x000055ecbb8d1c60 in Orch2::doTask (this=0x55ecbd1e7af0, consumer=...) at orch.cpp:1063
#2  0x000055ecbb8d5a9e in Consumer::drain (this=0x55ecbd1e85b0) at orch.cpp:241
#3  Consumer::drain (this=0x55ecbd1e85b0) at orch.cpp:238
#4  Consumer::execute (this=0x55ecbd1e85b0) at orch.cpp:235
#5  0x000055ecbb8c54c9 in OrchDaemon::start (this=this@entry=0x55ecbd0bfd80) at orchdaemon.cpp:755
#6  0x000055ecbb843fe0 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:735
(gdb) c

Having the p2premotevniorch in the orchList does not explain why there was an event. Could you dump the contents of the VXLAN_REMOTE_VNI table before the WB command is issued ? Does it have any entries ?

keys VXLAN_REMOTE_VNI

Thanks for comments. we don't have EVPN configured and "sonic-db-cli APPL_DB keys "VXLAN_REMOTE_VNI*"" is empty.
image

Do you mean if VXLAN_REMOTE_VNI table is not exists, there will no event to trigger EvpnRemoteVnip2pOrch/EvpnRemoteVnip2mpOrch addOperation ? Then here seems we need dig more to see what kind of events from Consumer table right ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes addOperation cannot be called unless the VXLAN_REMOTE_VNI table is populated.

My understanding is that warmRestartCheck will not call addOperation.
It just checks the sizes of the m_toSync map and if it is non zero then returns a false.

So one possible explanation is there was an earlier operation which resulted in the VXLAN_REMOTE_VNI being populated and resulting in a failure that you saw.
Under valid EVPN VXLAN configuration one is not expected to see this failure.

WB operation later on results in the readiness check failing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for suggestion, will have some further debug to find the event trigger.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srj102 I agree with your comments. However, I think we need to include this work-around for the short term. @jcaiMR please leave the issue open after this PR merged. (remove the fix PR comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, have modified the title and removed Fix #. Will use other PR for the root cause solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srj102 are you ok to remove the blocker on this PR?

remote_vtep.c_str(), vlan_id);
return true;
}

VxlanTunnelOrch* tunnel_orch = gDirectory.get<VxlanTunnelOrch*>();
Port tunnelPort, vlanPort;

Expand Down