
Aporeto is throwing an error indicating a token is invalid. #633

Open
jleach opened this issue Feb 20, 2020 · 7 comments
Assignees: jleach
Labels: bug (Something isn't working), security (Platform Security)

Comments

jleach commented Feb 20, 2020

In various namespaces we see a rejected flow when the kubelet tries to make a health check. The error message is: "token (The token was invalid.)". We thought this might be related to a deadlock in the enforcers, but that issue was fixed by Aporeto in v1.1015.10. After updating to v1.1015.10 we are still seeing the error message.

While not directly correlated in time, when this error message is present, teams using Python/Flask often report that the connection between the API and SSO fails, causing the SSO JWT to become invalid.

(Screenshot attached: Screen Shot 2020-02-20 at 11.41.22 AM.png)

jleach added the bug (Something isn't working) and security (Platform Security) labels Feb 20, 2020
jleach self-assigned this Feb 25, 2020
jleach commented Feb 25, 2020

Feb 11th: Worked with DXC to bring down the Aporeto enforcers. While the enforcers were down, Karim did not see any of the errors in his application(s) where they had been unable to validate a JWT with SSO.

jleach commented Feb 25, 2020

Did more debugging on this with Aporeto today. We brought down Sysdig, and this cleared up Karim's application errors (problems connecting to SSO to validate the JWT), but the rejected-flow error message persists.

It appears that when either the enforcers or Sysdig are disabled, the application errors go away. Will work with William and Aporeto this afternoon to gather packet logs for a more in-depth analysis.

jleach commented Mar 2, 2020

We tracked this issue down to an overflow of iptables' NFQUEUE on the three infra nodes. They are processing so much traffic that the queue fills and packets get dropped. As a fix, Aporeto has suggested we increase the number of queues to better load-balance across the enforcer pods. The solution is currently being added to our playbook.

The log message that indicates this type of problem (from an infra node):

[root@ociopf-p-170 ~]# journalctl --since "24 hours ago" | grep nf_queue | head
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)

The solution is to change both txQueueCount and rxQueueCount from 2 to 4 on the infra nodes only. This will increase the amount of memory used by ~70MB per pod.

  # transmitter queue size
  # NOTE: change only when instructed to do so by Aporeto support
  txQueueCount: 2

  # receiver queue size
  # NOTE: change only when instructed to do so by Aporeto support
  rxQueueCount: 2
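
For the infra-node enforcer DaemonSet, the values would then look like this (a sketch only, assuming the same values-file layout as the snippet above):

  # transmitter queue count (infra-node enforcers only)
  # NOTE: changed from 2 to 4 per Aporeto support
  txQueueCount: 4

  # receiver queue count (infra-node enforcers only)
  # NOTE: changed from 2 to 4 per Aporeto support
  rxQueueCount: 4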

j-pye added a commit that referenced this issue Mar 3, 2020
Adjust Aporeto Playbook for enforcer-infra DaemonSet
- tx/rx_queue_count set to 4 for infra enforcers
j-pye commented Mar 4, 2020

Overview of the Issue and Suggested Solution
As a fix, Marcus told us to increase the tx/rx queue count to 4 for the enforcers on the infra nodes.
This was aimed at resolving our "token" issues as well as the dropped connections and dropped queue items that can be seen in the Enforcer Event Logs for PROD in the Aporeto console.

In order to do this we had to split our enforcer DaemonSet into two DaemonSets: one for the master and app nodes, and one for the infra nodes.

Each DaemonSet was set up with the aporeto-enforcerd nodeSelector and a required node affinity that either targets the infra nodes or excludes them (sketched below).
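
For reference, a rough sketch of the required node affinity on the infra-node DaemonSet; the infra label key below is an assumption, so substitute whatever label the cluster actually uses to mark infra nodes:

  # Sketch only: node affinity for the infra-node enforcer DaemonSet.
  # "node-role.kubernetes.io/infra" is an assumed label; use the label
  # actually applied to the infra nodes in this cluster.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/infra
                operator: Exists
  # The master/app DaemonSet uses the same block with "operator: DoesNotExist".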


Deployment Lab/Prod
Deployment to our Lab cluster had no issues; however, our deployment to the Prod cluster had significant issues that required us to disable Aporeto. Some enforcers started successfully, but others failed to start within 5 seconds, causing those enforcers to crash.


Questions - Post Incident

  1. Could the enforcers starting on master and infra nodes at the same time cause issues for other enforcers or the cluster?
  2. Was the Cluster Backend unable to handle all of the requests from several enforcers starting at once, or was this due to the number of pods on each node?

Notes:

  • Dating back to January (possibly earlier), there have been thousands of enforcer errors a day, most of which appear to relate to the tx/rx queue count we were aiming to fix.

jleach commented Mar 4, 2020

Two support cases opened with Aporeto:

246 - Improving the helm chart(s) to better allow for this change to be deployed.

jleach commented Mar 12, 2020

Blocked by #645.

jleach commented Mar 12, 2020

This issue appears to be intermittent on compute nodes (Ref. #devops-aporeto Feb 28). We had a conversation today and decided to push ahead with the parameter changes on the infra nodes only; if the same issue is confirmed on the compute nodes, we'll look at also enabling the config there. It looks like we have enough memory to accommodate this change.

jleach removed the blocked (Work cannot progress ATM) label Mar 12, 2020