
Aporeto is throwing an error indicating a token is invalid. #633

Open
jleach opened this issue Feb 20, 2020 · 7 comments
Assignees: jleach
Labels: bug (Something isn't working), security (Platform Security)

Comments

jleach commented Feb 20, 2020

In various namespaces we see a rejected flow when the kubelet tries to make a health check. The error message is: "token (The token was invalid.)". We thought this might be related to a deadlock in the enforcers, but that issue was fixed by Aporeto in v1.1015.10. After updating to v1.1015.10 we are still seeing the error message.

While not directly correlated in time, when this error message is present, teams using Python/Flask often report that the connection between the API and SSO fails, causing the SSO JWT to become invalid.

(Screenshot attached: Screen Shot 2020-02-20 at 11.41.22 AM.png)

jleach added the bug (Something isn't working) and security (Platform Security) labels Feb 20, 2020
jleach self-assigned this Feb 25, 2020
jleach commented Feb 25, 2020

Feb 11th: Worked with DXC to bring down the Aporeto enforcers. While the enforcers were down, Karim did not see any of the errors in his application(s) where they had been unable to validate a JWT with SSO.

jleach commented Feb 25, 2020

Did more debugging on this with Aporeto today. We brought down Sysdig, and this cleared up Karim's application errors (problems connecting to SSO to validate the JWT), but the rejected-flow error message persists.

It appears that when either the enforcers or Sysdig are disabled, the application errors go away. Will work with William and Aporeto this afternoon to gather packet logs for a more in-depth analysis.

jleach commented Mar 2, 2020

We tracked this issue down to an overflow of iptables' NFQUEUE on the three infra nodes. They are processing so much traffic that the queue fills and packets get dropped. As a fix, Aporeto has suggested we increase the number of queues to better load-balance across the enforcer pods. The solution is currently being added to our playbook.

The log message that indicates this type of problem (from an infra node):

[root@ociopf-p-170 ~]# journalctl --since "24 hours ago" | grep nf_queue | head
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)
Feb 27 12:29:38 ociopf-p-170.dmz kernel: nf_queue: full at 500 entries, dropping packets(s)

The solution is to change both txQueueCount and rxQueueCount from 2 to 4 on the infra nodes only. This will increase the amount of memory used by ~70MB per pod.

  # transmitter queue size
  # NOTE: change only when instructed to do so by Aporeto support
  txQueueCount: 2

  # receiver queue size
  # NOTE: change only when instructed to do so by Aporeto support
  rxQueueCount: 2
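
For the infra-node enforcer DaemonSet, the values would then look like this (a sketch only, assuming the same values-file layout as the snippet above):

  # transmitter queue count (infra-node enforcers only)
  # NOTE: changed from 2 to 4 per Aporeto support
  txQueueCount: 4

  # receiver queue count (infra-node enforcers only)
  # NOTE: changed from 2 to 4 per Aporeto support
  rxQueueCount: 4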

j-pye added a commit that referenced this issue Mar 3, 2020
Adjust Aporeto Playbook for enforcer-infra DaemonSet
- tx/rx_queue_count set to 4 for infra enforcers
j-pye commented Mar 4, 2020

Overview of the Issue and Suggested Solution
As a fix, Marcus told us to increase the tx/rx queue count to 4 for the enforcers on the infra nodes.
This was aimed at resolving our "token" issues as well as the dropped connections and dropped queue items that can be seen in the Enforcer Event Logs for PROD in the Aporeto console.

In order to do this we had to split our enforcer DaemonSet into two DaemonSets: one for the master and app nodes, and one for the infra nodes.

Each DaemonSet was set up with the aporeto-enforcerd nodeSelector and a required node affinity that either targets the infra nodes or excludes them (sketched below).
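
For reference, a rough sketch of the required node affinity on the infra-node DaemonSet; the infra label key below is an assumption, so substitute whatever label the cluster actually uses to mark infra nodes:

  # Sketch only: node affinity for the infra-node enforcer DaemonSet.
  # "node-role.kubernetes.io/infra" is an assumed label; use the label
  # actually applied to the infra nodes in this cluster.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/infra
                operator: Exists
  # The master/app DaemonSet uses the same block with "operator: DoesNotExist".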


Deployment Lab/Prod
Deployment to our Lab cluster had no issues; however, our deployment to the Prod cluster had significant issues that required us to disable Aporeto. Some enforcers started successfully, but others failed to start within 5 seconds, causing those enforcers to crash.


Questions - Post Incident

  1. Could the enforcers starting on master and infra nodes at the same time cause issues for other enforcers or the cluster?
  2. Was the Cluster Backend unable to handle all of the requests from several enforcers starting at once, or was this due to the number of pods on each node?

Notes:

  • Dating back to January (possibly earlier), there have been thousands of enforcer errors a day, most of which appear to relate to the tx/rx queue count we were aiming to fix.

jleach commented Mar 4, 2020

Two support cases opened with Aporeto:

246 - Improving the helm chart(s) to better allow for this change to be deployed.

jleach commented Mar 12, 2020

Blocked by #645.

jleach commented Mar 12, 2020

This issue appears to be intermittent on compute nodes (Ref. #devops-aporeto Feb 28). We had a conversation today and decided to push ahead with the parameter changes on the infra nodes only; if the same issue is confirmed on the compute nodes, we'll look at also enabling the config there. It looks like we have enough memory to accommodate this change.

jleach removed the blocked (Work cannot progress ATM) label Mar 12, 2020