KEP-3178: Cleaning up IPTables Chain Ownership #3179
Conversation
/cc
> Kube-proxy will be updated to not use `KUBE-MARK-DROP`, as described
> above. (This change is unconditional; it is _not_ feature-gated.) We
why no gate? What happens if we've done something disastrously dumb - how can a cluster back out of this change?
Since this just changes kube-proxy's internals without actually changing the visible functionality, I was considering it a cleanup/bugfix, not a feature. (So if we do something disastrously dumb, and somehow don't catch it, then affected users would have to downgrade, just like with any other major bug.)
If we feature-gate the kube-proxy change, then we'd have to push the rest of the schedule out by two releases, because you can't have the kubelet feature gate go to beta until 2 releases after kube-proxy stops depending on `KUBE-MARK-DROP`. (So we'd have two separate feature gates, a lot like the "Move `KUBE-MARK-DROP` to Kube-Proxy Rather Than Removing It" example under "Alternatives".)
LGTM except the ACCEPT questions.
Thanks for this, long overdue.
> ever seemed to care. Given the current state, in which only
> LoadBalancer traffic gets firewalled, the simplification from reusing
> the existing matching logic in the `filter` table seems to be
> outweighed by the complication of having to deal with
I'm 50/50 - having MARK-DROP in the nat chain puts all the logic in one place, which is nice. That said, I buy the REJECT consistency below
> of the chain, it calls `KUBE-MARK-DROP` on any unmatched packets.
>
> But there is no reason why the `FW` chain could not simply be moved to
> the `filter` table (with calls to `-j ACCEPT` for allowed source
currently, `KUBE-SERVICES` is called from `PREROUTING` and `OUTPUT`... these new filtering rules would have to go into the `filter` table in `INPUT`, `FORWARD`, and `OUTPUT`, right?
yes
meh, `nat` `PREROUTING` happens before anything in `filter`. One of these days I'll remember how iptables ordering works without having to google it...
@thockin didn't you have some kube-proxy iptables-processing-order infographic somewhere?
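Not the infographic, but a rough sketch of the ordering in question (simplified to just the hooks this thread cares about; not an exhaustive netfilter diagram):

```
         incoming packet
               |
        nat PREROUTING         <- KUBE-SERVICES runs here; DNAT rewrites
               |                  the destination before any filtering
        routing decision
          /           \
   filter INPUT   filter FORWARD   <- filter chains see the post-DNAT
          |             |             (pod) address, not the LB IP
   local process   nat POSTROUTING  <- SNAT/masquerade on the way out
```

So `nat PREROUTING` always runs before the `filter` chains for traffic arriving from off-node, which is exactly why a `filter`-table firewall rule can no longer match on the original service IP.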
🤔 we'd miss the service IP in the `filter` table; DNAT was already performed
Anyway, this worked right in the original proposal (keep the FW chain in NAT, fall through rather than KUBE-MARK-DROP at the end, have the KUBE-FIREWALL chain in filter drop any not-yet-DNAT-ed packets addressed to the LB IP). I changed it in part because of Tim's comment here but (a) as long as we keep doing the pod-to-LB-IP short-circuiting, the original proposal works, and (b) if we stop doing the short-circuiting in the future, we'd have to rewrite all the rules anyway, and we can just address this then.
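As a sketch of what that original proposal describes (chain suffixes and addresses are made-up placeholders, not actual generated rules):

```
# nat table: the FW chain jumps to the service chain for allowed
# sources, and simply falls through (no KUBE-MARK-DROP) otherwise
-A KUBE-FW-ABCD1234 -s 203.0.113.0/24 -j KUBE-SVC-ABCD1234
# (unmatched packets keep their original LB-IP destination)

# filter table: any packet still addressed to the LB IP was never
# DNAT-ed, i.e. it matched no allowed source, so drop it here
-A KUBE-FIREWALL -d 192.0.2.10/32 -j DROP
```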
> - `KUBE-MARK-DROP` and `KUBE-FIREWALL`
>
> `KUBE-MARK-DROP` marks packets as needing to be dropped by setting
> a _different_ configurable bit on the packet mark. `KUBE-FIREWALL`
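The quoted mechanism amounts to something like the following pair of rules (the mark bit is configurable; `0x8000` is shown as an illustrative value):

```
# nat table (kube-proxy): nat chains only see the first packet of a
# connection, so rather than dropping here, tag the packet's mark
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000

# filter table (kubelet's KUBE-FIREWALL): drop anything so tagged
-A KUBE-FIREWALL -m mark --mark 0x8000/0x8000 -j DROP
```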
`KUBE-FIREWALL` was also extended to deal with the `route_localnet` CVE: kubernetes/kubernetes#91569
Yeah, forgot about that.
So this is its own mess: 91569 only talks about NodePort, but the problem theoretically applies to HostPorts too. Except, it only applies if you set `route_localnet`, which kube-proxy does and kubelet doesn't... so for our own purposes we can move 91569's rule from kubelet to kube-proxy. (In fact, it should have been there originally, but `KUBE-FIREWALL` sounds like a good place to put it, and that's created from kubelet, so it got put there.)
But given that it was previously in kubelet, there are probably alternate proxies and hostport implementations that would be vulnerable if we didn't have this rule... so we may have to keep creating it...
We should probably have a KEP about eventually getting rid of localhost NodePort/HostPort behavior...
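For context, the rule from kubernetes/kubernetes#91569 looks approximately like this in kubelet today (paraphrased from memory; the exact comment and match options may differ):

```
# With route_localnet=1, packets arriving from off-node can claim a
# 127.0.0.0/8 destination and still be routed; drop any such "martian"
# traffic that is not part of an established or DNAT-ed connection
-A KUBE-FIREWALL ! -s 127.0.0.0/8 -d 127.0.0.0/8 \
    -m conntrack ! --ctstate RELATED,ESTABLISHED,DNAT -j DROP
```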
+1, but I suggest postponing one release to wait for the dockershim feedback; it seems to be causing more drama than expected, and adding more changes on top could complicate it further.
Current proposal in kep.yaml is alpha in 1.25, and we can decide in the 1.25 timeframe if that still makes sense. But we should merge soon so external components know what's coming.
I meant 1.25; I missed that detail.
Updated:
yeah, I didn't want to try to stack another iptables proxier change on top of what's already in 1.24...
@thockin can we merge this and then merge kubernetes/kubernetes#109059 to add the You had previously "mostly LGTM"ed it, and I added some "UNRESOLVED" sections making it explicit that the low-level details are still subject to change. But I'd like to get this as the plan of record so that we can (1) add
> kube-proxy. But because they are generally doing much less IPTables
> processing than kube-proxy does, they could fairly easily be rewritten
> to not use the packet mark, and just have slightly-redundant
> `PREROUTING` and `POSTROUTING` rules instead.
Yeah, we can fix CNI to not use a mark, though it will have a lot of duplicated logic.
More relevantly, once this goes out, I can reference the deprecation in whatever CNI release notes make sense. We only added that option so that users could save mark bits.
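A hedged sketch of what that mark-free rewrite could look like for a portmap-style plugin (all addresses, ports, and match details are illustrative, not the plugin's actual rules):

```
# Instead of one shared chain that sets a mark bit plus a single
# mark-matching MASQUERADE rule, duplicate the DNAT match per hook...
-t nat -A PREROUTING -p tcp --dport 8080 -j DNAT --to-destination 10.0.0.2:80
-t nat -A OUTPUT -m addrtype --dst-type LOCAL \
       -p tcp --dport 8080 -j DNAT --to-destination 10.0.0.2:80
# ...and masquerade hairpin (pod-to-its-own-mapped-port) traffic by
# matching addresses directly rather than a mark
-t nat -A POSTROUTING -s 10.0.0.2/32 -d 10.0.0.2/32 \
       -p tcp --dport 80 -j MASQUERADE
```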
> be the one creating the fix for it as well, and we should make
> kube-proxy create this filtering rule itself.
>
> However, it is possible that users of other proxy implementations (or
The CNI `portmap` plugin is also in this situation -- it enables `route_localnet` as well.
Thanks!
/lgtm
/approve
/assign @johnbelamaric
/lgtm
/approve
It's OK as is, but a few more hints for users to know whether this is causing issues would be nice.
> ###### What specific metrics should inform a rollback?
>
> Any failures would be the result of third-party components being
I understand that we can't predict what some other tool will do. But perhaps we could give some hints as to where folks could look. For example, what components (CNI, netpol implementation, etc.) are likely to have potential errors or failures?
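One concrete hint that could go in this section (a hypothetical diagnostic, not text from the KEP): third-party components that still reference the removed chains will show up as dangling jumps in the node's ruleset, e.g.

```
# on a node, look for third-party rules that still jump to the
# chains kubelet will no longer create
iptables-save | grep -E 'KUBE-MARK-DROP|KUBE-FIREWALL'
```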
> There is no simple way to do this because if the feature is working
> correctly there will be no difference in externally-visible behavior.
> (The generated iptables rules will be different, but the _effect_ of
> the generated iptables rules will be the same.)
This feature isn't really "used by workloads" so much as it is "configured by the administrator", so you can say something to that effect.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, johnbelamaric, thockin

The full list of commands accepted by this bot can be found here. The pull request process is described here.
by the way, I don't see this on the enhancements spreadsheet for 1.25, please add it there.
/retest
One-line PR description: Initial draft of KEP-3178, to clean up kubelet-vs-kube-proxy IPTables chain ownership
Issue link: Cleaning up IPTables Chain Ownership #3178
Other comments: (I'm filing this now but targeting it at 1.25 for alpha given that it requires changes to the same parts of the iptables proxy that there are already multiple other pending KEP PRs for.)