
KEP-3178: Cleaning up IPTables Chain Ownership #3179

Merged (1 commit, merged Jun 10, 2022)

Conversation

danwinship (Contributor):

  • One-line PR description: Initial draft of KEP-3178, to clean up kubelet-vs-kube-proxy IPTables chain ownership

  • Issue link: Cleaning up IPTables Chain Ownership #3178

  • Other comments: (I'm filing this now but targeting it at 1.25 for alpha, given that it requires changes to the same parts of the iptables proxy for which there are already multiple other pending KEP PRs.)

@k8s-ci-robot added labels: cncf-cla: yes, size/XL (Jan 23, 2022)
@k8s-ci-robot added labels: kind/kep, sig/network (Jan 23, 2022)
aojea (Member) commented Feb 4, 2022:

/cc

Review comments on keps/sig-network/3178-iptables-cleanup/README.md (outdated threads, resolved):
post.)

Kube-proxy will be updated to not use `KUBE-MARK-DROP`, as described
above. (This change is unconditional; it is _not_ feature-gated.) We
Member:
why no gate? What happens if we've done something disastrously dumb - how can a cluster back out of this change?

danwinship (Contributor, Author), Feb 17, 2022:

Since this just changes kube-proxy's internals without actually changing the visible functionality, I was considering it a cleanup/bugfix, not a feature. (So if we do something disastrously dumb, and somehow don't catch it, then affected users would have to downgrade, just like with any other major bug.)

If we feature-gate the kube-proxy change, then we'd have to push the rest of the schedule out by two releases, because you can't have the kubelet feature gate go to beta until 2 releases after kube-proxy stops depending on KUBE-MARK-DROP. (So we'd have two separate feature gates, a lot like the "Move KUBE-MARK-DROP to Kube-Proxy Rather Than Removing It" example under "Alternatives".)
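For context on the mechanism being removed, here is a minimal sketch of the kubelet-owned mark-and-drop pair, assuming the default 0x8000 drop-mark bit (the bit is configurable, so treat this as illustrative rather than the exact rules on any given cluster):

```sh
# Sketch of the existing mark-and-drop mechanism (illustrative; 0x8000
# is the default, configurable, drop-mark bit).

# nat table: rules that decide during DNAT processing that a packet
# should die can't conveniently drop it there, so KUBE-MARK-DROP just
# records the decision in the packet mark.
iptables -t nat -N KUBE-MARK-DROP
iptables -t nat -A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000

# filter table: KUBE-FIREWALL (reached from INPUT and OUTPUT) performs
# the actual drop on marked packets.
iptables -t filter -N KUBE-FIREWALL
iptables -t filter -A KUBE-FIREWALL -m mark --mark 0x8000/0x8000 -j DROP
```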

thockin (Member) left a comment:
LGTM except the ACCEPT questions.

Thanks for this, long overdue

ever seemed to care. Given the current state, in which only
LoadBalancer traffic gets firewalled, the simplification from reusing
the existing matching logic in the `filter` table seems to be
outweighed by the complication of having to deal with
Member:
I'm 50/50 - having MARK-DROP in the nat chain puts all the logic in one place, which is nice. That said, I buy the REJECT consistency below

of the chain, it calls `KUBE-MARK-DROP` on any unmatched packets.

But there is no reason why the `FW` chain could not simply be moved to
the `filter` table (with calls to `-j ACCEPT` for allowed source
Member:
currently, KUBE-SERVICES is called from PREROUTING and OUTPUT ... these new filtering rules have to go into the filter table's INPUT, FORWARD, and OUTPUT chains, right?

danwinship (Contributor, Author):

yes

danwinship (Contributor, Author):

meh, nat PREROUTING happens before anything in filter. One of these days I'll remember how iptables ordering works without having to google it...

@thockin didn't you have some kube-proxy iptables-processing-order infographic somewhere?

aojea (Member), Feb 24, 2022:

I have this printed
[image: printed diagram of kube-proxy's iptables packet-processing order]
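For readers without the diagram, the standard netfilter traversal order (generic Linux behavior, not kube-proxy specific; simplified, omitting the security table and reroute checks) is roughly:

```sh
# Simplified netfilter hook/table order:
#
#   incoming packet:
#     raw PREROUTING -> mangle PREROUTING -> nat PREROUTING (DNAT)
#       -> routing decision
#            local:     mangle INPUT -> filter INPUT -> local process
#            forwarded: mangle FORWARD -> filter FORWARD
#   locally generated packet:
#     raw OUTPUT -> mangle OUTPUT -> nat OUTPUT (DNAT) -> filter OUTPUT
#   every outgoing packet, finally:
#     mangle POSTROUTING -> nat POSTROUTING (SNAT/MASQUERADE)
#
# So nat PREROUTING (where KUBE-SERVICES does DNAT) always runs before
# filter INPUT/FORWARD, which is why filter-table rules see the
# post-DNAT destination rather than the original service IP.
```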

Member:
🤔 we'd miss the service IP in the filter table; DNAT has already been performed by that point

danwinship (Contributor, Author):

Anyway, this worked right in the original proposal (keep the FW chain in NAT, fall through rather than KUBE-MARK-DROP at the end, have the KUBE-FIREWALL chain in filter drop any not-yet-DNAT-ed packets addressed to the LB IP). I changed it in part because of Tim's comment here but (a) as long as we keep doing the pod-to-LB-IP short-circuiting, the original proposal works, and (b) if we stop doing the short-circuiting in the future, we'd have to rewrite all the rules anyway, and we can just address this then.
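A minimal sketch of that original arrangement, using assumed placeholder names (the `KUBE-FW-XXX`/`KUBE-SVC-XXX`/`KUBE-PROXY-FIREWALL` chain names, the 203.0.113.0/24 source range, and the 1.2.3.4 LB IP are all illustrative, not the KEP's literal rules):

```sh
# Sketch of the original proposal (placeholder names and addresses).

# nat table: the per-service firewall chain jumps allowed sources on to
# the service chain (where DNAT happens) and, instead of ending with
# KUBE-MARK-DROP, simply lets unmatched packets fall through un-DNATted.
iptables -t nat -A KUBE-FW-XXX -s 203.0.113.0/24 -j KUBE-SVC-XXX

# filter table: any packet still addressed to the LB IP at this point
# was never DNATted, i.e. it failed the source check above, so drop it.
iptables -t filter -A KUBE-PROXY-FIREWALL -d 1.2.3.4 -p tcp --dport 80 -j DROP
```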

- `KUBE-MARK-DROP` and `KUBE-FIREWALL`

`KUBE-MARK-DROP` marks packets as needing to be dropped by setting
a _different_ configurable bit on the packet mark. `KUBE-FIREWALL`
Member:

KUBE-FIREWALL was also extended to deal with the route_localnet CVE
kubernetes/kubernetes#91569

danwinship (Contributor, Author):

Yeah, forgot about that.

So this is its own mess: 91569 only talks about NodePort, but the problem theoretically applies to HostPorts too. Except, it only applies if you set route_localnet, which kube-proxy does and kubelet doesn't... so for our own purposes we can move 91569's rule from kubelet to kube-proxy. (In fact, it should have been there originally, but KUBE-FIREWALL sounds like a good place to put it, and that's created from kubelet, so it got put there.)

But given that it was previously in kubelet, there are probably alternate proxies and hostport implementations that would be vulnerable if we didn't have this rule... so we may have to keep creating it...

We should probably have a KEP about eventually getting rid of localhost NodePort/HostPort behavior...
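For reference, the protection added in kubernetes/kubernetes#91569 is, as best I recall, a filter-table rule along these lines (a hedged reconstruction; check the kubelet source for the exact rule):

```sh
# Approximate reconstruction of the route_localnet protection rule.
# With route_localnet=1, remote packets addressed to 127.0.0.0/8 would
# otherwise be delivered; drop any such "martian" packet unless
# conntrack says it belongs to a DNATted (e.g. NodePort) or already
# established connection.
iptables -t filter -A KUBE-FIREWALL \
    ! -s 127.0.0.0/8 -d 127.0.0.0/8 \
    -m conntrack ! --ctstate RELATED,ESTABLISHED,DNAT \
    -j DROP
```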

aojea (Member) commented Feb 24, 2022:

+1, but I suggest postponing this one release to wait for the dockershim feedback; it seems to be causing more drama than expected, and adding more changes on top could complicate things further.

danwinship (Contributor, Author):

> +1, but I suggest postponing this one release to wait for the dockershim feedback; it seems to be causing more drama than expected, and adding more changes on top could complicate things further.

Current proposal in kep.yaml is alpha in 1.25, and we can decide in the 1.25 timeframe if that still makes sense. But we should merge soon so external components know what's coming.

aojea (Member) commented Feb 24, 2022:

> +1, but I suggest postponing this one release to wait for the dockershim feedback; it seems to be causing more drama than expected, and adding more changes on top could complicate things further.
>
> Current proposal in kep.yaml is alpha in 1.25, and we can decide in the 1.25 timeframe if that still makes sense. But we should merge soon so external components know what's coming.

I meant 1.25; I missed that detail.
lgtm

danwinship (Contributor, Author):

Updated:

  • squashed, removed UNRESOLVED section
  • reverted to the original LoadBalancerServices proposal, which fixes the iptables-ordering problem and resolves Tim's objection to splitting the service logic between nat and filter.
  • explained the martian packet problem and clarified that we should probably have kubelet keep creating the rule protecting against that, even though it's not kubelet's problem 🙁

> I meant 1.25; I missed that detail.

yeah, I didn't want to try to stack another iptables proxier change on top of what's already in 1.24...

danwinship (Contributor, Author):

@thockin can we merge this and then merge kubernetes/kubernetes#109059 to add the KUBE-IPTABLES-HINT chain for 1.24?

You had previously "mostly LGTM"ed it, and I added some "UNRESOLVED" sections making it explicit that the low-level details are still subject to change. But I'd like to get this as the plan of record so that we can (1) add KUBE-IPTABLES-HINT in 1.24, and (2) do a blog post about the upcoming iptables rules reorg and what third-party components need to know (which I've already started writing).
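For third-party components reading along: KUBE-IPTABLES-HINT (per kubernetes/kubernetes#109059) is a marker chain created in the mangle table so containerized components can tell which iptables backend the host uses. A rough, illustrative detection sketch, not code from that PR:

```sh
# Illustrative backend detection using the hint chain (sketch):
# kubelet creates KUBE-IPTABLES-HINT in the mangle table via the host's
# configured iptables mode, so probe both binaries for it.
if iptables-nft -t mangle -nL KUBE-IPTABLES-HINT >/dev/null 2>&1; then
    mode=nft
elif iptables-legacy -t mangle -nL KUBE-IPTABLES-HINT >/dev/null 2>&1; then
    mode=legacy
else
    mode=unknown    # hint chain absent; fall back to other heuristics
fi
echo "host iptables mode: ${mode}"
```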

kube-proxy. But because they are generally doing much less IPTables
processing than kube-proxy does, they could fairly easily be rewritten
to not use the packet mark, and just have slightly-redundant
`PREROUTING` and `POSTROUTING` rules instead.
Comment:

Yeah, we can fix CNI to not use a mark, though it will have a lot of duplicated logic.

Comment:

More relevantly, once this goes out, I can reference the deprecation in whatever CNI release notes make sense. We only added that option so that users could save mark bits.
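As a hedged illustration of the "slightly-redundant rules" alternative, a hostport-style DNAT could re-match its own traffic in POSTROUTING via conntrack's original-destination matcher instead of a mark bit (placeholder addresses and ports; this is not the actual portmap plugin code):

```sh
# Hypothetical hostport rules without a packet mark (placeholders).

# nat PREROUTING: DNAT host port 8080 to the pod.
iptables -t nat -A PREROUTING -p tcp --dport 8080 \
    -j DNAT --to-destination 10.244.1.5:80

# nat POSTROUTING: identify the same connections again, redundantly, by
# their conntrack *original* destination port, rather than by a mark
# set in PREROUTING, and masquerade them (needed for hairpin traffic).
iptables -t nat -A POSTROUTING -p tcp -d 10.244.1.5 \
    -m conntrack --ctorigdstport 8080 \
    -j MASQUERADE
```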

be the one creating the fix for it as well, and we should make
kube-proxy create this filtering rule itself.

However, it is possible that users of other proxy implementations (or
Comment:

The CNI portmap plugin is also in this situation -- it enables route_localnet as well.

thockin (Member) left a comment:

Thanks!

/lgtm
/approve

@k8s-ci-robot added the lgtm label (May 28, 2022)
@k8s-ci-robot removed the lgtm label (May 28, 2022)
danwinship (Contributor, Author):

/assign @johnbelamaric
for PRR approval
(and can you /lgtm too because I just had to re-push with the prod-readiness file filled in, but it's otherwise unchanged from the version Tim lgtm'ed)

aojea (Member) commented Jun 2, 2022:

/lgtm

@k8s-ci-robot added the lgtm label (Jun 2, 2022)
johnbelamaric (Member) left a comment:

/approve

It's OK as is, but a few more hints to help users tell whether this is causing issues would be nice.


###### What specific metrics should inform a rollback?

Any failures would be the result of third-party components being
Member:
I understand that we can't predict what some other tool will do. But perhaps we could give some hints as to where folks could look. For example, what components (CNI, netpol implementation, etc.) are likely to have potential errors or failures?

There is no simple way to do this because if the feature is working
correctly there will be no difference in externally-visible behavior.
(The generated iptables rules will be different, but the _effect_ of
the generated iptables rules will be the same.)
Member:

This feature isn't really "used by workloads" so much as it is "configured by the administrator", so you can say something to that effect.

k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, johnbelamaric, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Jun 9, 2022)
johnbelamaric (Member):

by the way, I don't see this on the enhancements spreadsheet for 1.25, please add it there.

danwinship (Contributor, Author):

/retest

@k8s-ci-robot merged commit 27ddd8f into kubernetes:master (Jun 10, 2022)
@k8s-ci-robot added this to the v1.25 milestone (Jun 10, 2022)
@danwinship deleted the iptables-cleanup branch (June 13, 2022 14:36)