[DRAFT] Implement CNI Status #114

Draft · wants to merge 2 commits into main
Conversation

MikeZappa87 (Contributor) commented Feb 26, 2024:

The recent CNI specification added a couple of new verbs. This PR implements Status; GC and the other verbs will be added in future PRs.

https://github.com/containernetworking/cni/blob/main/SPEC.md

cni.go Outdated
```diff
-	// Status checks the status of the cni initialization
-	Status() error
+	// Status executes the status verb of the cni plugin
+	Status(ctx context.Context) error
```
Contributor:
Great, the concept already existed.

Contributor Author (MikeZappa87):

Yes. Status detected whether the minimum number of networks had been loaded. In the containerd cri-plugin, this is hard-coded to 2 networks: the loopback and the default CNI network configuration. I renamed the existing method to 'Ready' and have Status call the status function from libcni.
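
Roughly, the split described above could look like this (a sketch against go-cni's internals; the field names, the locking, and the libcni entry point `GetStatusNetworkList` are assumptions here, not lines from this diff):

```go
// Sketch only. Ready keeps the old initialization check; Status now
// drives the CNI STATUS verb through libcni for each loaded network.

// Ready reports whether the minimum number of networks has been loaded
// (the containerd cri-plugin hard-codes this to 2: loopback + default).
func (c *libcni) Ready() error {
	c.RLock()
	defer c.RUnlock()
	if len(c.networks) < c.networkCount { // assumed field names
		return ErrCNINotInitialized
	}
	return nil
}

// Status asks every loaded network's plugin for its STATUS result.
func (c *libcni) Status(ctx context.Context) error {
	c.RLock()
	defer c.RUnlock()
	for _, network := range c.networks {
		// GetStatusNetworkList is assumed to be libcni's STATUS call.
		if err := network.cni.GetStatusNetworkList(ctx, network.config); err != nil {
			return err
		}
	}
	return nil
}
```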

cni.go Outdated
```go
	}

	for _, network := range c.Networks() {
		err = network.Status(ctx)
```
Contributor:

This is something you may need to document, as it means that if any of the networks is not ready, the status is not ready.

You may want the opposite: if just one network is ready, then status is ready 🤷

Contributor Author (MikeZappa87) commented Feb 26, 2024:

For the cri-plugin case, we load two networks. One is the loopback plugin, loaded separately (`func WithLoNetwork(c *libcni) error {`, go-cni/opts.go line 88 at 6603d5b). The other 'network' is the first CNI network configuration in /etc/cni/net.d.

https://github.com/containerd/containerd/blob/7f0f49b43825b76e0271ed0102bf7df560826046/internal/cri/server/service_linux.go#L113

Since this is hard-coded in containerd, we probably don't need to worry about this. However, we could have a configurable failure behavior: if one network fails they all fail, or if one is successful the node gets the network-ready condition set. We could also return a structure that carries the CNI network name and error, and have Status return a slice of that structure.
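
The slice-of-results idea could look something like this (purely illustrative; the type and field names are hypothetical, and the diff later in this thread settles on a slightly different shape):

```go
// Hypothetical per-network result so that callers can choose their own
// readiness policy (AND, OR, or something configurable).
type networkStatusResult struct {
	Name string // CNI network name from the configuration list
	Err  error  // nil when the plugin reported STATUS success
}

// statusAll collects one result per network instead of failing fast.
func (c *libcni) statusAll(ctx context.Context) []networkStatusResult {
	c.RLock()
	defer c.RUnlock()
	out := make([]networkStatusResult, 0, len(c.networks))
	for _, network := range c.networks {
		out = append(out, networkStatusResult{
			Name: network.config.Name,
			Err:  network.cni.GetStatusNetworkList(ctx, network.config),
		})
	}
	return out
}
```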

Contributor Author (MikeZappa87):

@mikebrow thoughts?

Contributor:

If one is the loopback plugin, which is always going to be ready and cannot provide connectivity by itself, then it makes sense for ready to be an AND of all plugins instead of an OR: you need the loopback AND the other CNI plugin to provide networking to the pods.

Contributor Author (MikeZappa87):

I was wondering if this should be left up to the caller of the Status function? For the containerd cri-plugin I think AND makes sense; however, go-cni is used by others as well, so we probably shouldn't dictate behavior.

Member:

It should be AND until we have a config or parameter from the client indicating "optional"
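
A sketch of that rule, reusing the hypothetical per-network results from above; the `optional` set stands in for the client-provided config or parameter, which does not exist yet:

```go
// Readiness is the AND of all networks, except those the client has
// explicitly marked optional (a hypothetical knob, not in go-cni today).
func aggregateStatus(results []networkStatusResult, optional map[string]bool) error {
	var errs []error
	for _, r := range results {
		if r.Err != nil && !optional[r.Name] {
			errs = append(errs, fmt.Errorf("network %q: %w", r.Name, r.Err))
		}
	}
	// errors.Join returns nil only when every required network is ready.
	return errors.Join(errs...)
}
```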

@MikeZappa87 MikeZappa87 changed the title WIP: Implement Status WIP: Implement CNI Status Feb 27, 2024
MikeZappa87 (Contributor Author):

@aojea Since the kubelet calls the CRI-API Status RPC: https://github.com/containerd/containerd/blob/6333db77015764f7d45b407094a3bc2cc05c239b/internal/cri/server/status.go#L49

Kubelet: https://github.com/kubernetes/kubernetes/blob/3718932d9a3ea459217ca11efe1ae9ead6aa92c1/pkg/kubelet/kubelet.go#L2886

If Status returns an error, the node will be set to not ready. I did notice that after about 2 hours, the pods with a netns started going to Completed. It might have been my test setup; however, I think we should discuss.

mikebrow (Member) left a comment:

let's do it with a v2 api.. since v2 is not compatible with v1...

```diff
-	// Status checks the status of the cni initialization
-	Status() error
+	// Status executes the status verb of the cni plugin
+	Status(ctx context.Context) ([]*NetworkStatus, error)
```
mikebrow (Member) commented Mar 27, 2024:

StatusDetail(ctx)...?
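
That could keep the v1 surface intact while still exposing detail, e.g. (sketch; `StatusDetail` is only a suggestion at this point, and the real go-cni CNI interface has many more methods):

```go
// Sketch: Status stays error-only for v1 compatibility, while a
// separate StatusDetail exposes per-network results to callers.
type CNI interface {
	// Status returns nil only when every loaded network reports ready.
	Status(ctx context.Context) error
	// StatusDetail returns each network's individual STATUS result.
	StatusDetail(ctx context.Context) ([]*NetworkStatus, error)
	// ... existing Setup/Remove/etc. methods elided ...
}
```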

```go
	}

	return networks, nil
}
```
Member:

check.. EOL at EOF

```go
type NetworkStatus struct {
	Network *Network
	Status  error
}
```
Member:

...
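
Caller-side (e.g. the cri-plugin), consuming that slice might look like this (illustrative; `StatusDetail` is the hypothetical method name from the sketch above):

```go
// Illustrative policy: the node is network-ready only when every
// network's STATUS succeeded (the AND discussed earlier in the thread).
statuses, err := netPlugin.StatusDetail(ctx) // hypothetical v2 method
if err != nil {
	return err
}
for _, s := range statuses {
	if s.Status != nil {
		// s.Network identifies which CNI network failed its STATUS check.
		return fmt.Errorf("CNI network not ready: %w", s.Status)
	}
}
return nil
```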

@MikeZappa87 MikeZappa87 changed the title WIP: Implement CNI Status [DRAFT] Implement CNI Status Apr 15, 2024
architkulkarni:

Hi @MikeZappa87, do you plan to work on this? If not, I can try to work on it as a new contributor.

MikeZappa87 (Contributor Author):

> Hi @MikeZappa87, do you plan to work on this? If not, I can try to work on it as a new contributor.

Hello, I need to finish this up. I started it, and we needed to get a few PRs merged before finishing. If you want to take it on, feel free! :-)

architkulkarni:

Thanks, I'll give it a shot!

architkulkarni commented Jul 15, 2024:

Hi @MikeZappa87, a couple of basic questions for you if you have time:

  • What's the v2 API, is there a reference somewhere?
  • (Maybe related) The current draft PR changes the signature of Status(), and you mentioned above that kubelet calls this function already. How do we handle backwards compatibility for things like this?

More generally, it would be helpful to know what work is remaining and if we need to wait for any other PRs to be merged first. Thanks!

aojea (Contributor) commented Aug 27, 2024:

> @aojea Since the kubelet calls the CRI-API Status RPC: https://github.com/containerd/containerd/blob/6333db77015764f7d45b407094a3bc2cc05c239b/internal/cri/server/status.go#L49
>
> Kubelet: https://github.com/kubernetes/kubernetes/blob/3718932d9a3ea459217ca11efe1ae9ead6aa92c1/pkg/kubelet/kubelet.go#L2886
>
> If Status returns an error, the node will be set to not ready. I did notice that after about 2 hours, the pods with a netns started going to Completed. It might have been my test setup; however, I think we should discuss.

That is the bug we want to fix, though CNI implementations need to implement CNI STATUS wisely: with great power comes great responsibility.

architkulkarni:

> That is the bug we want to fix, though CNI implementations need to implement CNI STATUS wisely: with great power comes great responsibility.

@aojea could you say a bit more about what the "bug" here is? Is it that "the pods with netns started going to completed" would be a kubelet bug?

squeed (Contributor) commented Sep 9, 2024:

@MikeZappa87 can we wake this back up?

MikeZappa87 (Contributor Author):

@squeed yes we can!
