Conversation
@sameo @gnawux @laijs @bergwolf I have updated this PR but I am still not happy about the way the different mounts are handled through what we call … Please take a look at the PR during the next few days and tell me if something is wrong with this.
These are removed without any components replaced. But they will be replaced by Storage if my suggestion is accepted.
The runtime passes this storage config to the agent when starting the sandbox:

{
    "driver": "",
    "source": "sharedfs_tag",
    "fstype": "9p",
    "options": "nodev,trans=virtio",
    "mount_point": "/storage/sharedfs"
}

The agent then mounts it on "/storage/sharedfs".
When starting a container, the Storages could be … Then "/storage/abc/rootfs" and "/storage/xyz/_data" (notice the Docker-style "rootfs" and "_data") can be used in the oci.Spec for the container root and volumes. Without new components, it is hard to do such things with the oci.Spec alone.
half reviewed. I'll look into the remaining half later today.
agent.go (outdated):

}

// Connect gRPC socket.
conn, err := net.Dial(gRPCSockScheme, gRPCSockPath)
We might need to retry a bit here since the gRPC server might not be up ready to accept connections yet.
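If a retry did turn out to be necessary, a minimal sketch could look like the helper below; the attempt count and delay are arbitrary assumptions, and it needs the "net" and "time" imports.

// dialWithRetry retries the dial a few times in case the gRPC server
// is not yet accepting connections.
func dialWithRetry(scheme, path string) (net.Conn, error) {
	var conn net.Conn
	var err error
	for i := 0; i < 10; i++ {
		if conn, err = net.Dial(scheme, path); err == nil {
			return conn, nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return nil, err
}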
Here we dial into the gRPC socket. We know this socket is being listened on, since startGRPC() first listens on this socket before handing the listener to server.Serve() inside the goroutine. There is no reason the Dial would fail in that case.
yup, you are right.
agent.go (outdated):

	agentLog.Error(err)
	break
}
defer stream.Close()
This defer will keep the stream open until loopYamux returns. IOW, it introduces fd leaks for yamux connections.
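One way to avoid the leak, assuming the stream is only needed for the duration of the copy, is to close it when the copy goroutine finishes instead of deferring in the accept loop:

go func() {
	// Close the stream as soon as this copy is done, so its fd does
	// not live until loopYamux returns.
	defer stream.Close()
	io.Copy(grpcConn, stream)
}()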
agent.go (outdated):

go func() {
	io.Copy(grpcConn, stream)
}()
nit: this can be simplified to just go io.Copy(grpcConn, stream).
Yep
channel.go (outdated):

}

func isAFVSockSupported() (bool, error) {
	fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM, 0)
modprobe vsock first?
Why do you want to modprobe vsock ? Don't we want VSOCK to be in-kernel (guest kernel) if we support it ?
hmm, fine with me if vsock is built-in.
}

// Get the agent configuration from the kernel cmdline
func (c *agentConfig) getConfig(cmdLineFile string) error {
Ah, so we set the log level via kernel command lines. I cannot think of a better place to set it though. It seems this will become a secret ABI... Please document it explicitly somewhere loud.
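For reference, a minimal sketch of how such an option could be read from the kernel command line; the "agent.log=" key name is an assumption for illustration, and the helper needs the "io/ioutil" and "strings" imports.

// parseLogLevel scans the kernel cmdline (e.g. /proc/cmdline) for a
// hypothetical "agent.log=<level>" option and returns its value.
func parseLogLevel(cmdLineFile string) (string, error) {
	data, err := ioutil.ReadFile(cmdLineFile)
	if err != nil {
		return "", err
	}
	for _, field := range strings.Fields(string(data)) {
		if strings.HasPrefix(field, "agent.log=") {
			return strings.TrimPrefix(field, "agent.log="), nil
		}
	}
	return "", nil // option absent: caller keeps the default level
}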
agent.go (outdated):

var l net.Listener

if p.isVSock() {
	l, err = vsock.Listen(1024)
nit: constify the port number and make it part of agent API.
Ok
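For illustration, the suggested change could look like this (the constant name is an assumption):

// Name the port and expose it as part of the agent API instead of a
// magic number at the call site.
const vSockPort = 1024

l, err = vsock.Listen(vSockPort)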
channel.go (outdated):

type channelType string

const (
	serialChannelName = "agent.channel.0"
move this to somewhere like api.go?
Yes I can do that.
)

type agentConfig struct {
	logLevel logrus.Level
I assume we want to use some kind of wrapper for logrus/glog in the agent as well?
agent.go (outdated):

}

func (p *pod) initLogger() error {
	agentLog.Logger.Formatter = &logrus.JSONFormatter{TimestampFormat: time.RFC3339Nano}
Can we just use text format for the agent? Logs are most likely to be read by humans, and JSON is not very human-readable IMO.
Makes sense to me
Yes I agree.
agent.go (outdated):

const (
	gRPCSockPath   = "/var/run/kata-containers/grpc.sock"
	gRPCSockScheme = "unix"
I don't think these are needed. A unix socket scheme proxy is only needed on the host side, since the proxy and shim/runtime are different processes. The original design doesn't include it.

               |----runv------|----proxy-------|-------hyperstart
container 1 ---|--------\               /---
container 2 ---|--------->------|----<---- hyperstart actions
exec 1 --------|--------/               \---
    grpc unixsock    yamux serial    yamux grpc

(@gnawux had drawn a picture from this ascii)
yamux.Server()'s return value implements net.Listener, so it can be used directly with gRPC (grpcServer.Serve()).
However, I haven't tried it. Could you give it a try please?
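A sketch of what that could look like, assuming serialConn is the already-open serial connection (the function name is illustrative):

// hashicorp/yamux's *Session implements net.Listener, so it can be
// handed straight to grpcServer.Serve(); every accepted yamux stream
// then becomes a gRPC connection.
func serveOverYamux(serialConn net.Conn, grpcServer *grpc.Server) error {
	session, err := yamux.Server(serialConn, nil)
	if err != nil {
		return err
	}
	return grpcServer.Serve(session)
}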
@laijs good point. So basically we'd pass the yamux net.Listener to startGRPC() and have grpcServer.Serve() take it as its argument.
It would avoid all the frame copying between the yamux server and the gRPC one.
Oh yes good point, I haven't realized that. It's going to simplify the code.
grpc.go (outdated):

if err := setupDNS(a.pod.network.dns); err != nil {
	return emptyResp, err
}
A pod PID namespace might need to be created; hyperstart creates one. The pod PID namespace's init process acts as the infra container. Or should it become an option in the StartSandboxRequest config? @sameo @gnawux
One more question: when an infra container is requested by k8s, will the runtime ask the agent to create it?
Either way, the agent should be designed to be always free of the docker pid1 problem IMO.
@sboeuf That one is one PID namespace per container, which is the reason docker has the pid1 problem. We need to introduce a pod-level PID namespace, or at least make it configurable in StartSandbox().
@laijs we have a running agent in each pod, so I don't think the k8s pause container is necessary currently. But I'm not sure whether the k8s community will use it to do something other than zombie process recycling in the future.
@sameo Not really, the agent (not systemd nor sandbox_pidns nor any additional subreaper like containerd-shim) should be the subreaper of all the containers, processes and possible helper processes. If the agent runs as the VM's init process, like hyperstart, "agent as subreaper" is already done; on the other hand, when the agent is not the VM's init process, some code needs to be added.
@sboeuf the implementation for sandbox_pidns can be postponed. Any other feature/behavior can also be delayed to later PRs when it is complicated or under discussion. An issue might need to be created to track the unimplemented feature/behavior.
grpc.go (outdated):

winSize, err := unix.IoctlGetWinsize(fd, unix.TIOCGWINSZ)
if err != nil {
	return emptyResp, err
}
I don't think we need to get the winsz at first, for we are going to modify all the values.

TIOCGWINSZ    struct winsize *argp          Get window size.
TIOCSWINSZ    const struct winsize *argp    Set window size.

The struct used by these ioctls is defined as:

struct winsize {
    unsigned short ws_row;
    unsigned short ws_col;
    unsigned short ws_xpixel;   /* unused */
    unsigned short ws_ypixel;   /* unused */
};
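With golang.org/x/sys/unix, setting the size directly (rows and cols assumed to come from the resize request) could look like:

// Build the full winsize from the requested values and set it in a
// single ioctl; no need to read the current size first.
ws := &unix.Winsize{Row: rows, Col: cols}
if err := unix.IoctlSetWinsize(fd, unix.TIOCSWINSZ, ws); err != nil {
	return emptyResp, err
}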
Well yes, if the x and y pixel fields are unused, I agree there is no need to get the window size before setting it.
@sameo any thoughts about this?
Agreed, you don't need to get winsize.
Yeah I forgot to remove that, it's going to be fixed in the next PR version.
agent.go (outdated):

if err = p.startYamux(); err != nil {
	return
}
libcontainer relies on cgroups. Do we need to setup cgroups mounts here as well?
@bergwolf here is the cgroups definition for the current container: https://github.com/kata-containers/agent/pull/1/files#diff-727ccd4217b7f335e689be03bfb69e9bR324
We need to provide any cgroup we want through the OCI Spec we're passing to the agent.
Let me rephrase my question - does libcontainer mount cgroup mountpoints if any of the cgroup configs are present in the OCI spec? If not, the agent should mount them before calling into libcontainer.
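If libcontainer does not, a minimal sketch of what the agent could do first (the subsystem list is illustrative, and a tmpfs at /sys/fs/cgroup is assumed to exist already):

// Pre-mount a few cgroup v1 hierarchies before calling into
// libcontainer; needs "os", "path/filepath" and "golang.org/x/sys/unix".
func mountCgroups() error {
	for _, sub := range []string{"cpu", "memory", "pids"} {
		path := filepath.Join("/sys/fs/cgroup", sub)
		if err := os.MkdirAll(path, 0755); err != nil {
			return err
		}
		if err := unix.Mount("cgroup", path, "cgroup", 0, sub); err != nil {
			return err
		}
	}
	return nil
}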
mount.go (outdated):

	return nil
}

u := goudev.Udev{}
So the agent relies on libudev, eek. I'm OK with it but please make the function generic as we need to wait for nic hotplug as well.
Ok I'll try to make this better.
grpc.go (outdated):

a.pod.Lock()
for key, c := range a.pod.containers {
	if err := c.removeContainer(); err != nil {
		return emptyResp, err
If the runtime wants to destroy the sandbox, the agent should skip errors and continue to clean up as much as possible before returning, IMHO.
Or, if we do not care that much, simply kill all processes, sync all dirty page cache and return. The VM is about to be brought down soon, so any non-persistent data does not matter.
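A sketch of the best-effort variant of the loop quoted above: remember the first failure but keep cleaning up instead of returning immediately.

// Keep tearing containers down on error; report the first failure
// only after every container has been attempted.
var firstErr error
for _, c := range a.pod.containers {
	if err := c.removeContainer(); err != nil && firstErr == nil {
		firstErr = err
	}
}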
I want to make sure we can properly tear down a pod and its setup. This will avoid any issues related to unmounted directories, network configuration or other.
Also, I think we might want to support (in the future) a RestartSandbox(). In that case, we would save the launch time of the VM and it would result in a faster accessible pod.
Fair enough. Just want to remind that if there are any issues with unmounted directories, network configuration or other, DestroySandbox() returns an error, and the only option the runtime has is to shut down the VM anyway.
BTW, it seems c.removeContainer() does not kill/stop the container but just waits endlessly. Then DestroySandbox() possibly becomes an API running forever?
agent.go (outdated):

go func() {
	io.Copy(stream, grpcConn)
}()
As @laijs is highlighting, this may not be needed if you can pass the yamux listener directly to the grpc.Serve() routine.
Yes agreed
channel.go (outdated):

type channel interface {
	getType() channelType
}
So I think we should have a different channel interface:

type channel interface {
	open() error
	close() error
	listen() (net.Listener, error)
}

With that interface:

- I would implement a newChannel() (channel, error) function that would implement the entire logic of checking if we are going to do vsock or serial, and respectively return the vsock or serial implementation.
- The func (p *pod) initChannel() error would become something like:

func (p *pod) initChannel() error {
	c, err := newChannel()
	if err != nil {
		return err
	}

	p.channel = c

	return p.channel.open()
}

- The serial open() implementation would create the yamux server on its serial path and store the yamux session.
- The serial listen() implementation would simply return the yamux session.
- func (p *pod) startGRPC() (err error) would be simplified to:

func (p *pod) startGRPC() error {
	l, err := p.channel.listen()
	if err != nil {
		return err
	}

	grpcImpl := &agentGRPC{
		pod: p,
	}

	grpcServer := grpc.NewServer()
	pb.RegisterHyperstartServiceServer(grpcServer, grpcImpl)

	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		grpcServer.Serve(l)
	}()

	return nil
}

And so on...
I think the code would be simplified and more readable overall.
mount.go (outdated):

	"time"

	pb "github.com/hyperhq/runv/hyperstart/api/grpc"
	goudev "github.com/jochenvg/go-udev"
That one is a cgo wrapper around libudev. We can live with that for now, but we have to keep in mind that this makes our binary no longer static and brings in glibc, libpthread and libudev dependencies.
And what we need from it is basically a netlink socket and monitor from it, so we probably should consider adding UEVENT support to github.com/vishvananda/netlink/. The genetlink support is already there, so it should not be a huge task (famous last words..)
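For reference, the raw building block is small; a sketch of opening a kernel uevent netlink socket with golang.org/x/sys/unix (error handling trimmed, function name hypothetical):

// Open and bind a NETLINK_KOBJECT_UEVENT socket; reading from fd then
// yields the same uevents libudev is built on.
func ueventSocket() (int, error) {
	fd, err := unix.Socket(unix.AF_NETLINK, unix.SOCK_RAW, unix.NETLINK_KOBJECT_UEVENT)
	if err != nil {
		return -1, err
	}
	sa := &unix.SockaddrNetlink{
		Family: unix.AF_NETLINK,
		Groups: 1, // kernel uevent multicast group
	}
	if err := unix.Bind(fd, sa); err != nil {
		unix.Close(fd)
		return -1, err
	}
	return fd, nil
}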
I would love that. This would be so much cleaner. I can take a look at this in parallel, so that we can improve pretty quickly.
defaultMountFlags := syscall.MS_NOEXEC | syscall.MS_NOSUID | syscall.MS_NODEV

config := configs.Config{
All the configs here should be replaced by the config from the request, or merged with it, if the request has the corresponding config.
Agreed with @laijs here. You probably want to convert the req.OCI structure into an actual oci.OCI and use as much as possible from it.
@sameo @laijs, if I understand correctly, you mean that we should expect everything to be provided by req.OCI? Are you sure we don't need some default values in the agent?
IMO, the agent can have its default values for the container, but the default values can be overwritten by the requested values. And the runtime is also encouraged to pass more config even where the agent has a default value. For example, the runtime is encouraged to pass the procfs settings of the container to the agent.
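A sketch of that merge policy for mounts (the helper name and the specs-go alias are assumptions): requested entries win over defaults on destination conflicts.

// Merge default mounts with the ones from the request; a requested
// mount wins when both target the same destination. specs is
// github.com/opencontainers/runtime-spec/specs-go.
func mergeMounts(defaults, requested []specs.Mount) []specs.Mount {
	out := append([]specs.Mount{}, requested...)
	seen := make(map[string]bool)
	for _, m := range requested {
		seen[m.Destination] = true
	}
	for _, m := range defaults {
		if !seen[m.Destination] {
			out = append(out, m)
		}
	}
	return out
}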
@laijs Makes sense, I'll do that.
I prefer the name Sandbox rather than Pod.
agent.go (outdated):

agentLog.Logger.Formatter = &logrus.JSONFormatter{TimestampFormat: time.RFC3339Nano}

config := newConfig(defaultLogLevel)
if err := config.getConfig(kernelCmdlineFile); err != nil {
It looks quite strange to get the log config from kernelCmdlineFile, though it seems to work correctly.
PR updated.
Besides the fact that it needs to be rebased on top of the latest gRPC changes, this looks pretty good. The code is much cleaner and easier to read now.
I have rebased on the latest gRPC. I have also fixed the way I handle the storage mounts.
mount.go (outdated):

	return syscall.Unmount(mount, 0)
}

func waitForBlockDevice(devicePath string) error {
Rename this to waitForDevice? The code is generic now and the name should be generic as well.
Yes, I'll do that!
mount.go (outdated):

	return nil
}

fieldLogger.Info("Started listening for uevents for block device hotplug")
And update the log entry accordingly: no more "block".
Yes, I'll do that!
Except for the minor nits above, the PR looks in good shape and I'm giving my LGTM.
@bergwolf thx for the LGTM, I will work today (and maybe tomorrow) on fixing the last parts we have been talking about, and I expect something to be ready by the end of the week!
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This first commit creates the agent's overall functioning. This code initializes the connection through the virtual machine, connected either as a serial port or as a virtual socket. Scanning the system determines which type of connection should be used. In the case of vsock, the connection is pretty simple since the gRPC server listens on it directly. In the case of a serial connection, we introduce an extra layer using yamux in order to demux the data coming from the proxy. After the data is decapsulated from the yamux encapsulation, it is passed to the gRPC server. Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit relies on the logrus package to give us better logs. Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The agent will need to be able to listen to uevents in order to know when a device hotplugged from the host becomes available in the guest. Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit implements all functions related to gRPC interface defined in the protobuf file. Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We can handle the shared pidns case in a separate PR.
I think the only thing missing from this PR is the s/pod/sandbox/ as requested by @laijs
Because "sandbox" is a more generic term than "pod" which refers to a k8s term, this commit replaces every "pod" with "sandbox". Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
I have updated the PR today. The things missing are: …
Since there are two LGTMs, I'm going to merge it and let other contributions in.
do not merge - only to check move to go 1.10 works note: make issue number Fixes: kata-containers#1 Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>
Check if pointers are not nil. todo: fix commit Fixes: kata-containers#1 Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>
If the container is not fully created, don't add it to the container list. Fixes: kata-containers#1 Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>
virtio-mmio: Add support for virtio-mmio blk devices
This PR creates the first implementation of the kata-containers agent, relying on both the yamux and gRPC protocols.