nsenter: major cleanups #950

cyphar · 2016-07-18T14:43:41Z

This PR makes the code much easier to read and removes redundant code. I also made setns(2) fail if we pass the wrong path to the bootstrap process (so that bugs in runC's namespace sending are fatal rather than silently ignored).

In addition, I switched pr_perror to bail, which makes the process exit with a distinct error code. We still aren't handling syncT properly within the bootstrap process, but this is a good stopgap.

There are also lots of other fixes that make the code more robust, easier to read and much nicer to maintain. Some of this code comes from the rootless containers PR #774.

TODO:

Make errors distinctive (still not perfect but good enough for now).
Simplify netlink parsing code.
❌ Refactor pipes within nsenter and with bootstrap process. (not relevant here, there's only one pipe for bootstrap and one for the inter-process stuff).

There are other cleanup parts in #975 and #976 and #977.

Signed-off-by: Aleksa Sarai <asarai@suse.de>

mlaventure · 2016-07-18T16:06:56Z

libcontainer/nsenter/nsexec.c

+ * List of netlink message types sent to us as part of bootstrapping the init.
+ * These constants are defined in libcontainer/message_linux.go.
+ */
+#define INIT_MSG			62000


nit: some alignment issues here

Fixed. I set my editor to a tabsize of 4 by accident.

crosbymichael · 2016-07-18T16:29:40Z

For all your formatting needs...

https://github.com/crosbymichael/vim-cfmt

mlaventure · 2016-07-18T16:38:51Z

libcontainer/nsenter/nsexec.c

-			exit(1);
-		}
+		/* Sync with parent. */
+		if (read(syncpipe[0], &s, sizeof(s)) != sizeof(s) || s != SYNC_VAL)


can we add back the extras () for easier reading?

IMO it isn't easier to read with more parentheses.

cyphar · 2016-07-18T16:38:55Z

@crosbymichael Cheers, I've added that to my dotfiles. Rebased. :P

mrunalp · 2016-07-18T16:52:58Z

libcontainer/nsenter/nsexec.c

 }

-static int clone_parent(jmp_buf *env, int flags) __attribute__((noinline));
-static int clone_parent(jmp_buf *env, int flags)
+static int clone_parent(jmp_buf * env, int flags) __attribute__ ((noinline));


Space between * and env :/

It looks like GNU indent doesn't know what the kernel style is. Fixed.

mlaventure · 2016-07-19T15:38:02Z

I like the new write_file take 👍

Although, the commit message will need to be changed before merging :)

cyphar · 2016-07-19T15:41:06Z

Of course, it's still a WIP because I still don't like bits of how the id mapping management is done. There's also some more code that I want to pull out of the rootless containers patchset and put here. :P

hqhq · 2016-07-22T07:00:19Z

libcontainer/nsenter/nsexec.c

-	}
+	len = write(fd, data, data_len);
+	if (len != data_len)
+		return -1;


Shall we close fd before return?

I'm currently thinking about that. In all of the error cases, we are going to exit so I don't think it really matters that we close all of the file descriptors. I'm currently trying to figure out how alloca interacts with clone (hint: the answer is "not nicely") but I might also end up just adding a memory leak because we're going to execve anyway (which clears all of the allocated memory from the old process anyway).

I might end up implementing it as a bunch of gotos with error strings but that would mean that we lose the benefit of exit(__COUNTER__ + 1).

otherwise you could return -(len != data_len) that way close(fd) will always be called
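A rough sketch of that suggestion, assuming a hypothetical `write_file(data, data_len, path)` signature and an already existing target file. The return value is computed from the write result first, so close(fd) runs on both the success and failure paths.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical write_file(): returns 0 on success, -1 on any failure. */
static int write_file(const char *data, size_t data_len, const char *path)
{
	int fd, ret;
	ssize_t len;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	len = write(fd, data, data_len);
	/* The comparison yields 1 on a failed or short write, so ret is 0 or -1. */
	ret = -(len < 0 || (size_t)len != data_len);

	/* close() is reached on every path, so the descriptor is never leaked. */
	if (close(fd) < 0)
		ret = -1;
	return ret;
}
```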

cyphar · 2016-07-22T16:37:47Z

Alright. I think I've included all of the refactorings and improvements that I wanted to. That was not as much fun as it looked. PTAL. Please make sure you test it on your local machine, as there's been quite a bit of oddness with our testbed recently.

/cc @crosbymichael @hqhq @mrunalp @dqminh

mlaventure · 2016-07-22T16:41:41Z

libcontainer/nsenter/nsexec.c

+
+					s = SYNC_USERMAP_ACK;
+					if (write(syncfd, &s, sizeof(s)) != sizeof(s)) {
+						kill(child, SIGKILL);


waitpid()?

What do you mean? We didn't waitpid before, and I don't think we can because the processes are cloned with CLONE_PARENT | SIGCHLD.

right, overlooked the CLONE_PARENT, sorry!

cyphar · 2016-07-27T12:02:01Z

/ping @opencontainers/runc-maintainers PTAL

cyphar · 2016-07-27T22:07:13Z

From this week's meeting, @crosbymichael suggested that we include a refactoring of the communication methods between the bootstrap process and the init. This would involve presumably figuring out how we should deal with both the netlink magic as well as the other bits and pieces.

Also, this is probably going to be the base for the console handling rewrite. So there's that. 😉

avagin · 2016-07-27T22:34:30Z

libcontainer/nsenter/nsexec.c

+
+	/* Retrieve data. */
+	size = NLMSG_PAYLOAD(&hdr, 0);
+	current = data = malloc(size);


Need to check that malloc doesn't return NULL here

NULL check still missing?

@cyphar I don't see a NULL check here..

Dammit, I must've accidentally put the fix in a different patch.

There, actually fixed this time.
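For reference, a minimal sketch of the kind of check being asked for. `alloc_payload` is a hypothetical helper, not a function from this PR; the actual code may well call bail() instead of returning NULL.

```c
#include <stdlib.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/*
 * Hypothetical helper: allocate a buffer large enough for the payload of
 * a netlink message. Returns NULL if the header is too short or if the
 * allocation fails; on success *size_out holds the payload length.
 */
static char *alloc_payload(const struct nlmsghdr *hdr, size_t *size_out)
{
	size_t size;
	char *data;

	/* Guard against underflow in NLMSG_PAYLOAD() on a truncated header. */
	if (hdr->nlmsg_len < (size_t)NLMSG_HDRLEN)
		return NULL;

	size = NLMSG_PAYLOAD(hdr, 0);
	data = malloc(size);
	if (!data)
		return NULL;	/* the NULL check requested in the review */

	*size_out = size;
	return data;
}
```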

haiyanmeng · 2016-08-01T15:16:41Z

@cyphar, @mrunalp, I tested this PR on Fedora 23, and make test finished successfully.

I also tested this PR on RHEL7, and the same failure happened as described in #915. However, I do not think the failure is due to this PR.
It seems that, on RHEL7, once a process joins another process's user namespace, it cannot run clone with the CLONE_NEWNS flag.

mrunalp · 2016-08-01T21:37:00Z

@cyphar I tested the PR. It doesn't fix #959.

cyphar · 2016-08-03T02:01:52Z

@mrunalp I believe this is related to @avagin's points about the changing of uid and gids. We need to set them to the root in the user namespace before unshare(USER) and then set them again after the unshare(USER) (before anything else). Working on it now.

cyphar · 2016-08-03T21:09:09Z

1, 2, 3 seems a bit complex and logically aren't they 3 identical sets of euid/egid being set? Can we just set them once before we do any clone/setns? We can derive the euid/egid from:

If we're not using user namespaces, they're all equivalent. If we are using user namespaces, they are not equivalent.

The first one changes us to have euid = root in the namespace we're joining. Without this you get security vulnerabilities (as @avagin mentioned) because you have euid = (kuid 0) and a racing process could ptrace it and start executing code as a root process.

The second one does a similar thing, but for a new mapping that we're creating (it's the same sort of argument). And the third one is required to unshare the other namespaces (your user must be mapped if you're using SELinux or some security policies).

In fact, I think the fourth one isn't necessary.

cyphar · 2016-08-03T22:21:06Z

@opencontainers/runc-maintainers :: Okay, @dqminh and I had a discussion about the mutual exclusivity of setns(2) and unshare(2). The current way nsexec is written, they are not enforced to be exclusive. Personally I think changing join_namespaces to make them mutually exclusive would make the code quite ugly (you'd need to parse the namespace string in order to figure out the right UID to change to, which would require more allocations).

Would everyone be fine if we enforce this in nl_parse (like we do for cloneflags)? Or what if we didn't enforce it in C at all (since we can enforce it in Go)? I personally prefer enforcing it in nl_parse as a policy decision rather than changing the way we join namespaces just to avoid this particular case (which may be useful in the future).
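To make the nl_parse option concrete, here is a sketch of what such a policy check could look like. Everything in it is an assumption for illustration: the `ns_conflict` and `nsflag` names, and the idea that the namespaces to join arrive as a comma-separated list of type names such as "net,pid". It rejects any namespace type that is both joined and requested in cloneflags, without allocating.

```c
#include <stddef.h>
#include <string.h>
#include <linux/sched.h>

/* Map a namespace type name of length len to its CLONE_NEW* flag (0 if unknown). */
static int nsflag(const char *name, size_t len)
{
	static const struct { const char *name; int flag; } table[] = {
		{ "ipc", CLONE_NEWIPC },  { "mnt", CLONE_NEWNS },
		{ "net", CLONE_NEWNET },  { "pid", CLONE_NEWPID },
		{ "user", CLONE_NEWUSER }, { "uts", CLONE_NEWUTS },
	};
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		if (strlen(table[i].name) == len && !strncmp(table[i].name, name, len))
			return table[i].flag;
	return 0;
}

/*
 * Returns 1 if any namespace type in the comma-separated join_types list
 * is also requested in cloneflags, i.e. it would be both setns'd and
 * unshared. Returns 0 otherwise.
 */
static int ns_conflict(int cloneflags, const char *join_types)
{
	const char *p = join_types;

	while (*p) {
		size_t len = strcspn(p, ",");

		if (cloneflags & nsflag(p, len))
			return 1;
		p += len;
		if (*p == ',')
			p++;
	}
	return 0;
}
```

A caller in the parsing code could then bail out as soon as `ns_conflict()` returns 1, leaving the namespace-joining logic itself unchanged.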

hqhq · 2016-08-08T08:57:09Z

libcontainer/nsenter/nsexec.c

+			config->consolefd = open(current, O_RDWR);
+			if (config->consolefd < 0)
+				bail("failed to open console %s", current);
+			break;


This wasn't introduced in this PR, but I have a question about consolePath: is there any specific reason that we have to handle consolePath in nsexec.c? It's only used by the exec process, so why can't we handle it in Go code like we do for the init process?

If possible, I think we should do so to make the C code as simple as possible. @crosbymichael

@hqhq It's because we setns inside the C code. Since we can't be sure where the slave path will be inside the container's mount namespace (or even whether there will be a path to it), we need to open it before we join the container's namespaces. This isn't a problem when we create a container, because nothing changes after an unshare.

All of this will be reworked as part of the --console rewrite in #814. But first we need to merge some of this code that will lay the groundwork for that other stuff. :P

hqhq · 2016-08-08T09:06:52Z

@cyphar I think you can split this PR to speed up the process. The cleanup part is a big change but can be merged rapidly, and splitting will also make it easier for other people to review and give other maintainers more confidence to merge :)

cyphar · 2016-08-08T09:56:00Z

@hqhq Fair point. :P I will break up this PR into four parts:

Cleanups. (nsenter: major cleanups #950)
The user namespace ordering and similar fixes (nsenter: guarantee correct user namespace ordering #977).
The setuid fixes. (nsenter: set {uid,gid} explicitly around namespace creation #975)
Fixes for the orphaning and similar issues (still a WIP). (nsenter: correctly handle pidns orphaning #976)

This PR blew up quite quickly. Hopefully the commit will rebase nicely. :P

cyphar · 2016-08-09T08:32:40Z

Okay, I think this PR is done (in terms of development). The rest of the changes will happen in the separate split PRs based on this one (see #975, #976, #977). PTAL so we can merge this finally.

@opencontainers/runc-maintainers

hqhq · 2016-08-12T09:21:18Z

LGTM, ping @mrunalp @crosbymichael @avagin @dqminh

dqminh · 2016-08-12T16:02:25Z

libcontainer/nsenter/nsexec.c

+	child = clone(child_func, ca.stack_ptr, CLONE_PARENT | SIGCHLD | flags, &ca);
+
+	/*
+	 * On old kernels, CLONE_PARENT didn't work with CLONE_NEWPID, so we have


nit: can we note the last known kernel that exhibits this problem, so that we can eventually remove this workaround?
If we want to support those kernels, it might also make sense to set up CI for them.

This is the commit that fixed this: torvalds/linux@1f7f4dd. And this is the commit which caused the issue: torvalds/linux@40a0d32.

The only mainline kernel which this affected is 3.12, but I can't really comment on kernels that this change might've been backported to.

dqminh · 2016-08-12T17:10:42Z

LGTM except a small nit

cyphar · 2016-08-12T17:16:46Z

Alright. Time for the final round of LGTMs.

/cc @opencontainers/runc-maintainers

Removed a lot of clutter, improved the style of the code, removed unnecessary complexity. In addition, made errors unique by making bail() exit with a unique error code. Most of this code comes from the current state of the rootless containers branch. Signed-off-by: Aleksa Sarai <asarai@suse.de>

dqminh · 2016-08-12T17:19:58Z

LGTM

mrunalp · 2016-08-12T18:08:10Z

@cyphar Where did the fix for #959 end up?

cyphar · 2016-08-13T03:49:06Z

@mrunalp It's in #975. The changes required to implement it require #977 (which is a pretty big change in and of itself), which is why I've split it like I have. This PR currently just cleans up the error handling, style and some other issues.

mrunalp · 2016-08-16T16:45:10Z

LGTM besides one comment about null check.

This just moves everything to one function so we don't have to pass a bunch of things to functions when there's no real benefit. It also makes the API nicer. Signed-off-by: Aleksa Sarai <asarai@suse.de>

cyphar · 2016-08-16T22:23:01Z

@mrunalp Fixed the NULL check.

/cc @opencontainers/runc-maintainers

avagin · 2016-08-16T22:49:11Z

LGTM

mrunalp · 2016-08-16T23:00:17Z

LGTM

cyphar · 2016-08-17T08:09:54Z

🎉 Now for the other PRs. :D

GordonTheTurtle added the status/0-triage label Jul 18, 2016

mlaventure reviewed Jul 18, 2016

mrunalp reviewed Jul 18, 2016

This was referenced Jul 20, 2016

Failed to join the user and pid namespaces of an existing runc container #960

Closed

Fix setting SELinux label for mqueue when user namespaces are enabled #959

Closed

hqhq reviewed Jul 22, 2016

mlaventure reviewed Jul 22, 2016

cyphar mentioned this pull request Jul 22, 2016

test_runtime.sh: Add a user namespace opencontainers/runtime-tools#114

Closed

avagin reviewed Jul 27, 2016

wking mentioned this pull request Jul 27, 2016

Add Runtime CLI Spec opencontainers/runtime-spec#513

Closed

cyphar added this to the 1.0.0 milestone Aug 1, 2016

This was referenced Aug 3, 2016

Joining existing pid ns not reparenting process #971

Open

--pid=container:<id> does not reparent zombies to pid 1 moby/moby#25348

Open

hqhq reviewed Aug 8, 2016

This was referenced Aug 8, 2016

nsenter: set {uid,gid} explicitly around namespace creation #975

Closed

nsenter: correctly handle pidns orphaning #976

Closed

nsenter: guarantee correct user namespace ordering #977

Merged

dqminh reviewed Aug 12, 2016

nsenter: simplify netlink parsing

4e72ffc

This just moves everything to one function so we don't have to pass a bunch of things to functions when there's no real benefit. It also makes the API nicer. Signed-off-by: Aleksa Sarai <asarai@suse.de>

mrunalp merged commit aee3f6f into opencontainers:master Aug 16, 2016

cyphar deleted the cleanup-nsenter branch August 17, 2016 08:09

AkihiroSuda mentioned this pull request Jan 23, 2018

make: validate C format #1699

Merged
