Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsenter: major cleanups #950

Merged
merged 2 commits into from
Aug 16, 2016
Merged

nsenter: major cleanups #950

merged 2 commits into from
Aug 16, 2016

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Jul 18, 2016

Just to make the code much easier to read, also removing redundant code. I also added the functionality that setns(2) will fail if we passed the wrong path to the bootstrap process (to avoid bugs in runC's namespace sending being non-fatal).

In addition, I switch pr_perror to bail, where the process will exit with a distinct error code. Currently we aren't handling syncT properly within the bootstrap process. But this is a good stopgap.

There are also lots of other fixes that make the code more robust, easier to read and much nicer to maintain. Some of this code comes from the rootless containers PR #774.

TODO:

  • Make errors distinctive (still not perfect but good enough for now).
  • Simplify netlink parsing code.
  • ❌ Refactor pipes within nsenter and with bootstrap process. (not relevant here, there's only one pipe for bootstrap and one for the inter-process stuff).

There are other cleanup parts in #975 and #976 and #977.

Signed-off-by: Aleksa Sarai asarai@suse.de

* List of netlink message types sent to us as part of bootstrapping the init.
* These constants are defined in libcontainer/message_linux.go.
*/
#define INIT_MSG 62000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: some alignment issues here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. I set my editor to a tabsize of 4 by accident.

@crosbymichael
Copy link
Member

For all your formatting needs...

https://github.com/crosbymichael/vim-cfmt

exit(1);
}
/* Sync with parent. */
if (read(syncpipe[0], &s, sizeof(s)) != sizeof(s) || s != SYNC_VAL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add back the extras () for easier reading?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it isn't easier to read with more parenthesis.

@cyphar
Copy link
Member Author

cyphar commented Jul 18, 2016

@crosbymichael Cheers, I've added that to my dotfiles. Rebased. :P

}

static int clone_parent(jmp_buf *env, int flags) __attribute__((noinline));
static int clone_parent(jmp_buf *env, int flags)
static int clone_parent(jmp_buf * env, int flags) __attribute__ ((noinline));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space between * and env :/

Copy link
Member Author

@cyphar cyphar Jul 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like GNU indent doesn't know what the kernel style is. Fixed.

@mlaventure
Copy link
Contributor

I like the new write_file take 👍

Although, the commit message will need to be changed before merging :)

@cyphar
Copy link
Member Author

cyphar commented Jul 19, 2016

Of course, it's still a wip because I still don't like bits of how the id mapping management are done. There's also some more code that I want to pull out of the rootless containers patchset to put it here. :P

}
len = write(fd, data, data_len);
if (len != data_len)
return -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we close fd before return?

Copy link
Member Author

@cyphar cyphar Jul 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm currently thinking about that. In all of the error cases, we are going to exit so I don't think it really matters that we close all of the file descriptors. I'm currently trying to figure out how alloca interacts with clone (hint: the answer is "not nicely") but I might also end up just adding a memory leak because we're going to execve anyway (which clears all of the allocated memory from the old process anyway).

I might end up implementing it as a bunch of gotos with error strings but that would mean that we lose the benefit of exit(__COUNTER__ + 1).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise you could return -(len != data_len) that way close(fd) will always be called

@cyphar
Copy link
Member Author

cyphar commented Jul 22, 2016

Alright. I think I've included all of the refactorings and improvements that I wanted to. That was not as much fun as it looked. PTAL. Please make sure you test it on your local machine, as there's been quite a bit of oddness with our testbed recently.

/cc @crosbymichael @hqhq @mrunalp @dqminh


s = SYNC_USERMAP_ACK;
if (write(syncfd, &s, sizeof(s)) != sizeof(s)) {
kill(child, SIGKILL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waitpid()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? We didn't waitpid before, and I don't think we can because the processes are cloned with CLONE_PARENT | SIGCHLD.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, overlooked the CLONE_PARENT, sorry!

@cyphar
Copy link
Member Author

cyphar commented Jul 27, 2016

/ping @opencontainers/runc-maintainers PTAL

@cyphar
Copy link
Member Author

cyphar commented Jul 27, 2016

From this week's meeting, @crosbymichael suggested that we include a refactoring of the communication methods between the bootstrap process and the init. This would involve presumably figuring out how we should deal with both the netlink magic as well as the other bits and pieces.

Also, this is probably going to be the base for the console handling rewrite. So there's that. 😉


/* Retrieve data. */
size = NLMSG_PAYLOAD(&hdr, 0);
current = data = malloc(size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check that malloc doesn't return NULL here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix'd.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NULL check still missing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar I don't see a NULL check here..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dammit, I must've accidentally put the fix in a different patch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There, actually fixed this time.

@cyphar cyphar added this to the 1.0.0 milestone Aug 1, 2016
@haiyanmeng
Copy link
Contributor

@cyphar , @mrunalp , I tested this PR on Fodora 23, and make test finished successfully.

I also tested this PR on RHEL7, and the same failure happened as described in #915. However, I do not think the failure is due to this PR.
It seems that, on RHEL7, once a process joins another process's user namespace, it can not
run clone with the CLONE_NEWNS flag.

@mrunalp
Copy link
Contributor

mrunalp commented Aug 1, 2016

@cyphar I tested the PR. It doesn't fix #959.

@cyphar
Copy link
Member Author

cyphar commented Aug 3, 2016

@mrunalp I believe this is related to @avagin's points about the changing of uid and gids. We need to set them to the root in the user namespace before unshare(USER) and then set them again after the unshare(USER) (before anything else). Working on it now.

@cyphar
Copy link
Member Author

cyphar commented Aug 3, 2016

1, 2, 3 seems a bit complex and logically aren't they 3 identical set of euid/guid being set ? Can we just set them once before we do any clone/setns ? We can derive the euid/egid from:

If we're not using user namespaces, they're all equivalent. If we are using user namespaces, they are not equivalent.

The first one changes us to have euid = root in the namespace we're joining. WIthout this you get security vulnerabilities (as @avagin mentioned) because you have euid = (kuid 0) and a racing process could ptrace it and start executing code as a root process.

The second one does a similar thing, but for a new mapping that we're creating (it's the same sort of argument). And the third one is required to unshare the other namespaces (your user must be mapped if you're using SELinux or some security policies).

In fact, I think the fourth one isn't necessary.

@cyphar
Copy link
Member Author

cyphar commented Aug 3, 2016

@opencontainers/runc-maintainers :: Okay, @dqminh and I had a discussion about the mutual exclusivity of setns(2) and unshare(2). The current way nsexec is written, they are not enforced to be exclusive. Personally I think changing join_namespaces to make them mutually exclusive would make the code quite ugly (you'd need to parse the namespace string in order to figure out the right UID to change to, which would require more allocations).

Would everyone be fine if we enforce this in nl_parse (like we do cloneflags). Or what if we didn't enforce it in C at all (since we can enforce it in Go)? I peronally prefer enforcing it in nl_parse as a policy decision rather than changing the way we join namespaces just to avoid this particular case (which may be useful in the feature).

config->consolefd = open(current, O_RDWR);
if (config->consolefd < 0)
bail("failed to open console %s", current);
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not introduced in this PR, I have a question about consolePath, is there any specific reasons that we have to handle consolePath in nsexec.c? It's only used by exec process, why can't we handle it in go code like we do for init process?

If possible, I think we should do so to make c code as simple as possible. @crosbymichael

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hqhq It's because we setns inside the C code, and we can't be sure of where the slave path will be inside the container's mount namespace (or even if there will be a path to it) we need to open it before we join the container's namespaces. This isn't a problem for when we create a container, because nothing changes after an unshare.

All of this will be reworked as part of the --console rewrite in #814. But first we need to merge some of this code that will lay the groundwork for that other stuff. :P

@hqhq
Copy link
Contributor

hqhq commented Aug 8, 2016

@cyphar I think you can split this PR to speed up the process, the clean up part is a big change but can be merged rapidly, it'll also make other people easier to review and give other maintainers more confidence to merge :)

@cyphar
Copy link
Member Author

cyphar commented Aug 8, 2016

@hqhq Fair point. :P I will break up this PR into three parts:

  1. Cleanups. (nsenter: major cleanups #950)
  2. The user namespace ordering and similar fixes (nsenter: guarantee correct user namespace ordering #977).
  3. The setuid fixes. (nsenter: set {uid,gid} explicitly around namespace creation #975)
  4. Fixes for the orphaning and similar issues (still a WIP). (nsenter: correctly handle pidns orphaning #976)

This PR blew up quite quickly. Hopefully the commit will rebase nicely. :P

@cyphar
Copy link
Member Author

cyphar commented Aug 9, 2016

Okay, I think this PR is done (in terms of development). The rest of the changes will happen in the separate split PRs based on this one (see #975, #976, #977). PTAL so we can merge this finally.

@opencontainers/runc-maintainers

@hqhq
Copy link
Contributor

hqhq commented Aug 12, 2016

LGTM, ping @mrunalp @crosbymichael @avagin @dqminh

Approved with PullApprove

child = clone(child_func, ca.stack_ptr, CLONE_PARENT | SIGCHLD | flags, &ca);

/*
* On old kernels, CLONE_PARENT didn't work with CLONE_NEWPID, so we have
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we note an example of the last known kernel that exhibits this problem so maybe eventually we can remove this if possible ?
If we want to support those kernels, it might also make sense to setup CI for it ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the commit that fixed this: torvalds/linux@1f7f4dd. And this is the commit which caused the issue: torvalds/linux@40a0d32.

The only mainline kernel which this affected is 3.12, but I can't really comment on kernels that this change might've been backported to.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix'd.

@dqminh
Copy link
Contributor

dqminh commented Aug 12, 2016

LGTM except a small nit

Approved with PullApprove

@cyphar
Copy link
Member Author

cyphar commented Aug 12, 2016

Alright. Time for the final round of LGTMs.

/cc @opencontainers/runc-maintainers

Removed a lot of clutter, improved the style of the code, removed
unnecessary complexity. In addition, made errors unique by making bail()
exit with a unique error code. Most of this code comes from the current
state of the rootless containers branch.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
@dqminh
Copy link
Contributor

dqminh commented Aug 12, 2016

LGTM

Approved with PullApprove

@mrunalp
Copy link
Contributor

mrunalp commented Aug 12, 2016

@cyphar Where did the fix for #959 end up?

@cyphar
Copy link
Member Author

cyphar commented Aug 13, 2016

@mrunalp It's in #975. The changes required to implement it require #977 (which is a pretty big change in of itself), whic his why I've split it like I have. This PR currently just cleans up the error handling, style and some other issues.

@mrunalp
Copy link
Contributor

mrunalp commented Aug 16, 2016

LGTM besides one comment about null check.

Approved with PullApprove

This just moves everything to one function so we don't have to pass a
bunch of things to functions when there's no real benefit. It also makes
the API nicer.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
@cyphar
Copy link
Member Author

cyphar commented Aug 16, 2016

@mrunalp Fixed the NULL check.

/cc @opencontainers/runc-maintainers

@avagin
Copy link
Contributor

avagin commented Aug 16, 2016

LGTM

Approved with PullApprove

1 similar comment
@mrunalp
Copy link
Contributor

mrunalp commented Aug 16, 2016

LGTM

Approved with PullApprove

@mrunalp mrunalp merged commit aee3f6f into opencontainers:master Aug 16, 2016
@cyphar
Copy link
Member Author

cyphar commented Aug 17, 2016

🎉 Now for the other PRs. :D

@cyphar cyphar deleted the cleanup-nsenter branch August 17, 2016 08:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants