Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove MountFlags in systemd unit to allow shared mount propagation #22806

Merged
merged 1 commit into from
Jun 2, 2016

Conversation

errordeveloper
Copy link
Contributor

@thaJeztah
Copy link
Member

Thanks! This may need some eyes to check if there's possible side-effects, so here goes;

ping @vbatts @rhatdan @alexlarsson (hi!) @runcom @cpuguy83 @AkihiroSuda @rhvgoyal ptal ❤️

@rhvgoyal
Copy link
Contributor

Two things.

  • I am wondering what's the advantage of using a separate namespace with "shared" mount propagation properties. Can we run docker daemon in host mount namespace and that should cut down on number of propagations which happen between host mount namespace and daemon mount namespace.
  • We will most likely also have to look into enabling devicemapper deferred device removal and deferred device deletion at some point of time. Reason being that daemon mounts will be visible on host and there are chances that it can leak into other namespace. And that means that mount point
    or underlying device can be busy when container is exiting or being removed. And that operation will fail. Deferred removal and deferred deletion will help with it.

@rhvgoyal
Copy link
Contributor

BTW, In fedora systemd unit file we have removed MountFlags completely. That means docker daemon is now running in host mount namespace.

@rhvgoyal
Copy link
Contributor

cc @rhatdan

@errordeveloper
Copy link
Contributor Author

BTW, In fedora systemd unit file we have removed MountFlags completely. That means docker daemon is now running in host mount namespace.

This make a lot of sense, happy to consider to implement that instead of what I did here.

My primary concern is about running containerised kubelet, and secondary is that propagation flags feature is actually unusable with systemd unit as is right.

@rhvgoyal
Copy link
Contributor

Right. We wanted to use mount propagation flag feature too to be able to do mounts from inside the containers and be visible in host mount namespace and that's the primary reason to run docker daemon in host mount namespace.

@rhatdan
Copy link
Contributor

rhatdan commented May 18, 2016

We have moved docker daemon on RHEL and Fedora to run in the same mount namespace as the host system, in order to allow mount propagation to work.

@vbatts
Copy link
Contributor

vbatts commented May 18, 2016

Yeah. I would just remove the MountFlags= entirely. It was added due to container mounts showing in the host's mount and, and then if another pid unshared (I.e. cups) then it would hold open the mountinfo. Im guessing that could still be an issue, but forcing the daemon pid and all children to propagate was causing more issues. The only best way to handle this is with directly invoking and not having a daemon in the middle, but that isn't how things are structured.

@thaJeztah
Copy link
Member

What are the user-facing side effects of removing the MountFlags? Anything that should be documented if we decide to go that way?

@rhvgoyal
Copy link
Contributor

@thaJeztah One user facing impact is with devicemapper as I already mentioned. You might get device busy errors more often w.r.t container exits and deletion. So for these users we will have to ask them to use deferred removal and deferred deletion capabilities. If it works well, we might want to make it default. (Right now these are disabled by default and one has to opt in).

We enable it by default on fedora/rhel with the help of docker-storage-setup.

@AkihiroSuda
Copy link
Member

@thaJeztah LGTM(non-binding) for removing the line

@LK4D4
Copy link
Contributor

LK4D4 commented May 23, 2016

@errordeveloper do you think it's better just to remove MountFlags?

@errordeveloper
Copy link
Contributor Author

I'll test if removing the flag works for my use-case.

Signed-off-by: Ilya Dmitrichenko <errordeveloper@gmail.com>
@errordeveloper
Copy link
Contributor Author

I've tested it and it works for me with containerised kubelet, update the commit.

@errordeveloper errordeveloper changed the title Change MountFlags in systemd unit to allow shared mount propagation Remove MountFlags in systemd unit to allow shared mount propagation May 24, 2016
@thaJeztah
Copy link
Member

LGTM

1 similar comment
@tianon
Copy link
Member

tianon commented Jun 2, 2016

LGTM

@andyxning
Copy link

andyxning commented Dec 21, 2016

Just for reference, MountFlags=slave will not work properly with --live-restore when docker restarts. The existing docker containers network info will be failed to restore.

EDIT

Our OS distro is Debian Jessie with 3.16.x kernel.

@rhatdan
Copy link
Contributor

rhatdan commented Dec 21, 2016

Yes on RHEL this will be a problem until we fix the underlying kernel issues.

@rhatdan
Copy link
Contributor

rhatdan commented Mar 2, 2017

I guess we should have reverted this when we found additional issues in RHEL systems.

This article explains what the problems are.
https://access.redhat.com/articles/2938171

@cpuguy83
Copy link
Member

cpuguy83 commented Mar 2, 2017

@rhatdan It might be better to handle the mount ns setup in dockerd for finer-grained control and have it work the same regardless of how docker is started.

@cpuguy83
Copy link
Member

cpuguy83 commented Mar 2, 2017

@rhatdan Would love to work on this so we can get it right.

@rhvgoyal
Copy link
Contributor

rhvgoyal commented Mar 2, 2017

@cpuguy83 Or may be leave it to caller. By running in a private namespace, we also lose some features. For example, shared volume proagation does not does not propagate all the way to host. --live-restore is another issue.

So looks like bunch of features expect docker daemon to be running into host mount namespace. At the same time it can cause problems if those mount points leak somewhere else.

Given that I have received so many complaints of device being busy with devicemapper driver, a part of me says that docker should always be run in a separate mount namespace to reduce the possibility of accidental mount point leaks and sacrifice the some features.

@thaJeztah
Copy link
Member

PR opened here; #31490

robertgzr added a commit to balena-os/meta-balena that referenced this pull request May 12, 2020
* MountFlags=slave
  preventing us from using live restore functionality
  moby/moby#22806 (comment)
  https://access.redhat.com/articles/2938171

* LimitNOFILE=infinity LimitNPROC=infinity
  not-insignificant performance overhead due to
  limits being propagated to all children (containerd + containers)
  moby/moby@8db6109

* Delegate=yes
  allow docker to manage it's cgroup subtree without systemd
  interference
  moby/moby#20152
  moby/moby@d16737f

* TasksMax=infinity
  prevent systemd from setting a default task limit of 512 on the engine
  cgroup, on linux >=4.3
  systemd/systemd#1239
  systemd/systemd#1886
robertgzr added a commit to balena-os/meta-balena that referenced this pull request May 14, 2020
* MountFlags=slave
  preventing us from using live restore functionality
  moby/moby#22806 (comment)
  https://access.redhat.com/articles/2938171

* LimitNOFILE=infinity LimitNPROC=infinity
  not-insignificant performance overhead due to
  limits being propagated to all children (containerd + containers)
  moby/moby@8db6109

* Delegate=yes
  allow docker to manage it's cgroup subtree without systemd
  interference
  moby/moby#20152
  moby/moby@d16737f

* TasksMax=infinity
  prevent systemd from setting a default task limit of 512 on the engine
  cgroup, on linux >=4.3
  systemd/systemd#1239
  systemd/systemd#1886
robertgzr added a commit to balena-os/meta-balena that referenced this pull request May 27, 2020
* MountFlags=slave
  preventing us from using live restore functionality
  moby/moby#22806 (comment)
  https://access.redhat.com/articles/2938171

* LimitNOFILE=infinity LimitNPROC=infinity
  not-insignificant performance overhead due to
  limits being propagated to all children (containerd + containers)
  moby/moby@8db6109

* Delegate=yes
  allow docker to manage it's cgroup subtree without systemd
  interference
  moby/moby#20152
  moby/moby@d16737f

* TasksMax=infinity
  prevent systemd from setting a default task limit of 512 on the engine
  cgroup, on linux >=4.3
  systemd/systemd#1239
  systemd/systemd#1886
robertgzr added a commit to balena-os/meta-balena that referenced this pull request May 27, 2020
* MountFlags=slave
  preventing us from using live restore functionality
  moby/moby#22806 (comment)
  https://access.redhat.com/articles/2938171

* LimitNOFILE=infinity LimitNPROC=infinity
  not-insignificant performance overhead due to
  limits being propagated to all children (containerd + containers)
  moby/moby@8db6109

* Delegate=yes
  allow docker to manage it's cgroup subtree without systemd
  interference
  moby/moby#20152
  moby/moby@d16737f

* TasksMax=infinity
  prevent systemd from setting a default task limit of 512 on the engine
  cgroup, on linux >=4.3
  systemd/systemd#1239
  systemd/systemd#1886

Signed-off-by: Robert Günzler <robertg@balena.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.