ch-run: can't join user namespace of pid <PID>: Invalid argument #1270

Closed
heasterday opened this issue Jan 19, 2022 · 8 comments · Fixed by #1445

Comments

@heasterday
Contributor

While doing some prerelease testing I bumped into the following intermittent failure:

ch-run[9314]: error: can't join user namespace of pid 9298: Invalid argument (ch_core.c:347 22)

Running the test suite some more got me this:

   (in test file run/ch-run_join.bats, line 417)
     `cat "${BATS_TMPDIR}/join.2.ns"' failed
   ch-run[5956]: join: peer group size from command line (ch-run.c:273)
   ch-run[5956]: join: peer group tag from command line (ch-run.c:297)
   ch-run[5956]: verbosity: 2 (ch-run.c:183)
   ch-run[5956]: image: /var/tmp/images/chtest (ch-run.c:184)
   ch-run[5956]: newroot: /var/tmp/images/chtest (ch-run.c:185)
   ch-run[5956]: container uid: 29407 (ch-run.c:186)
   ch-run[5956]: container gid: 29407 (ch-run.c:187)
   ch-run[5956]: join: 1 2 bar 0 (ch-run.c:189)
   ch-run[5956]: private /tmp: 0 (ch-run.c:190)
   ch-run[5956]: join: I won (ch_core.c:281)
   ch-run[5956]: join: winner initializing shared data (ch_core.c:307)
   ch-run[5956]: join: 1 peers left excluding myself (ch_core.c:314)
   ch-run[5956]: join: done (ch_core.c:329)
   ch-run[5956]: new $PATH: <PATH>
   ch-run[5956]: executing: /test/printns 5 /tmp/ch-test.tmp.heasterday/bats.tmp/join.1.ns (ch_core.c:366)
   /proc/self/ns/net:<host>.localdomain:4026531956
   /proc/self/ns/uts:<host>.localdomain:4026531838
   /proc/self/ns/ipc:<host>.localdomain:4026531839
   /proc/self/ns/pid:<host>.localdomain:4026531836
   /proc/self/ns/user:<host>.localdomain:4026533692
   /proc/self/ns/mnt:<host>.localdomain:4026533693
   found pid: 5956
   ch-run[5998]: join: peer group size from command line (ch-run.c:273)
   ch-run[5998]: join: peer group tag from command line (ch-run.c:297)
   ch-run[5998]: verbosity: 2 (ch-run.c:183)
   ch-run[5998]: image: /var/tmp/images/chtest (ch-run.c:184)
   ch-run[5998]: newroot: /var/tmp/images/chtest (ch-run.c:185)
   ch-run[5998]: container uid: 29407 (ch-run.c:186)
   ch-run[5998]: container gid: 29407 (ch-run.c:187)
   ch-run[5998]: join: 1 2 bar 0 (ch-run.c:189)
   ch-run[5998]: private /tmp: 0 (ch-run.c:190)
   ch-run[5998]: join: I lost (ch_core.c:285)
   ch-run[5998]: joining namespaces of pid 5956 (ch_core.c:353)
   ch-run[5998]: error: can't join user namespace of pid 5956: Invalid argument (ch_core.c:347 22)
   cat: /tmp/ch-test.tmp.heasterday/bats.tmp/join.2.ns: No such file or directory
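
The printns lines in the log above record a namespace ID for each namespace type, and the failing step is the second peer joining the first peer's user namespace. As a rough way to inspect the same IDs by hand, the sketch below reads the /proc/PID/ns links on Linux. The function names (ns_id_from_link, ns_id, same_ns) are hypothetical helpers for illustration and are not part of Charliecloud or its test suite.

```shell
# Extract the numeric ID from a namespace link target
# such as "user:[4026533692]".
ns_id_from_link () {
    printf '%s\n' "$1" | sed 's/^[a-z_]*:\[\([0-9]*\)]$/\1/'
}

# Print the ID of namespace type $2 (e.g. user, mnt) for process $1.
# $1 is a PID or the literal word "self".
ns_id () {
    ns_id_from_link "$(readlink "/proc/$1/ns/$2")"
}

# Succeed if processes $1 and $2 are in the same namespace of type $3.
same_ns () {
    [ "$(ns_id "$1" "$3")" = "$(ns_id "$2" "$3")" ]
}
```

For example, `same_ns 5956 5998 user` would succeed only if both PIDs report the same user namespace ID. Reading another user's /proc/PID/ns links may require extra privileges.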
@reidpr reidpr changed the title ch-run: can't join user namespace of pid <PID>: Invalid argument ch-run: can't join user namespace of pid <PID>: Invalid argument Jan 20, 2022
@reidpr reidpr self-assigned this Jan 20, 2022
@reidpr reidpr added high and removed medium labels Jan 20, 2022
@reidpr
Collaborator

reidpr commented Jan 21, 2022

FWIW, I was not able to reproduce this on either of my Debian VMs; 166 test runs with no error.

@reidpr
Collaborator

reidpr commented Jan 21, 2022

@heasterday reproduced it on a CentOS 7 VM with both kernel 3.10.0-1160.31.1.el7.x86_64 and 3.10.0-1160.53.1.el7.x86_64.

@reidpr
Collaborator

reidpr commented Jan 21, 2022

@heasterday also reproduced it on CentOS 7 with Charliecloud 0.25, so it's not new.

@heasterday
Contributor Author

Summary of my notes so far:

Exhibits the issue:

  • CentOS 7: Occurs with 3.10.0-1160.31.1.el7.x86_64 and 3.10.0-1160.53.1.el7.x86_64.
  • Testbed based on RHEL 7: Occurs with 3.10.0-1160.45.1.1

Doesn't exhibit the issue:

  • CentOS 8: Does not occur with 4.18.0-348.7.1.el8_5.x86_64
  • Ubuntu 20.04: Does not occur with 5.4.0-91-generic
  • CentOS 7: Does not occur with a 5.4 kernel from ELRepo
  • Testbed based on SLES 15: Does not occur with 5.3.18-24.78-default
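
In the notes above, every kernel that showed the failure is a 3.10 kernel and every kernel that did not is 4.18 or newer. A minimal sketch for flagging a host against that observation follows. It reflects only the kernels listed here, not a confirmed affected range, and kernel_series and seen_failing are hypothetical names.

```shell
# Print the "major.minor" series of a kernel release string,
# e.g. "3.10.0-1160.31.1.el7.x86_64" gives "3.10".
kernel_series () {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeed if the release is in the only series reported as failing
# in this thread (3.10). This is an observation, not a verified bound.
seen_failing () {
    [ "$(kernel_series "$1")" = "3.10" ]
}
```

Usage: `seen_failing "$(uname -r)" && echo "kernel series matches the failing reports"`.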

@heasterday
Contributor Author

@reidpr Let me know if there are other tests you would like to see.

@heasterday
Contributor Author

heasterday commented Feb 18, 2022

We received an excellent user report on this one that seems to indicate that this bug was introduced by the syslog support. I previously reported that I could reproduce this issue on 0.25, but I suspect I had a problem in my testing environment. Our working suspicion is a race condition in which some child spawned by the syslog call hasn't exited by the time we call unshare. This is supported by two things:

  • The issue appears to go away if we disable the syslog support.
  • Adding a sleep() call after the call to syslog() seems to prevent it.

Workaround: disable the syslog support via --disable-syslog at configure time.

@reidpr
Collaborator

reidpr commented Feb 18, 2022

I did look at the source code in glibc for syslog(3) and did not see any forking, so I think my earlier hypothesis (noted above) that it was child-related may be wrong, and some other resource is racing.

@heasterday heasterday removed their assignment Aug 24, 2022
@reidpr reidpr added medium and removed high labels Sep 8, 2022
@reidpr reidpr added this to the 0.30 milestone Sep 8, 2022
@reidpr
Collaborator

reidpr commented Sep 8, 2022

I can reproduce the bug in a few minutes on my CentOS 7 VM with kernel 3.10.0-1160.71.1.el7.x86_64, with:

$ rm -f /dev/shm/*; while bin/ch-test -b docker --pack-fmt tar-unpack -f test/run/ch-run_join.bats:'ch-run --join: three peers, direct launch'; do true; done

(retyped, so beware typos)
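
The loop above reruns the test until it first fails but does not say how many runs that took. A small sketch of the same idea with a run counter and an upper bound is below. run_until_fail is a hypothetical helper, not something shipped with Charliecloud, and the cap is an arbitrary choice.

```shell
# Run a command repeatedly until it fails or MAX_RUNS is reached.
# Usage: run_until_fail MAX_RUNS COMMAND [ARG...]
# Returns 1 on the first failure, 0 if every run succeeded.
run_until_fail () {
    max=$1
    shift
    n=0
    while [ "$n" -lt "$max" ]; do
        if ! "$@"; then
            echo "failed on run $((n + 1))"
            return 1
        fi
        n=$((n + 1))
    done
    echo "no failure in $n runs"
    return 0
}
```

With the command from this comment it could be invoked as `rm -f /dev/shm/*; run_until_fail 500 bin/ch-test -b docker --pack-fmt tar-unpack -f test/run/ch-run_join.bats:'ch-run --join: three peers, direct launch'`, which carries over the same retyping caveat.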
