Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICEs caused by new fcntl call in jobserver dependency on Linux #88091

Closed
anp opened this issue Aug 16, 2021 · 5 comments · Fixed by rust-lang/cargo#9798
Closed

ICEs caused by new fcntl call in jobserver dependency on Linux #88091

anp opened this issue Aug 16, 2021 · 5 comments · Fixed by rust-lang/cargo#9798
Labels
C-bug Category: This is a bug.

Comments

@anp
Copy link
Member

anp commented Aug 16, 2021

We're currently unable to update Fuchsia's toolchain to the latest nightly with builds failing due to this ICE:

RUST_BACKTRACE=1 ../../prebuilt/third_party/rust/linux-x64/bin/rustc --color=always --crate-name predicates ../../third_party/rust_crates/vendor/predicates/src/lib.rs --crate-type rlib --emit=dep-info=host_arm64-coverage-rust/obj/third_party/rust_crates/libpredicates-18699c217c61a73d.rlib.d,link -Zdep-info-omit-d-target --cap-lints=allow --edition=2018 -Cmetadata=18699c217c61a73d -Cextra-filename=-18699c217c61a73d --cfg=feature=\"default\" --cfg=feature=\"difference\" --cfg=feature=\"float-cmp\" --cfg=feature=\"normalize-line-endings\" --cfg=feature=\"regex\" -Zmutable-noalias=off --cfg=rust_panic=\"unwind\" --cfg=__rust_toolchain=\"HoZdYmTRbpKgjJUDnZHD-heytrg9UsK59r0cE5bC9Q4C\" -Clinker=../../prebuilt/third_party/clang/linux-x64/bin/clang++ -Cdefault-linker-libraries -Clink-arg=-static-libstdc++ -Clink-arg=-Wl,-rpath=\$ORIGIN/ -Clink-arg=--sysroot=../../prebuilt/third_party/sysroot/linux -Clink-arg=--target=aarch64-unknown-linux-gnu -Clink-arg=-stdlib=libc++ -Clink-arg=-unwindlib= -Clink-arg=-rtlib=compiler-rt -Clink-arg=-fuse-ld=lld -Clink-arg=-Wl,--build-id -Copt-level=1 -Cdebuginfo=2 -Zallow-features= --target aarch64-unknown-linux-gnu --cap-lints=deny -Dwarnings -Cdebug-assertions=yes -Clinker=../../prebuilt/third_party/clang/linux-x64/bin/clang++ -Cdefault-linker-libraries -Clink-arg=-static-libstdc++ -Clink-arg=-Wl,-rpath=\$ORIGIN/ -Clink-arg=--sysroot=../../prebuilt/third_party/sysroot/linux -Clink-arg=--target=aarch64-unknown-linux-gnu -Clink-arg=-stdlib=libc++ -Clink-arg=-unwindlib= -Clink-arg=-rtlib=compiler-rt -Clink-arg=-fuse-ld=lld -Clink-arg=-Wl,--build-id -Zinstrument-coverage -Clink-arg=-Wl,-z,nostart-stop-gc -o host_arm64-coverage-rust/obj/third_party/rust_crates/libpredicates-18699c217c61a73d.rlib -Ldependency=host_arm64-coverage-rust/obj/third_party/rust_crates -ldl -lpthread --extern difference=host_arm64-coverage-rust/obj/third_party/rust_crates/libdifference-729c4c18786eef68.rlib --extern float_cmp=host_arm64-coverage-rust/obj/third_party/rust_crates/libfloat_cmp-25495b83f43346cf.rlib --extern normalize_line_endings=host_arm64-coverage-rust/obj/third_party/rust_crates/libnormalize_line_endings-4054253ee9a0f431.rlib --extern predicates_core=host_arm64-coverage-rust/obj/third_party/rust_crates/libpredicates_core-9278ecd8b23c9da2.rlib --extern regex=host_arm64-coverage-rust/obj/third_party/rust_crates/libregex-be3eec66bf66867d.rlib && ../../prebuilt/third_party/python3/linux-x64/bin/python3.8 -S ../../build/gn/verify_depfile.py -t "//third_party/rust_crates:predicates-v1_0_7(//build/toolchain:host_arm64-coverage-rust)" -d host_arm64-coverage-rust/obj/third_party/rust_crates/libpredicates-18699c217c61a73d.rlib.d 
thread 'rustc' panicked at 'failed to create jobserver: Custom { kind: PermissionDenied, error: "failed to increase jobserver pipe capacity from 4096 to 8192; jobserver otherwise might deadlock" }', compiler/rustc_data_structures/src/jobserver.rs:23:38
stack backtrace:
   0: rust_begin_unwind
             at /b/s/w/ir/x/w/rust/library/std/src/panicking.rs:517:5
   1: core::panicking::panic_fmt
             at /b/s/w/ir/x/w/rust/library/core/src/panicking.rs:93:14
   2: core::result::unwrap_failed
             at /b/s/w/ir/x/w/rust/library/core/src/result.rs:1617:5
   3: core::ops::function::FnOnce::call_once
   4: std::sync::once::Once::call_once_force::{{closure}}
   5: std::sync::once::Once::call_inner
             at /b/s/w/ir/x/w/rust/library/std/src/sync/once.rs:418:21
   6: rustc_data_structures::jobserver::client
   7: rustc_session::session::build_session
   8: rustc_interface::util::create_session
   9: rustc_interface::interface::create_compiler_and_run
  10: scoped_tls::ScopedKey<T>::set
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
error: internal compiler error: unexpected panic
note: the compiler unexpectedly panicked. this is a bug.
note: we would appreciate a bug report: https://github.com/rust-lang/rust/issues/new?labels=C-bug%2C+I-ICE%2C+T-compiler&template=ice.md
note: rustc 1.56.0-nightly (4e900176b 2021-08-12) running on x86_64-unknown-linux-gnu
note: compiler flags: -Z dep-info-omit-d-target -Z mutable-noalias=off -Z allow-features= -Z instrument-coverage -C linker=../../prebuilt/third_party/clang/linux-x64/bin/clang++ -C default-linker-libraries -C link-arg=-static-libstdc++ -C link-arg=-Wl,-rpath=$ORIGIN/ -C link-arg=--sysroot=../../prebuilt/third_party/sysroot/linux -C link-arg=--target=aarch64-unknown-linux-gnu -C link-arg=-stdlib=libc++ -C link-arg=-unwindlib= -C link-arg=-rtlib=compiler-rt -C link-arg=-fuse-ld=lld -C link-arg=-Wl,--build-id -C opt-level=1 -C debuginfo=2 -C debug-assertions=yes -C linker=../../prebuilt/third_party/clang/linux-x64/bin/clang++ -C default-linker-libraries -C link-arg=-static-libstdc++ -C link-arg=-Wl,-rpath=$ORIGIN/ -C link-arg=--sysroot=../../prebuilt/third_party/sysroot/linux -C link-arg=--target=aarch64-unknown-linux-gnu -C link-arg=-stdlib=libc++ -C link-arg=-unwindlib= -C link-arg=-rtlib=compiler-rt -C link-arg=-fuse-ld=lld -C link-arg=-Wl,--build-id -C link-arg=-Wl,-z,nostart-stop-gc --crate-type rlib
note: some of the compiler flags provided by cargo are hidden
query stack during panic:
end of query stack

We think that this error is occurring on our builders due to many concurrent instances of rustc exhausting /proc/sys/fs/pipe-user-pages-soft which limits the number of pipe pages a user is allowed to have:

The default value for this file is 16384, which permits creating up to 1024 pipes with the default capacity.

It seems that the failure started with e62cd40 which updated the cargo subtree, causing the jobserver dependency for rustc to update to 0.1.23. That release contains rust-lang/jobserver-rs#34 by @alexcrichton which on Linux attempts to reserve necessary space in the pipe for Client::new, returning an io::Error for client creation if that reservation fails.

When accessing the jobserver for internal concurrency control, by default rustc attempts to read the path to a pipe from an environment variable. If no jobserver has been configured, it creates its own pipe via the jobserver crate where the new fcntl call was added.

This fallback is likely never exercised in builds driven by cargo or make because they both provide a jobserver to their subprocesses. On Fuchsia, we invoke rustc using ninja, which does not support the jobserver protocol, so every instance of rustc ends up creating its own pipe. I think based on reading the code that each "fallback pipe" needs the default 2 pages, so we'd be able to run 512 rustcs if no other processes for the user were creating pipes. In practice our build machines are quite large and they also run some other processes which create pipes, so we seem to be able to hit this limit pretty often.

We're considering workarounds including a static limit on concurrent rustc invocations in our build or implementing the jobserver protocol ourselves somehow, but it does seem like it should be possible to invoke rustc concurrently many times without make's protocol. Both of these solutions would come with significant downsides, though.

Is it possible for rustc to work without access to a jobserver client? Could the fallback case be made to use an in-process stub instead of creating an actual jobserver client?

Alternatively, it seems that even though the pipe resizing is failing that the pipe creation itself has not been previously failing in our build. Perhaps we could find a way for rustc to ignore the error from the resizing since it only ever asks for space to schedule 32 tasks in the fallback configuration?

These are the options that came up in a brief internal discussion but other ideas would be welcome!

cc @tmandry @jamuraa

@anp anp added the C-bug Category: This is a bug. label Aug 16, 2021
@ehuss
Copy link
Contributor

ehuss commented Aug 16, 2021

The small crater run we just did here ran into a large number of errors due to this. @alexcrichton What do you think about temporarily reverting the jobserver change until we have a better handle on how to approach this issue?

@alexcrichton
Copy link
Member

Oh dear, thanks for cc'ing me on this!

For some background, the motivation for this change to the jobserver crate was rust-lang/cargo#9739 where on Linux a single-page-backed pipe can quickly deadlock based on . This means that previously all the pipes on your system were theoretically susceptible to deadlocking. Now that being said the deadlocking happens when a token is recycled 4096 times at minimum, and for rustc tokens correspond basically to units of work on codegen units and you're unlikely to ever get close to 4096 codegen units. This means that for rustc's use case specifically it was highly unlikely to actually hit this deadlock in practice.

Now as for solving this bug, I think there's a few possible routes:

  • For rustc's own personal use case, it could choose to ignore this situation and simply forge ahead and ignore the fcntl failure.
  • For rustc's own personal use case, there's no need for process inheritance of jobserver fds, which means that a special in-process variant of the jobserver can be created for rustc if it ends up creating one.
  • Rustc could opt to not use the jobserver at all for your use case. If each rustc generates its own jobserver it's likely you probably don't care much about limiting parallelism.
  • You could update to the latest-and-greatest Linux kernel which should have a patch that fixes this behavior (the default "limited" pipe size is now 2 pages instead of 1 page)
  • You could increase the pipe limits on your build machine so pipes more frequently get the 16-page default backing buffer.
  • We could just back out the changes (as @ehuss mentions) and not really ever put them back in. This is fixed in a future version of Linux and the problem seems very rare otherwise today. If this is coming up enough then it's not worth the hypothetical protection against deadlock it gives.

I had most of this comment written up before I saw the crater report, but I think that for now I'll revert this and we can just deal with it later if we really need to.

@Hello71
Copy link

Hello71 commented Aug 17, 2021

You could update to the latest-and-greatest Linux kernel which should have a patch that fixes this behavior (the default "limited" pipe size is now 2 pages instead of 1 page)

this isn't quite right. the pipe fix was backported back to 4.4 (the earliest kernel still supported upstream), so you need only upgrade to the latest kernel on your branch, which you should do anyways (see gkh "All users must upgrade").

@Hello71
Copy link

Hello71 commented Aug 17, 2021

For rustc's own personal use case, there's no need for process inheritance of jobserver fds, which means that a special in-process variant of the jobserver can be created for rustc if it ends up creating one.

this is probably a good idea. pipes are much slower than shm futexes; the only reason to use pipes as semaphores is to maintain compatibility with GNU make or with platforms that don't support shared semaphores.

@anp
Copy link
Member Author

anp commented Aug 17, 2021

Yeah if it's showing up in crater then my assumption about being limited to use outside of cargo might have been incorrect.

Appreciate the revert, I also filed #88117 as a possible follow-up that might remove this scaling factor entirely.

EDIT: jinx!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants