OSX linker segfaulting on Travis #38878

Closed
alexcrichton opened this issue Jan 6, 2017 · 13 comments
Labels
A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason)
O-macos Operating system: macOS

Comments

@alexcrichton
Member

I've seen this quite a lot recently

Example logs:

clang: error: unable to execute command: Segmentation fault: 11
clang: error: linker command failed due to signal (use -v to see invocation)

Example Travis runs:

I'm opening a tracking issue so we can collect some more logs and hopefully draw conclusions from them at some point. Until then I'm not really sure how we'd deal with this...

@alexcrichton alexcrichton added O-macos Operating system: macOS A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) labels Jan 6, 2017
@Mark-Simulacrum
Member

Is there a way to collect the coredump from the segfault so we could attempt to track down the reason behind the segfault? Perhaps we could at least pass -v to clang so we could try to reproduce locally?

@alexcrichton
Member Author

@Mark-Simulacrum your guess is as good as mine!

@sfackler
Member

If you set ulimit -c unlimited, the core dump will end up in /cores.
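
Once core dumps are enabled, a CI step could collect them and ask lldb for backtraces. Below is a minimal sketch in Rust, not the script the project actually used. It assumes core files are named `core.<pid>` under `/cores`, and that a command file (called `cmds` here) contains `bt all`. The helper names `core_files` and `lldb_backtrace_args` are hypothetical; the lldb options used are `-c` (core file), `-b` (batch mode) and `-s` (source a command file).

```rust
use std::ffi::OsString;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Lists files in `dir` whose names start with `core.`, sorted by path.
/// The `core.<pid>` naming is an assumption about how macOS names dumps.
pub fn core_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_core = path
            .file_name()
            .and_then(|name| name.to_str())
            .map_or(false, |name| name.starts_with("core."));
        if is_core {
            found.push(path);
        }
    }
    found.sort();
    Ok(found)
}

/// Builds the argument list for a batch-mode lldb run over one core file.
/// `cmds` is a hypothetical command file expected to contain `bt all`.
pub fn lldb_backtrace_args(core: &Path, cmds: &Path) -> Vec<OsString> {
    vec![
        OsString::from("-c"),
        core.as_os_str().to_os_string(),
        OsString::from("-b"),
        OsString::from("-s"),
        cmds.as_os_str().to_os_string(),
    ]
}
```

The resulting arguments could be passed to `std::process::Command::new("lldb").args(...)`. Note that `ulimit -c unlimited` has to be in effect in the shell that launches the linker, otherwise no core file is written.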

alexcrichton added a commit to alexcrichton/rust that referenced this issue Jan 12, 2017
This commit attempts to debug the segfaults that we've been seeing on OSX on
Travis. I have no idea what's going on here mostly, but let's try to look at
core dumps and get backtraces to see what's going on. This commit itself is
mostly a complete shot in the dark, I'm not sure if this even works...

cc rust-lang#38878
bors added a commit that referenced this issue Jan 13, 2017
travis: Attempt to debug OSX linker segfaults

bors added a commit that referenced this issue Jan 14, 2017
travis: Attempt to debug OSX linker segfaults

@alexcrichton
Member Author

https://travis-ci.org/rust-lang/rust/jobs/193795162 is the first job where we got a stack trace:

Core file '/cores/core.31933' (x86_64) was loaded.
(lldb) command source -s 0 'cmds'
Executing commands in '/Users/travis/build/rust-lang/rust/cmds'.
(lldb) bt all
* thread #1: tid = 0x0000, 0x00007fffaed8519d libsystem_c.dylib`__cxa_finalize_ranges + 369, stop reason = signal SIGSTOP
  * frame #0: 0x00007fffaed8519d libsystem_c.dylib`__cxa_finalize_ranges + 369

  thread #2: tid = 0x0001, 0x000000010f9fe5b4 dyld`ImageLoaderMachO::findClosestSymbol(mach_header const*, void const*, void const**) + 264, stop reason = signal SIGSTOP
    frame #0: 0x000000010f9fe5b4 dyld`ImageLoaderMachO::findClosestSymbol(mach_header const*, void const*, void const**) + 264
    frame #1: 0x000000010f9f5444 dyld`dladdr + 133
    frame #2: 0x00007fffaeced99c libdyld.dylib`dladdr + 72
    frame #3: 0x0000000100316647 ld`__assert_rtn + 207
    frame #4: 0x00000001003653c4 ld`ld::tool::InputFiles::parseWorkerThread() + 696
    frame #5: 0x00007fffaef07aab libsystem_pthread.dylib`_pthread_body + 180
    frame #6: 0x00007fffaef079f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #7: 0x00007fffaef07221 libsystem_pthread.dylib`thread_start + 13

I wouldn't necessarily call that... illuminating

@Mark-Simulacrum
Member

I wonder if there is a way to print which files we're linking. The linker may be segfaulting on an improperly formatted file or something like that, so knowing the files' names and lengths may help. I think passing -v to clang would be good enough, at least as a start.
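
As a rough illustration of the "names and lengths" idea, here is a small Rust sketch that reports the size of each linker input. It is not taken from rustc. The function name `describe_link_inputs` is hypothetical, and how the list of inputs would be obtained is left open.

```rust
use std::fs;
use std::path::Path;

/// Returns one "path: N bytes" line per linker input, or an "unreadable" note
/// if the file cannot be inspected (a missing input is itself worth knowing about).
pub fn describe_link_inputs<P: AsRef<Path>>(inputs: &[P]) -> Vec<String> {
    inputs
        .iter()
        .map(|input| {
            let path = input.as_ref();
            match fs::metadata(path) {
                Ok(meta) => format!("{}: {} bytes", path.display(), meta.len()),
                Err(err) => format!("{}: unreadable ({})", path.display(), err),
            }
        })
        .collect()
}
```

Logging these lines just before invoking the linker would make it possible to compare a crashing run against a successful one.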

@alexcrichton
Member Author

PRs are always welcome! Unfortunately I don't have any magic up my sleeve to implement something like that.

@alexcrichton
Member Author

Next successful stack trace: https://travis-ci.org/rust-lang/rust/jobs/194499380

Core file '/cores/core.33216' (x86_64) was loaded.
(lldb) command source -s 0 'cmds'
Executing commands in '/Users/travis/build/rust-lang/rust/cmds'.
(lldb) bt all
* thread #1: tid = 0x0000, 0x00007fffbb6ec19d libsystem_c.dylib`__cxa_finalize_ranges + 369, stop reason = signal SIGSTOP
    frame #0: 0x00007fffbb6ec19d libsystem_c.dylib`__cxa_finalize_ranges + 369
* thread #2: tid = 0x0001, 0x00007fffbb786756 libsystem_kernel.dylib`close + 10, stop reason = signal SIGSTOP
    frame #0: 0x00007fffbb786756 libsystem_kernel.dylib`close + 10
    frame #1: 0x0000000106869c10 ld`Snapshot::createSnapshot() + 270
    frame #2: 0x00000001067ac5da ld`__assert_rtn + 98
    frame #3: 0x00000001067fb3c4 ld`ld::tool::InputFiles::parseWorkerThread() + 696
    frame #4: 0x00007fffbb86eaab libsystem_pthread.dylib`_pthread_body + 180
    frame #5: 0x00007fffbb86e9f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #6: 0x00007fffbb86e221 libsystem_pthread.dylib`thread_start + 13

@alexcrichton
Member Author

Well, the pthreads at least explain why it's nondeterministic...

alexcrichton added a commit to alexcrichton/rust that referenced this issue Mar 3, 2017
This is a completely random shot in the dark to help suppress the OSX linker
segfaults being found on rust-lang#38878. The segfault happens apparently during an
assertion in [this source file][1]. That apparently is related to a worker
thread pool for parsing a bunch of object files. Presumably there's some
concurrency bug triggering the segfault?

Poking around the source to see if we could disable this multithreading behavior
didn't turn up many results, but one check in the [file above][1] was related to
`_options.pipelineEnabled()` which seemed suspicious. That in turn is read from
[this file][2] in the `fPipelineFifo` instance variable (if it's non-null).

That instance variable is in turn set from [another file][3] as a result of
`getenv("LD_PIPELINE_FIFO")`. This PR now sets that env var for all builders,
including the OSX ones.

Will this help? I have no idea! But it at least seems related and hopefully
isn't too hard to try out and/or back out.

[1]: https://opensource.apple.com/source/ld64/ld64-274.2/src/ld/InputFiles.cpp.auto.html
[2]: https://opensource.apple.com/source/ld64/ld64-274.2/src/ld/Options.h.auto.html
[3]: https://opensource.apple.com/source/ld64/ld64-274.2/src/ld/Options.cpp.auto.html
@alexcrichton
Member Author

Random attempt to help this: #40243

alexcrichton added a commit to alexcrichton/rust that referenced this issue Mar 10, 2017
This is a last-ditch attempt to help our pain with dealing with rust-lang#38878 on the
bots. A new environment variable is added to the compiler,
`RUSTC_RETRY_LINKER_ON_SEGFAULT`, which will instruct the compiler to
automatically retry the final linker invocation if it looks like the linker
segfaulted (up to 2 extra times).

Unfortunately there have been no successful attempts to debug rust-lang#38878. The only
information seems to be that the linker (e.g. `ld` on OSX) is segfaulting
somewhere in some thread pool implementation. This appears to be spurious as
failed PRs will later merge.

The hope is that this helps the queue keep moving without clogging and delaying
PRs due to rust-lang#38878.
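
The retry behaviour described in this commit message can be sketched as follows. This is an illustrative Rust sketch under stated assumptions, not the code that landed in rustc. The linker run is modelled as a closure returning a hypothetical `LinkAttempt`, the segfault check matches the clang message quoted at the top of this issue, and the check for `RUSTC_RETRY_LINKER_ON_SEGFAULT` itself is left out.

```rust
/// Message printed by clang when the linker dies with SIGSEGV, as quoted in this issue.
pub const SEGFAULT_MARKER: &str =
    "clang: error: unable to execute command: Segmentation fault: 11";

/// Outcome of one linker run (hypothetical type for this sketch):
/// whether it succeeded, plus its combined stdout and stderr.
pub struct LinkAttempt {
    pub success: bool,
    pub output: String,
}

/// Heuristic only: true if the output contains the segfault marker.
pub fn looks_like_segfault(output: &str) -> bool {
    output.contains(SEGFAULT_MARKER)
}

/// Runs `link` once, then retries up to `max_extra` more times, but only while
/// the failure looks like a linker segfault. Returns the last attempt and the
/// total number of runs performed.
pub fn link_with_retry<F>(mut link: F, max_extra: usize) -> (LinkAttempt, usize)
where
    F: FnMut() -> LinkAttempt,
{
    let mut runs = 0;
    loop {
        let attempt = link();
        runs += 1;
        let retryable = !attempt.success && looks_like_segfault(&attempt.output);
        if !retryable || runs > max_extra {
            return (attempt, runs);
        }
    }
}
```

With `max_extra` set to 2, a linker that segfaults every time is run three times in total, while an ordinary link error (for example an undefined symbol) is reported after the first run without any retry.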
bors added a commit that referenced this issue Mar 10, 2017
rustc: Support auto-retry linking on a segfault

arielb1 pushed a commit to arielb1/rust that referenced this issue Mar 10, 2017
… r=arielb1

rustc: Support auto-retry linking on a segfault

alexcrichton added a commit to alexcrichton/rust that referenced this issue Mar 11, 2017
… r=arielb1

rustc: Support auto-retry linking on a segfault

bors added a commit that referenced this issue Mar 11, 2017
rustc: Support auto-retry linking on a segfault

@alexcrichton
Member Author

Looks like #40422 did the trick: we haven't seen this in ~2 weeks, so closing.

bors added a commit that referenced this issue Nov 18, 2017
Fix #38878 again: restart linker when seeing SIGBUS in addition to SIGSEGV.

In #45985 (comment) we see the linker crash due to a Bus Error (signal 10) on macOS. The error was not caught by #40422 since that PR only handles Segmentation Fault (signal 11). The crash log indicates the problem is the same as #38878, so we just amend #40422 to include SIGBUS as well.

(Additionally, modified how the crash logs are printed so that irrelevant logs are truly filtered out.)
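
One way to treat both signals as retryable is to read the signal number from clang's message rather than matching a single fixed string. The Rust sketch below is illustrative and is not the change that was merged. It assumes a Bus Error is reported in the same `...: <description>: <number>` shape as the segfault line quoted at the top of this issue, and it hard-codes the macOS signal numbers mentioned above (they differ on other platforms; SIGBUS is 7 on Linux, for instance).

```rust
/// Signal numbers as reported on macOS in this thread.
const MACOS_SIGBUS: i32 = 10;
const MACOS_SIGSEGV: i32 = 11;

/// Extracts the trailing signal number from a line such as
/// "clang: error: unable to execute command: Segmentation fault: 11".
/// Returns `None` if no such line is present or the number does not parse.
pub fn crash_signal(output: &str) -> Option<i32> {
    const PREFIX: &str = "clang: error: unable to execute command:";
    output
        .lines()
        .filter_map(|line| line.trim().strip_prefix(PREFIX))
        .filter_map(|rest| rest.rsplit(':').next())
        .filter_map(|num| num.trim().parse::<i32>().ok())
        .next()
}

/// True if the linker appears to have died from SIGSEGV or SIGBUS.
pub fn should_retry_link(output: &str) -> bool {
    matches!(
        crash_signal(output),
        Some(MACOS_SIGSEGV) | Some(MACOS_SIGBUS)
    )
}
```

Other signals and ordinary link errors are deliberately not retried, so genuine failures still surface immediately.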