beta 1.33 seems to break tarpaulin on multithreading #58104

Closed
elinorbgr opened this issue Feb 3, 2019 · 17 comments
Labels
P-high High priority regression-from-stable-to-beta Performance or correctness regression from stable to beta. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@elinorbgr
Contributor

Consider the following minimal project:

Cargo.toml

[package]
name = "tarpotest"
version = "0.1.0"
authors = ["Me <me@me.me>"]
edition = "2018"

[dependencies]
futures-executor = "0.2.1"

src/lib.rs

#[test]
pub fn a() {
    futures_executor::ThreadPool::new();
}

#[test]
pub fn b() {
    futures_executor::ThreadPool::new();
}

Install tarpaulin from crates.io:

RUSTFLAGS="--cfg procmacro2_semver_exempt" cargo install cargo-tarpaulin

And run it on this small project using either stable (rustc 1.32.0 (9fda7c223 2019-01-16)) or beta (rustc 1.33.0-beta.5 (1045131c1 2019-01-31)) rustc:

  • cargo +stable tarpaulin: works as expected
  • cargo +beta tarpaulin: occasionally (in roughly 15% of runs), tarpaulin errors out because the test executable segfaulted:
Error: Failed to run tests! Error: A segfault occurred while executing tests

I have no idea what is at fault here, but apparently something changed in the latest beta of rustc that broke the way tarpaulin does its instrumentation. In more detail: the segfault appears to occur only if at least 2 tests in the same binary involve spawning threads, and it seems to become more likely the more threads are spawned.

See xd009642/tarpaulin#190 for full details.
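For reference: if the trigger really is just two #[test]s that each spawn threads, a dependency-free variant along the following lines should reproduce the same behaviour using only std::thread (untested assumption; the thread count of 4 is arbitrary):

use std::thread;

// Spawn a handful of short-lived threads and join them, mimicking what
// futures_executor::ThreadPool::new() does when it starts its workers.
fn spawn_some_threads() {
    let handles: Vec<_> = (0..4)
        .map(|_| thread::spawn(|| { /* start and immediately exit */ }))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}

#[test]
pub fn a() {
    spawn_some_threads();
}

#[test]
pub fn b() {
    spawn_some_threads();
}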

@Mark-Simulacrum Mark-Simulacrum added I-nominated T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. regression-from-stable-to-beta Performance or correctness regression from stable to beta. labels Feb 13, 2019
@pnkfelix
Member

Visiting for triage. P-high for initial investigation; assigning that task to @nagisa.

@pnkfelix pnkfelix added P-high High priority and removed I-nominated labels Feb 14, 2019
@nagisa
Member

nagisa commented Feb 17, 2019

This will be hard for me to bisect reliably, because alongside the SIGSEGV failures I also get

[ERROR tarpaulin] Failed to get test coverage! Error: Failed to run tests! Error: Error occurred when handling ptrace event: ESRCH: No such process

Bisection says that the SIGSEGV began in 2fba17f (#56837). cc @nikomatsakis @arielb1

The callstack looks like this:

#0  0x0000000000414e4f in futures_executor::thread_pool::ThreadPoolBuilder::create (self=0x7fffffffc170)
    at /home/nagisa/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-executor-0.2.1/src/thread_pool.rs:262
#1  0x0000000000413d6f in futures_executor::thread_pool::ThreadPool::new ()
    at /home/nagisa/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-executor-0.2.1/src/thread_pool.rs:78
#2  0x000000000040424d in tarpotest::b () at src/lib.rs:8
#3  0x0000000000404a4a in tarpotest::b::{{closure}} () at src/lib.rs:7
#4  0x00000000004047ae in core::ops::function::FnOnce::call_once ()
    at /rustc/7164a9f151a56316a382d8bc2b15ccf373e129ca/src/libcore/ops/function.rs:231
#5  0x000000000046105f in test::run_test::{{closure}} () at src/libtest/lib.rs:1475
#6  core::ops::function::FnOnce::call_once ()
    at /rustc/7164a9f151a56316a382d8bc2b15ccf373e129ca/src/libcore/ops/function.rs:231
#7  <F as alloc::boxed::FnBox<A>>::call_box ()
    at /rustc/7164a9f151a56316a382d8bc2b15ccf373e129ca/src/liballoc/boxed.rs:734
#8  0x00000000004cdbfa in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:92
#9  0x000000000047f188 in std::panicking::try ()
    at /rustc/7164a9f151a56316a382d8bc2b15ccf373e129ca/src/libstd/panicking.rs:276
#10 std::panic::catch_unwind () at /rustc/7164a9f151a56316a382d8bc2b15ccf373e129ca/src/libstd/panic.rs:388
#11 test::run_test::run_test_inner::{{closure}} () at src/libtest/lib.rs:1430
#12 0x000000000047eaad in test::run_test::run_test_inner () at src/libtest/lib.rs:1452
#13 0x000000000047d22c in test::run_test () at src/libtest/lib.rs:1471
#14 0x0000000000475c8d in test::run_tests () at src/libtest/lib.rs:1150
#15 test::run_tests_console () at src/libtest/lib.rs:957
#16 0x000000000046d764 in test::test_main () at src/libtest/lib.rs:290
#17 0x000000000046dfb2 in test::test_main_static () at src/libtest/lib.rs:324
#18 0x0000000000404276 in tarpotest::main ()
#19 0x0000000000404ad0 in std::rt::lang_start::{{closure}} ()
    at /rustc/7164a9f151a56316a382d8bc2b15ccf373e129ca/src/libstd/rt.rs:64
#20 0x00000000004bbfb3 in std::rt::lang_start_internal::{{closure}} () at src/libstd/rt.rs:49
#21 std::panicking::try::do_call () at src/libstd/panicking.rs:297
#22 0x00000000004cdbfa in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:92
#23 0x00000000004bcbc6 in std::panicking::try () at src/libstd/panicking.rs:276
#24 std::panic::catch_unwind () at src/libstd/panic.rs:388
#25 std::rt::lang_start_internal () at src/libstd/rt.rs:48
#26 0x0000000000404aa9 in std::rt::lang_start (main=0x404260 <tarpotest::main>, argc=2, argv=0x7fffffffdc08)
    at /rustc/7164a9f151a56316a382d8bc2b15ccf373e129ca/src/libstd/rt.rs:64
#27 0x00000000004042ad in main ()

I won’t be able to look into it much more at the moment.

@nagisa nagisa removed their assignment Feb 17, 2019
@arielb1
Contributor

arielb1 commented Feb 17, 2019

I suppose I'll have to investigate this then. Hope I can find enough time.

@pnkfelix
Member

Triage: speculatively assigning to @arielb1. They should hopefully unassign themselves if they are unable to look into this.

@Mark-Simulacrum
Member

Realistically I expect this to reach stable as a regression, because we have ~4 days left before we need to produce the stable artifacts.

@arielb1
Contributor

arielb1 commented Feb 21, 2019

Had a busy stretch; will try to take a quick look at this.

@arielb1
Contributor

arielb1 commented Feb 22, 2019

This appears to be triggered by the breakpoints tarpaulin places, which makes me suspect there is some race condition here that we somehow trigger in futures. This doesn't sound good.

@arielb1
Contributor

arielb1 commented Feb 22, 2019

Looking at the backtrace

0x7fffffffcb40: 0x007d2210      0x00000000      0x00000001      0x00000000
0x7fffffffcb50: 0xffffcff0      0x00000000      0xffffcff0      0x00000000
0x7fffffffcb60: 0xffffcf90      0x00007fff      0x00000000      0x00000000
0x7fffffffcb70: 0x007d0910      0x00000000      0x00000000      0x00000000

   0x00000000004196ca <+90>:    mov    %rdi,0xb8(%rsp)  
   ...
   0x00000000004199c7 <+855>:   mov    0xb8(%rsp),%rcx                                                        
=> 0x00000000004199cf <+863>:   mov    %rax,0x8(%rcx)  

#0  0x00000000004199cf in futures_executor::thread_pool::ThreadPoolBuilder::create (self=0x7fffffffcf90)
    at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-executor-0.2.1/src/thread_pool.rs:262

(gdb) x/8x $rcx
0xffffcff0:     Cannot access memory at address 0xffffcff0

Something is smashing the stack of ThreadPoolBuilder::create: %rcx should contain the address of the return value, which is 0x7fffffffcff0, but it contains only 0xffffcff0 (the upper 32 bits are gone).
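A quick sanity check of that truncation, just the arithmetic: the observed value is exactly the expected pointer with its upper 32 bits dropped, as if it had passed through a 32-bit register write.

// The two constants are taken from the gdb session above; nothing here is
// rustc- or tarpaulin-specific, it only demonstrates the 32-bit truncation.
fn main() {
    let expected: u64 = 0x7fffffffcff0; // where %rcx should point (the return slot)
    let observed: u64 = 0xffffcff0;     // what %rcx actually contains
    assert_eq!(expected & 0xffff_ffff, observed); // only the low 32 bits survived
    println!("{:#x} truncated to 32 bits is {:#x}", expected, observed);
}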

@arielb1
Contributor

arielb1 commented Feb 22, 2019

This is getting weird. The %rdi passed to the call is not pointing to the correct place???

Similar crashes:
one crash where the stack is not 16-byte aligned, apparently starting from tarpotest::b::{{closure}}, leading to movaps crashes (how is this happening?).

@arielb1
Contributor

arielb1 commented Feb 22, 2019

I smell a bug in tarpaulin itself, in how it handles breakpoint interrupts from concurrent threads.

   0x0000000000403420 <+0>:     sub    $0x18,%rsp
   0x0000000000403424 <+4>:     int3   
   0x0000000000403425 <+5>:     mov    %esp,%edi -- looks suspicious, given that %rdi is truncated
   0x0000000000403427 <+7>:     callq  *0x3c7cdb(%rip)        # 0x7cb108
=> 0x000000000040342d <+13>:    mov    %rsp,%rdi
   0x0000000000403430 <+16>:    callq  0x4036b0 <core::ptr::real_drop_in_place>
   0x0000000000403435 <+21>:    add    $0x18,%rsp
   0x0000000000403439 <+25>:    retq   

Dump of assembler code for function tarpotest::b::{{closure}}:
   0x0000000000403860 <+0>:     push   %rax
   0x0000000000403861 <+1>:     mov    %rdi,(%rsp)
   0x0000000000403865 <+5>:     callq  0x403420 <tarpotest::b>
   0x000000000040386a <+10>:    callq  0x403880 <test::assert_test_result>
   0x000000000040386f <+15>:    pop    %rax
   0x0000000000403870 <+16>:    retq   
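For readers following the traces: a small sketch of what the "writing X>Y to [403420]" lines appear to mean, assuming tarpaulin patches a single 0xcc (int3) byte into the 8-byte word at the breakpoint address (an assumption based on these dumps, not on tarpaulin's source). Patching offset 4 of the word at 0x403420 turns 48 89 e7 (mov %rsp,%rdi) into cc 89 e7, i.e. int3 followed by mov %esp,%edi, which is exactly the truncated-%rdi instruction in the disassembly above.

fn main() {
    // Little-endian word at 0x403420 once restored: 48 83 ec 18 48 89 e7 ff
    // = sub $0x18,%rsp ; mov %rsp,%rdi ; first byte of the following call.
    let original: u64 = 0xffe7894818ec8348;
    // Overwrite byte 4 (the 0x48 of `mov %rsp,%rdi`) with 0xcc (int3).
    let patched = (original & !(0xffu64 << 32)) | (0xccu64 << 32);
    assert_eq!(patched, 0xffe789cc18ec8348); // the word shown while the breakpoint is armed
    println!("original {:#018x} -> patched {:#018x}", original, patched);
}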

@arielb1
Contributor

arielb1 commented Feb 22, 2019

It appears that in bad traces, the SIGTRAP signal is somehow never received for the second breakpoint (at 0x403820) despite that byte being 0xcc. That breakpoint is hit only about 5 instructions after the first one (at 0x403860), which feels weird and is hard to explain as a miscompilation.

bad trace:

.wait status=Stopped(Pid(16463), SIGTRAP)
%rip=403861, at [403420] is ffe789cc18ec8348, rsp=7fffffffd018 rdi=7fffffffd028
wait status=PtraceEvent(Pid(16465), SIGTRAP, 6)
wait status=Stopped(Pid(16463), SIGTRAP)                                                                                                                                                                     
%rip=403861, at [403420] is ffe789cc18ec8348, rsp=7fffffffd010 rdi=7fffffffd028
writing fbb6e8243c894850>fbb6e8243c894850 to [403860]                         
wait status=Exited(Pid(16465), 0)                                                                   
wait status=PtraceEvent(Pid(16467), SIGTRAP, 6)                              
wait status=PtraceEvent(Pid(16468), SIGTRAP, 6)     
wait status=Exited(Pid(16467), 0)                                            
wait status=Exited(Pid(16468), 0)                                      
wait status=PtraceEvent(Pid(16469), SIGTRAP, 6)                              
wait status=PtraceEvent(Pid(16466), SIGTRAP, 6)  
wait status=PtraceEvent(Pid(16470), SIGTRAP, 6)                               
wait status=PtraceEvent(Pid(16464), SIGTRAP, 6)        
wait status=Exited(Pid(16469), 0)                                              
wait status=PtraceEvent(Pid(16471), SIGTRAP, 6)               
wait status=Exited(Pid(16470), 0)                                         
wait status=Exited(Pid(16471), 0)          
wait status=Exited(Pid(16466), 0)       
wait status=Exited(Pid(16464), 0)
wait status=PtraceEvent(Pid(16463), SIGTRAP, 3)                                      
wait status=Stopped(Pid(16473), SIGSTOP)         
wait status=PtraceEvent(Pid(16463), SIGTRAP, 3)     
wait status=Stopped(Pid(16474), SIGSTOP)        
wait status=Stopped(Pid(16472), SIGSTOP)                                
wait status=PtraceEvent(Pid(16463), SIGTRAP, 3)                                      
wait status=Stopped(Pid(16475), SIGSTOP)                                       
wait status=PtraceEvent(Pid(16463), SIGTRAP, 3)  
wait status=Stopped(Pid(16476), SIGSTOP)                                     
wait status=PtraceEvent(Pid(16463), SIGTRAP, 3)                   
wait status=PtraceEvent(Pid(16463), SIGTRAP, 3)                           
wait status=Stopped(Pid(16478), SIGSTOP)
wait status=PtraceEvent(Pid(16463), SIGTRAP, 3)
wait status=Stopped(Pid(16479), SIGSTOP)            
wait status=Stopped(Pid(16463), SIGSEGV)       
[ERROR tarpaulin] Failed to get test coverage! Error: Failed to run tests! Error: A segfault occurred while executing tests

good trace:

.wait status=Stopped(Pid(16438), SIGTRAP)                                                                                                                                                                    
breakpoint at rip=403861                                                                                                                                                                                     
%rip=value is 403861, at [403420] is ffe789cc18ec8348, rsp=7fffffffd018 rdi=7fffffffd028                                                                                                                        
writing fbb6e8243c8948cc>fbb6e8243c894850 to [403860]                                                                                                                                                         
wait status=PtraceEvent(Pid(16440), SIGTRAP, 6)                                                                                                                                                              
wait status=Stopped(Pid(16438), SIGTRAP)                                                                                                                                                                     
breakpoint at rip=403861                                                                                                                                                                                     
%rip=403861, at [403420] is ffe789cc18ec8348, rsp=7fffffffd010 rdi=7fffffffd028                                                                                                                                                                             
writing fbb6e8243c894850>fbb6e8243c894850 to [403860]
wait status=PtraceEvent(Pid(16442), SIGTRAP, 6)                                                                                                                                                              
wait status=Stopped(Pid(16438), SIGTRAP)                                                                                                                                                                     
breakpoint at rip=403425                                                                                                                                                                                     
%rip=403425, at [403420] is ffe789cc18ec8348, rsp=7fffffffcff0 rdi=7fffffffd028                                                                                          
writing ffe789cc18ec8348>ffe7894818ec8348 to [403420]                                                                                                                                                         
wait status=Stopped(Pid(16438), SIGTRAP)                                                                                                                                                                     
breakpoint at rip=403427
...
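For context, here is a conceptual sketch of the usual ptrace breakpoint protocol, not tarpaulin's actual code, with a plain byte buffer standing in for the tracee's text segment. The point is the window it opens: while the breakpoint byte is temporarily restored so the stopped thread can be single-stepped over the real instruction, any other thread reaching the same address runs straight through without raising SIGTRAP; and if the tracer fails to rewind a stopped thread's RIP, that thread resumes one byte into the instruction stream, which is the kind of corruption seen in the backtraces above.

const INT3: u8 = 0xcc;

// One software breakpoint: the address it lives at and the byte it replaced.
struct Breakpoint {
    addr: usize,
    saved: u8,
}

// Arm: save the original byte and write int3 in its place
// (a real tracer would do this via PTRACE_PEEKTEXT / PTRACE_POKETEXT).
fn arm(text: &mut [u8], addr: usize) -> Breakpoint {
    let saved = text[addr];
    text[addr] = INT3;
    Breakpoint { addr, saved }
}

// Disarm: put the original byte back so the instruction can be re-executed
// (the tracer must also rewind the stopped thread's RIP by one byte).
fn disarm(text: &mut [u8], bp: &Breakpoint) {
    text[bp.addr] = bp.saved;
}

fn main() {
    // Toy "text segment": 48 89 e7 stands in for `mov %rsp,%rdi`.
    let mut text = vec![0x48, 0x89, 0xe7];

    let bp = arm(&mut text, 0);
    assert_eq!(text[0], INT3); // a thread executing this byte traps with SIGTRAP

    disarm(&mut text, &bp); // serving the first thread's trap...
    assert_eq!(text[0], 0x48); // ...but a second thread arriving now runs through silently

    // Re-arming after the single-step closes the window again.
    let _bp = arm(&mut text, 0);
    assert_eq!(text[0], INT3);
}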

@arielb1
Contributor

arielb1 commented Feb 22, 2019

Reproduction Status

I can reproduce the segfault on 244b05d, which is before 2fba17f (#56837), so that bisection was wrong. I used tarpaulin f95f717222e708f6bf0f7c5d2ee0b8f63a9f5c7e (from Jan 21).

This also doesn't quite feel like a miscompilation bug to me, but rather like some ptrace oddity in tarpaulin. It would be nice if someone who understands ptrace could take a look at this.

@nagisa
Member

nagisa commented Feb 23, 2019

@arielb1 weird — I bisected twice, running the reproducer 32 times at each step on each bors merge, and it pointed to the same commit both times. And I was already thinking that the PR looked very unlikely to be the actual cause.

I bisected again, twice, using prebuilt nightlies and running the reproducer 64 times at each step, and it points to the range 2018-12-08 to 2018-12-14 – well before your PR landed. Sorry for the false alarm!

I will run bisection on merges between 2018-12-08 and 2018-12-14 now.
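A minimal sketch of that "run the reproducer N times per step" check; the project path, run count, and pass/fail rule here are illustrative assumptions, not the script actually used:

use std::process::Command;

fn main() {
    let runs = 64;
    let mut failures = 0;
    for _ in 0..runs {
        // Assumes the tarpotest reproducer from the top of this issue lives in ./tarpotest
        // and that cargo-tarpaulin is installed for the toolchain under test.
        let status = Command::new("cargo")
            .arg("tarpaulin")
            .current_dir("tarpotest")
            .status()
            .expect("failed to spawn cargo tarpaulin");
        if !status.success() {
            failures += 1;
        }
    }
    println!("{}/{} runs failed", failures, runs);
    // Treat the toolchain as "bad" if any run failed; at a ~15% per-run failure
    // rate, 64 clean runs make a false "good" verdict very unlikely.
    std::process::exit(if failures > 0 { 1 } else { 0 });
}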

@nagisa
Member

nagisa commented Feb 23, 2019

Bisection hit 3499575 (#56243). That change seems significantly more relevant; it is also the kind of change that has broken multiple things in the past, mostly by exposing invalid assumptions in users' code, which matches @arielb1's investigation results and intuition.

@arielb1
Contributor

arielb1 commented Feb 23, 2019

In that case, there is probably nothing actionable on our side.

@pnkfelix
Member

pnkfelix commented Feb 28, 2019

Nominating for discussion at the T-compiler meeting. My predisposition is to close this bug as non-actionable.

@pnkfelix
Member

pnkfelix commented Apr 4, 2019

Closing as non-actionable.

@pnkfelix pnkfelix closed this as completed Apr 4, 2019