Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror failure #80666

Closed
tannergooding opened this issue Jan 15, 2023 · 21 comments · Fixed by #106040
Closed

tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror failure #80666

tannergooding opened this issue Jan 15, 2023 · 21 comments · Fixed by #106040
Assignees
Labels
area-System.Diagnostics.Tracing blocking-outerloop Blocking the 'runtime-coreclr outerloop' and 'runtime-libraries-coreclr outerloop' runs bug disabled-test The test is disabled in source code against the issue in-pr There is an active PR which will close this issue when it is merged Known Build Error Use this to report build issues in the .NET Helix tab os-linux Linux OS (any supported distro) os-mac-os-x macOS aka OSX
Milestone

Comments

@tannergooding
Copy link
Member

tannergooding commented Jan 15, 2023

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=137937
Build error leg or test failing: Build / coreclr Pri1 Runtime Tests Run osx x64 checked / Send tests to Helix (Unix)
Pull request: #80626

Error Message

Fill the error message using known issues guidance.

{
  "ErrorPattern": "eventsourceerror\\.sh:.*Segmentation fault",
  "BuildRetry": false
}

Known issue validation

Build: 🔎
Result validation: ⚠️ Validation could not be done without an Azure DevOps build URL on the issue. Please add it to the "Build: 🔎" line.
Validation performed at: 7/21/2024 2:00:18 AM UTC

Report

Build Definition Test Pull Request
766419 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd
757010 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd
756145 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #105535
754023 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd
753599 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #105372
752189 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd
751644 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #101143
751444 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #105236
751086 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #105260
750999 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #105284
750809 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd
750696 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #104399
750584 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #104999
750239 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #105030
749625 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd
749490 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #105178
749184 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #105194
749057 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #105127
747806 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #105131
747753 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #105129
747532 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd #105109

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 1 21
@tannergooding tannergooding added blocking-outerloop Blocking the 'runtime-coreclr outerloop' and 'runtime-libraries-coreclr outerloop' runs Known Build Error Use this to report build issues in the .NET Helix tab labels Jan 15, 2023
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jan 15, 2023
@ghost
Copy link

ghost commented Jan 16, 2023

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

Issue Details

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=137937
Build error leg or test failing: Build / coreclr Pri1 Runtime Tests Run osx x64 checked / Send tests to Helix (Unix)
Pull request: #80626

Error Message

Fill the error message using known issues guidance.

{
  "ErrorMessage": "Invalid process id: task_for_pid",
  "BuildRetry": false
}

Report

Build Definition Test Pull Request
138819 dotnet/runtime PayloadGroup0.WorkItemExecution #77540
138583 dotnet/runtime PayloadGroup0.WorkItemExecution #77386
138323 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh
138272 dotnet/runtime PayloadGroup0.WorkItemExecution
138234 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh
138050 dotnet/runtime JIT.1.WorkItemExecution
137951 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh #79709
137934 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh
137930 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh #80639
137894 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh
137820 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh #74314
137671 dotnet/runtime tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh
137613 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution #66551

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
5 13 13
Author: tannergooding
Assignees: -
Labels:

area-System.Diagnostics.Tracing, untriaged, blocking-outerloop, Known Build Error

Milestone: -

@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Jan 17, 2023
@tommcdon tommcdon added this to the 8.0.0 milestone Jan 17, 2023
@tommcdon
Copy link
Member

Test output indicates that the test is crashing with a seg fault and createdump fails either because the process is gone or there is a signing/entitlement issue. @hoyosjs @mikem8361

    tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh [FAIL]
      [createdump] Invalid process id: task_for_pid(15770) FAILED (os/kern) failure (5)
      [createdump] This failure may be because createdump or the application is not properly signed and entitled.
      [createdump] Failure took 0ms
      [createdump] waitpid() returned successfully (wstatus 0000ff00)
      /private/tmp/helix/working/A8400938/w/B79B09BE/e/tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh: line 441: 15770 Segmentation fault: 11  (core dumped) $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"

@hoyosjs
Copy link
Member

hoyosjs commented Jan 19, 2023

It's an entitlement issue. The regex is looking for a string that can hit on any osx coreclr crash. Adjusting the string since the issue is also not OSX specific.

@EgorBo
Copy link
Member

EgorBo commented Jan 24, 2023

Can we disable the test until it's fixed? UPD: #81094

@jkotas jkotas added the disabled-test The test is disabled in source code against the issue label Jan 24, 2023
@jkotas
Copy link
Member

jkotas commented Jun 28, 2024

Stacktrace of the above crash (read of invalid GC handle or GC handle table corruption):

(gdb) bt
#0  0xf768b656 in Object::GetGCSafeMethodTable (this=0x7) at /__w/1/s/src/coreclr/vm/object.h:449
#1  Object::ValidateInner (this=0x7, bDeep=1, bVerifyNextHeader=1, bVerifySyncBlock=1)
    at /__w/1/s/src/coreclr/vm/object.cpp:549
#2  0xf768c86c in Object::Validate (this=0x7, bDeep=1, bVerifyNextHeader=1, bVerifySyncBlock=1)
    at /__w/1/s/src/coreclr/vm/object.cpp:529
#3  OBJECTREF::OBJECTREF (this=0xf6afeab0, pObject=0x7) at /__w/1/s/src/coreclr/vm/object.cpp:1157
#4  0xf777104c in ObjectFromHandle (handle=0xf6c1119c) at /__w/1/s/src/coreclr/vm/gchandleutilities.h:42
#5  MarshalNative::GCHandleInternalGet (handle=0xf6c1119c) at /__w/1/s/src/coreclr/vm/marshalnative.cpp:395

@tommcdon tommcdon added this to the 9.0.0 milestone Jun 28, 2024
@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Jun 28, 2024
@mdh1418
Copy link
Member

mdh1418 commented Jul 17, 2024

culprit.txt
coredump.362.dmp.crashreport.json

The dump grabbed from the sigsegv crash wasn't useful

0:002> k

# Child-SP RetAddr      Call Site00 00000000 00000000     0x72
 
0:002> .lastevent

Last event: 16a.16c: Signal SIGSEGV (Segmentation fault) code SEGV_MAPERR (Address not mapped to object) at 0x7 (first/second chance not available)

  debugger time: Tue Jul 16 16:13:54.909 2024 (UTC - 4:00)
  
 0:002> .exr -1

ExceptionAddress: 00000007

   ExceptionCode: e0534947 (Linux signal)

  ExceptionFlags: 00000000

NumberParameters: 2

   Parameter[0]: 0000000b

   Parameter[1]: 00000001
   
0:002> .ecxr

Unable to get exception context, HRESULT 0x8000FFFF

From the crashreport, the crashing thread 0x16c corresponds to

Thread Id: 0x16c
      Child SP               IP Call Site
 0xf7e19b20 0xf7b27614 libc.so.6!
 0xf7e19b28 0xf7b919c0 libc.so.6!__wait4_time64
 0xf7e19b90 0xf7a26254 libcoreclr.so!PROCCreateCrashDump(std::vector<char const*, std::allocator<char const*> >&, char*, int, bool) at /__w/1/s/src/coreclr/pal/src/thread/process.cpp:2308:22
 0xf7e19bd8 0xf7a27054 libcoreclr.so!PROCCreateCrashDumpIfEnabled at /__w/1/s/src/coreclr/pal/src/thread/process.cpp:2526:9
 0xf7e19c28 0xf79dbaf6 libcoreclr.so!invoke_previous_action(sigaction*, int, siginfo_t*, void*, bool) at /__w/1/s/src/coreclr/pal/src/exception/signal.cpp:422:13
 0xf7e19c60 0xf79db080 libcoreclr.so!sigsegv_handler(int, siginfo_t*, void*) at /__w/1/s/src/coreclr/pal/src/exception/signal.cpp:0:5
 0xf7e19c90 0xf7b371b0 libc.so.6!

which corresponds to waiting for the createdump child process to finish, which according to the logs, returned successfully waitpid() returned successfully (wstatus 00000000) WEXITSTATUS 0 WTERMSIG 0.


From the first native thread, it looks like this crash is occuring during

Task.WaitAll(readerTask, stopTask);
which calls WaitAllCore and WaitAllCoreBlocking below

Thread Id: 0x16a
      Child SP               IP Call Site
 0xff9a56e8 0xf7b27614 libc.so.6!
 0xff9a56f0 0xf7b6347e libc.so.6!
 0xff9a5728 0xf7b6355c libc.so.6!
 0xff9a5738 0xf7b6570a libc.so.6!pthread_cond_wait
 0xff9a57b8 0xf7a18d20 libcoreclr.so!CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) at /__w/1/s/src/coreclr/pal/src/synchmgr/synchmanager.cpp:489:17
 0xff9a5818 0xf7a18430 libcoreclr.so!CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) at /__w/1/s/src/coreclr/pal/src/synchmgr/synchmanager.cpp:0:0
 0xff9a5868 0xf7a1fb92 libcoreclr.so!CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) at /__w/1/s/src/coreclr/pal/src/synchmgr/wait.cpp:0:45
 0xff9a5950 0xf7a203fc libcoreclr.so!WaitForMultipleObjectsEx at /__w/1/s/src/coreclr/pal/src/synchmgr/wait.cpp:204:5
 0xff9a5998 0xf76c50b8 libcoreclr.so!Thread::DoAppropriateWaitWorker(int, void**, int, unsigned int, WaitMode, void*) at /__w/1/s/src/coreclr/vm/threads.cpp:0:12
 0xff9a5a28 0xf76bf7c2 libcoreclr.so!Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*)::$_0::operator()(Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*)::__EEParam*) const at /__w/1/s/src/coreclr/vm/threads.cpp:3058:23
 (inlined by) Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*) at /__w/1/s/src/coreclr/vm/threads.cpp:3073:5
 0xff9a5aa0 0xf77db20c libcoreclr.so!CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) at /__w/1/s/src/coreclr/vm/synch.cpp:0:30
 0xff9a5b00 0xf76c631a libcoreclr.so!Thread::Wait(CLREvent*, int, PendingSync*) at /__w/1/s/src/coreclr/vm/threads.cpp:3875:5
 (inlined by) Thread::Block(int, PendingSync*) at /__w/1/s/src/coreclr/vm/threads.cpp:3829:13
 0xff9a5b18 0xf76bdb4e libcoreclr.so!SyncBlock::Wait(int) at /__w/1/s/src/coreclr/vm/syncblk.cpp:0:0
 0xff9a5bc8 0xf799ac16 libcoreclr.so!Monitor_Wait at /__w/1/s/src/coreclr/classlibnative/bcltype/objectnative.cpp:0:0
 0xff9a5c28 0xef04c1f8 0xef04c1f8
 0xff9a5c88 0xef05e692 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)
 0xff9a5ce8 0xef07f324 System.Threading.Tasks.Task.WaitAllBlockingCore(System.Collections.Generic.List`1<System.Threading.Tasks.Task>, Int32, System.Threading.CancellationToken)
 0xff9a5d20 0xef07edea System.Threading.Tasks.Task.WaitAllCore(System.ReadOnlySpan`1<System.Threading.Tasks.Task>, Int32, System.Threading.CancellationToken)
 0xff9a5d98 0xf624b9ca Tracing.Tests.Common.IpcTraceTest.Validate(Boolean)
 0xff9a5e30 0xf62488c4 Tracing.Tests.Common.IpcTraceTest.RunAndValidateEventCounts(System.Collections.Generic.Dictionary`2<System.String,Tracing.Tests.Common.ExpectedEventCount>, System.Action, System.Collections.Generic.List`1<Microsoft.Diagnostics.NETCore.Client.EventPipeProvider>, Int32, System.Func`2<Microsoft.Diagnostics.Tracing.EventPipeEventSource,System.Func`1<Int32>>, Boolean)
 0xff9a5e68 0xf6248234 Tracing.Tests.EventSourceError.EventSourceError.TestEntryPoint()
 0xff9a5e88 0xf6248042 __GeneratedMainWrapper.Main()

Theres still a discrepancy between IpcTraceTest.cs in dotnet/diagnostics and the version in dotnet/runtime. Is the version in dotnet/diagnostics more resilient, or any other obvious low-hanging improvements we could make to IpcTraceTest.cs in dotnet/runtime to catch/prevent these crashes? @noahfalk @davmason

@davmason davmason removed their assignment Jul 18, 2024
@mdh1418
Copy link
Member

mdh1418 commented Jul 18, 2024

Re-enabled eventsourceerror on non-arm32 unix platforms in #105033

@tommcdon tommcdon removed the disabled-test The test is disabled in source code against the issue label Jul 18, 2024
@jkotas
Copy link
Member

jkotas commented Jul 21, 2024

This crash reproduced in #105127 on osx-x64.

https://dev.azure.com/dnceng-public/public/_build/results?buildId=749057&view=ms.vss-test-web.build-test-results-tab&runId=18949658&resultId=111387&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

Log https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-105127-merge-58c14fbe5f844b8793/tracing/1/console.c63dabf5.log?helixlogtype=result :

  2.1s: Saw new provider 'Microsoft-Windows-DotNETRuntimeRundown'
[createdump] Gathering state for process 18485 
[createdump] Crashing thread 333de signal 11 (000b)
[createdump] Writing crash report to file /cores/coredump.18485.dmp.crashreport.json
[createdump] Crash report successfully written
[createdump] Writing minidump with heap to file /cores/coredump.18485.dmp
[createdump] Written 391995656 bytes (95702 pages) to core file
[createdump] Target process is alive
  4.7s: Stopping stream processing
  4.7s: Dropped 0 events
[createdump] Dump successfully written in 2562ms
Expected: 100
Actual: 139
END EXECUTION - FAILED

It is the same problem as in #80666 (comment) (read of invalid GC handle or GC handle table corruption).

@jkotas jkotas added blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' disabled-test The test is disabled in source code against the issue os-mac-os-x macOS aka OSX os-linux Linux OS (any supported distro) and removed os-mac-os-x macOS aka OSX labels Jul 21, 2024
@JulieLeeMSFT JulieLeeMSFT added the blocking-outerloop Blocking the 'runtime-coreclr outerloop' and 'runtime-libraries-coreclr outerloop' runs label Jul 22, 2024
@JulieLeeMSFT
Copy link
Member

failed in outerloop: runtime-coreclr outerloop/20240721.3

Failed configs:

R2R-CG2 osx x64 Checked no_tiered_compilation @ OSX.1200.Amd64.Open
coreclr osx x64 Checked no_tiered_compilation @ OSX.1200.Amd64.Open
- tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd

Log

@mdh1418
Copy link
Member

mdh1418 commented Jul 23, 2024

Based solely on the console logs from the recent failures on OSX, it looks like the readerTask does complete, so it's just stopTask that is causing the crash,

.

@davmason , do you happen to recall any recent modifications to the DiagnosticsClient EventPipeSession that could be causing crashes for specifically the eventsourceerror test?

@stephentoub stephentoub removed the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Jul 23, 2024
@jkotas
Copy link
Member

jkotas commented Jul 24, 2024

do you happen to recall any recent modifications to the DiagnosticsClient EventPipeSession that could be causing crashes for specifically the eventsourceerror test?

This issue was opened 1.5 years ago. Based on the symptoms, it is an intermittent crash that can come and go due to unrelated changes in timing, etc.

@jkotas
Copy link
Member

jkotas commented Jul 25, 2024

The crash happens at this line:

EventPipeEventProvider _this = (EventPipeEventProvider)GCHandle.FromIntPtr((IntPtr)callbackContext).Target!;

We are crashing because the GCHandle was freed already. We are freeing the GCHandle here:

.

I do not see anything in the implementation of EventPipeInternal.DeleteProvider to guarantee there are no callback in flight when it returns. In fact,

if (enabled ()) {
// Save the provider until the end of the tracing session.
ep_provider_set_delete_deferred (provider, true);
} else {
config_delete_provider (ep_config_get (), provider);
}
explicit defers deleting the provider until the end of the tracing session. I think we may need to defer deleting of the GCHandle in a similar way.

@mdh1418 Does this make sense - could you please take it from here?

@jkotas jkotas added the bug label Jul 25, 2024
@mdh1418
Copy link
Member

mdh1418 commented Jul 26, 2024

Thanks for pinpointing the crash @jkotas!

After some offline discussions with @noahfalk , I'm seeing that ETW has the behavior of blocking during EventProvider.Dispose() until all already in-flight callbacks are finished, so if any callbacks happen to be initiated ⁠after the EventListenersLock in EventProvider.Dispose(), then this p/invoke will block here at ⁠Interop.Advapi32.EventUnregister until the ETW command releases the ETW lock as described in ⁠the comment.

Would parity between ETW and EventPipe in that aspect be better than deferring deleting the GC handle, as in having EventPipeEventProviders Disposal blocked by in flight callbacks? Perhaps for EventPipe, we could invoke Unregister under the EventListenersLock since the original reason for not doing so was because of deadlocking with the ETW lock + EventListenersLock. (edit: since having managed locks do different behaviors for ETW/EventPipe would bring inconsistency).

@jkotas
Copy link
Member

jkotas commented Jul 26, 2024

I do not have an opinion. If you think that synchronous blocking would be better, it is fine with me.

EventPipe, we could invoke Unregister under the EventListenersLock

I think that the synchronous blocking would have to be in the unmanaged event pipe code. I do not see how you can use locks on the managed side to fix this.

@noahfalk
Copy link
Member

Yeah, I want to do it unmanaged code because I want to match behavior between EtwEventProvider.Unregister() and EventPipeProvider.Unregister(). Both of them invoke a p/invoke to do the unregister work and I am recommending we give those p/invokes identical blocking behavior.

@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Aug 1, 2024
@mdh1418 mdh1418 removed the in-pr There is an active PR which will close this issue when it is merged label Aug 1, 2024
@mdh1418
Copy link
Member

mdh1418 commented Aug 1, 2024

The attempt to introduce a new lock to coordinate in-flight callbacks with provider disposal broke concurrent callbacks. Still looking to block provider disposal with a weaker blocking behavior

@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Aug 6, 2024
@mdh1418
Copy link
Member

mdh1418 commented Aug 6, 2024

Signal wait/set blocking implementation: #106040

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Diagnostics.Tracing blocking-outerloop Blocking the 'runtime-coreclr outerloop' and 'runtime-libraries-coreclr outerloop' runs bug disabled-test The test is disabled in source code against the issue in-pr There is an active PR which will close this issue when it is merged Known Build Error Use this to report build issues in the .NET Helix tab os-linux Linux OS (any supported distro) os-mac-os-x macOS aka OSX
Projects
None yet