Use createdump to collect crash dumps where possible in runtime #65422
Comments
I recommend setting the following env vars:
This enables a heap dump which has everything needed to diagnose most managed and native problems. The path can also contain these special name formatting chars:
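The variable list itself did not survive in this copy of the thread. As a rough sketch only, based on the three setting names given in the issue description (DbgEnableMiniDump, DbgMiniDumpName, DbgMiniDumpType) and the "Type 2" mentioned in the reply, a runner script might export something like the following. The `DOTNET_` prefix (older runtimes use `COMPlus_`), the `DUMP_DIR` variable, the file name pattern, and the use of `%p` as the process ID placeholder are all assumptions here, not values taken from the original comment.

```shell
# Sketch under stated assumptions; not the exact settings from the original comment.
# DUMP_DIR is a hypothetical variable naming the folder that the CI system uploads.
DUMP_DIR="${DUMP_DIR:-${TMPDIR:-/tmp}/coredumps}"
mkdir -p "$DUMP_DIR"

# Ask the runtime to run createdump when the process crashes.
export DOTNET_DbgEnableMiniDump=1
# 2 selects a heap dump (assumed mapping: 1 = mini, 2 = heap, 3 = triage, 4 = full).
export DOTNET_DbgMiniDumpType=2
# Assumed name template; %p is expected to expand to the crashing process ID.
export DOTNET_DbgMiniDumpName="$DUMP_DIR/coredump.%p.dmp"
```

Setting these in the wrapper only when they are not already present would let individual test runs override the dump type.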
Type 2 was still somewhat big when we looked at it. It's the best fidelity, but it definitely can get big, and we need to improve the doc on debugging coredumps that we include.
They are not going to be as big as a full or system dump (especially on macOS). The other choices aren't that great. The "normal" and "triage" dump types only have enough for triaging a problem: only the "pe", "clrstack" and "clrthreads" commands will work for managed code. There should be enough for native stacks and threads, though.
We often need full dumps to investigate crashes that only happen intermittently in the CI. Should the problem rather be solved by throttling the dump uploads? If a PR generates many dumps, or if many PRs generate the same dump, skip uploading them. The system should be designed to handle and gracefully recover from situations where we suddenly end up with a large volume of crash dumps. There are many ways we can end up in a situation like that. It may even be worth creating a weekly chaos monkey job that tries to flood the system with many big dumps to validate that doing so does not kill the system.
That change is upcoming. They are capping it in two ways: upload time is capped, and total dump size is capped at 6 GB. They have telemetry showing that, in helix, 6 GB is what's safely uploadable while still being able to do the work and report results without timing out. There are still two issues. The first is that the disk can still get full if many tests in a work item crash. The more concerning one is that 6 GB is big, but not crazy for a macOS system dump. createdump is a little better here, but often still too big; mini with private memory seems like what we want from a diagnosability perspective. I just don't know yet if that is going to cap us. I guess the best thing to do here is run a few experiments.
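To make the size cap concrete, here is a hypothetical sketch of a check a runner script could perform before handing dumps to the uploader. It is not how helix implements its cap: the function name, the `DUMP_DIR` variable, and the use of `du` are assumptions for illustration, and only the 6 GB figure comes from the comment above.

```shell
# Hypothetical helper: succeeds only if everything under $1 totals at most $2 KiB.
dumps_within_cap() {
    dir="$1"
    cap_kb="$2"
    # du -sk reports the total size in 1 KiB blocks on both Linux and macOS.
    used_kb=$(du -sk "$dir" | awk '{print $1}')
    [ "$used_kb" -le "$cap_kb" ]
}

# 6 GB expressed in KiB (the figure quoted in the comment above).
CAP_KB=$((6 * 1024 * 1024))
# DUMP_DIR is a hypothetical variable naming the folder that would be uploaded.
DUMP_DIR="${DUMP_DIR:-${TMPDIR:-/tmp}/coredumps}"
mkdir -p "$DUMP_DIR"

if dumps_within_cap "$DUMP_DIR" "$CAP_KB"; then
    echo "dump folder is within the cap; safe to upload"
else
    echo "dump folder exceeds the cap; skipping upload" >&2
fi
```

A check like this only limits what gets uploaded. It does not address the first concern above, since the disk can still fill up before the check runs.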
@hoyosjs would it make sense to try to do this with the libraries crash symbolization effort? What's involved?
For crashes of libraries? It would mean setting these variables (#65422 (comment)) in the wrapper, if not already present, so that the dumps are stored in the folder that helix uploads. You can then symbolize all the different dumps. For macOS, Jeremy is already staging work for it: https://github.com/dotnet/runtime/pull/92967/files
I meant - are you planning on re-enabling dumps in the places where they were disabled? Perhaps by adding these settings. In cases where it still might be too expensive to pull the dumps off the machine, maybe @ivdiazsa's tool could be used to just dump the relevant info to the log.
System dumps on macOS are large - uploading them has been taking down helix queues. Using the runtime's coredump features should allow for configuration such that we can get smaller and still useful dumps.
Since createdump is part of the test/payload, I think we should just be able to update the libraries runner template to set the dump configuration environment variables for DbgEnableMiniDump, DbgMiniDumpName, and DbgMiniDumpType (cc @mikem8361 @hoyosjs) instead of using ulimit:
runtime/eng/testing/RunnerTemplate.sh
Lines 130 to 136 in 8cd701a
See also:
#65405 (comment)
https://github.com/dotnet/core-eng/issues/15333
cc @danmoseley