On Windows, we currently capture the stack backtrace by calling into
`dbghelp!StackWalkEx` (or `dbghelp!StackWalk64` if the version of
`dbghelp` we loaded is too old to contain that function). This is very
convenient, since `StackWalkEx` handles everything for us, but there are
two issues with doing so:
1. `dbghelp` is not safe to use from multiple threads at the same time
so all calls into it must be serialized.
2. `StackWalkEx` returns inlined frames as if they were regular stack
frames, which requires loading debug info just to walk the stack. As a
result, simply capturing a backtrace without resolving it is much
more expensive on Windows than on *nix.
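
To illustrate the first issue, here is a minimal sketch of the kind of serialization this forces on every caller. The names `DBGHELP_LOCK` and `with_dbghelp` are hypothetical and are not taken from this change; the closure stands in for any call into `dbghelp`.

```rust
use std::sync::Mutex;

// Hypothetical process-wide lock: every call into `dbghelp` goes through it.
static DBGHELP_LOCK: Mutex<()> = Mutex::new(());

/// Runs `f` while holding the lock, so at most one thread is "inside
/// dbghelp" at any time.
fn with_dbghelp<R>(f: impl FnOnce() -> R) -> R {
    // A panic in another caller poisons the mutex. The guarded value is
    // `()`, so there is no state to corrupt and we can keep going.
    let _guard = DBGHELP_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    f()
}
```

With a scheme like this, threads that only want to capture a backtrace still queue behind each other, which is part of why avoiding `dbghelp` during capture matters.
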

This change rewrites our Windows support to call `RtlVirtualUnwind`
instead on platforms that support this API (`x86_64` and `aarch64`).
This API walks the actual (i.e., not inlined) stack frames, so it does
not require loading any debug info and is significantly faster. For
platforms that do not support `RtlVirtualUnwind` (i.e., `i686`), we fall
back to the current implementation, which calls into `dbghelp`.
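
The selection rule above can be sketched as follows. This is a runtime stand-in with illustrative names (`Walker`, `walker_for_arch`, `current_walker`); real code would more likely choose the implementation at compile time with `#[cfg(target_arch = "...")]`. Note that `i686` targets report their architecture as `"x86"`.

```rust
/// Which stack walker a given architecture would use (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Walker {
    /// `RtlVirtualUnwind`: physical frames only, no debug info needed.
    RtlVirtualUnwind,
    /// `dbghelp` (`StackWalkEx`, or `StackWalk64` on old versions).
    DbgHelp,
}

/// Maps a `target_arch` name to the walker described above.
fn walker_for_arch(arch: &str) -> Walker {
    match arch {
        "x86_64" | "aarch64" => Walker::RtlVirtualUnwind,
        // Everything else, including "x86" (i686), keeps the old path.
        _ => Walker::DbgHelp,
    }
}

/// The walker for the architecture this code was compiled for.
fn current_walker() -> Walker {
    walker_for_arch(std::env::consts::ARCH)
}
```
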
To recover the inlined frame information when we are asked to resolve
symbols, we use `SymAddrIncludeInlineTrace` to load debug info and
detect inlined frames and then `SymQueryInlineTrace` to get the
appropriate inline context to resolve them.
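
As a rough model of that resolve step, the sketch below reports the inlined frames at an address and then the physical frame. Everything here is hypothetical: the `DebugInfo` trait and `Toy` type only stand in for what `dbghelp` provides, no real `dbghelp` call is made, and the innermost-first ordering is an assumption.

```rust
/// Hypothetical view of the debug info that `dbghelp` loads on demand.
trait DebugInfo {
    /// Number of frames inlined at `addr`; plays the role of the count
    /// reported by `SymAddrIncludeInlineTrace`.
    fn inline_count(&self, addr: u64) -> u32;
    /// Name of inlined frame `index` at `addr` (0 is assumed innermost).
    fn inline_name(&self, addr: u64, index: u32) -> String;
    /// Name of the physical function containing `addr`.
    fn function_name(&self, addr: u64) -> String;
}

/// Reports every inlined frame at `addr`, then the physical frame itself.
fn resolve(info: &dyn DebugInfo, addr: u64, cb: &mut dyn FnMut(&str)) {
    for index in 0..info.inline_count(addr) {
        cb(&info.inline_name(addr, index));
    }
    cb(&info.function_name(addr));
}

/// Toy debug info: address 0x1000 has two inlined callees, others none.
struct Toy;

impl DebugInfo for Toy {
    fn inline_count(&self, addr: u64) -> u32 {
        if addr == 0x1000 { 2 } else { 0 }
    }
    fn inline_name(&self, _addr: u64, index: u32) -> String {
        format!("inlined_{index}")
    }
    fn function_name(&self, addr: u64) -> String {
        format!("fn_at_{addr:#x}")
    }
}
```

The point of the shape is that the stack walk only ever produces one address per physical frame, and the extra inlined frames are discovered later, when (and only if) the caller asks for symbols.
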
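The result is a significant performance improvement to backtrace capture
and symbolizing on Windows: in the benchmarks below, capturing without
resolving (`trace`, `new_unresolved`) drops from over 300 µs to around
1 µs per iteration, and the benchmarks that also resolve symbols take
roughly 10-35% less time.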
Before:
```
> cargo +nightly bench
Running benches\benchmarks.rs
running 6 tests
test new ... bench: 658,652 ns/iter (+/- 30,741)
test new_unresolved ... bench: 343,240 ns/iter (+/- 13,108)
test new_unresolved_and_resolve_separate ... bench: 648,890 ns/iter (+/- 31,651)
test trace ... bench: 304,815 ns/iter (+/- 19,633)
test trace_and_resolve_callback ... bench: 463,645 ns/iter (+/- 12,893)
test trace_and_resolve_separate ... bench: 474,290 ns/iter (+/- 73,858)
test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 0 filtered out; finished in 8.26s
```
After:
```
> cargo +nightly bench
Running benches\benchmarks.rs
running 6 tests
test new ... bench: 495,468 ns/iter (+/- 31,215)
test new_unresolved ... bench: 1,241 ns/iter (+/- 251)
test new_unresolved_and_resolve_separate ... bench: 436,730 ns/iter (+/- 32,482)
test trace ... bench: 850 ns/iter (+/- 162)
test trace_and_resolve_callback ... bench: 410,790 ns/iter (+/- 19,424)
test trace_and_resolve_separate ... bench: 408,090 ns/iter (+/- 29,324)
test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 0 filtered out; finished in 7.02s
```
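
For anyone who wants the ratios rather than the raw timings, this small helper computes them. The figures are copied from the two runs above; the `speedup` and `report` names are just for illustration.

```rust
/// How many times faster `after_ns` is than `before_ns` (both in ns/iter).
fn speedup(before_ns: f64, after_ns: f64) -> f64 {
    before_ns / after_ns
}

/// (benchmark, before ns/iter, after ns/iter), copied from the runs above.
const RESULTS: [(&str, f64, f64); 6] = [
    ("new", 658_652.0, 495_468.0),
    ("new_unresolved", 343_240.0, 1_241.0),
    ("new_unresolved_and_resolve_separate", 648_890.0, 436_730.0),
    ("trace", 304_815.0, 850.0),
    ("trace_and_resolve_callback", 463_645.0, 410_790.0),
    ("trace_and_resolve_separate", 474_290.0, 408_090.0),
];

/// One line per benchmark, formatted as "<name>: <ratio>x".
fn report() -> Vec<String> {
    RESULTS
        .iter()
        .map(|(name, before, after)| format!("{}: {:.1}x", name, speedup(*before, *after)))
        .collect()
}
```

These are single runs with the variance shown above, so the ratios are indicative rather than precise.
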
The changes to the symbolize step also allow us to report inlined frames
when resolving from just an instruction address, which was not previously
possible.