Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add panic guards to callbacks #412

Merged
merged 2 commits into from
Aug 9, 2023

Conversation

spencercw
Copy link
Contributor

As discussed in #410. Mostly copied from the equivalent code in android-activity. Resolves #268.

Copy link
Member

@MarijnS95 MarijnS95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

ndk/src/utils.rs Outdated Show resolved Hide resolved
ndk/src/utils.rs Outdated
Comment on lines 34 to 36
// Use the Rust logger if installed and enabled, otherwise fall back to the Android system
// logger so there is at least some record of the panic
let use_log = log_enabled!(Level::Error);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rib we might want to port this back to android-activity.

ndk/src/utils.rs Outdated Show resolved Hide resolved
@MarijnS95 MarijnS95 merged commit ca8adb8 into rust-mobile:master Aug 9, 2023
18 checks passed
@spencercw spencercw deleted the abort-on-panic branch August 9, 2023 09:46
@MarijnS95
Copy link
Member

@spencercw not sure if worth/relevant here, but I learned about resume_unwind() today which is able to resume an unwind (via a type-erased boxed error) after we've caught it before entering an FFI boundary. Assuming we can sometimes channel (a raw pointer to) that boxed error back to the call site - and the callback is fired somewhere within Rust code - we could resume the unwind instead of plainly aborting.

Not that it adds much though; it's basically a panic as if you were to unwrap() the boxed error, but without invoking the panic hook, so no extra stacktrace is printed (but unwinding and drop handlers should continue?).

For example, I already want to port this code to android-activity and I think we had a use-case there where we use an FFI layer which is just Rust on both sides (or we deduced that the use of C symbols to go Rust -> Rust wasn't actually an FFI boundary - or there were just a few frames of non-Rust), and it might come in handy there.


Final note: I think we forgot to wrap the callback in Looper in this new helper :)
(And this is where we could - but probably not worth it - set aside the boxed error and resume_unwind() it in the poll functions)

MarijnS95 added a commit that referenced this pull request Aug 16, 2023
As with #412 we shouldn't let panics unwind into the FFI boundary; use
the new helper `abort_on_panic()` utility to catch these and abort the
process instead.
@spencercw
Copy link
Contributor Author

resume_unwind can be useful, but you need to be reasonably confident that you are actually going to be called somewhere that you can resume the panic. The MediaCodec async callback for example is tricky because if the callback function panics, then it's likely that the user's main thread won't be woken up and won't call any other MediaCodec functions, so the panic would be swallowed and the application will likely hang.

I do actually use resume_unwind quite a lot in my code to transport panics from worker threads. I have the main thread receiving events from the worker thread on a channel. If the recv fails then I know the worker thread must have exited, so at that point I join the thread which returns the panic in the Err, which I can then resume_unwind. This works nicely because the main thread is automatically woken up when the worker thread panics, just by virtue of the Sender being dropped when it goes out of scope.

I'm not super familiar with Looper, but it sounds like this is a scenario where resume_unwind could work.

MarijnS95 added a commit that referenced this pull request Aug 23, 2023
As with #412 we shouldn't let panics unwind into the FFI boundary; use
the new helper `abort_on_panic()` utility to catch these and abort the
process instead.
@MarijnS95
Copy link
Member

MarijnS95 commented Aug 31, 2023

Exactly, if there's no clear point where the callback is going to be called this won't work.

What's your thought about not getting a second backtrace from the point you called resume_unwind() back to the start of your application?


For Looper I am not sure either. There are these specific poll*() functions that report back whether they executed a callback, and because the looper is associated with a thread I think those callbacks run synchronously when the user calls those poll*() functions. In turn the NDK wrapper can simply check if an Err was stored after poll*() returns.


EDIT: https://developer.android.com/ndk/reference/group/looper#alooper_addfd

This method can be called on any thread. This method may block briefly if it needs to wake the poll.

Not sure if that means "any thread, as long as it's the thread that calls _poll*().


However, there are APIs like AndroidBitmap_compress() which call a callback (I think, not specified in the docs!) only for the duration of the function. This could simply be a Rust closure that stores the panic reason externally so that we can propagate it when the FFI function returns.

@spencercw
Copy link
Contributor Author

What's your thought about not getting a second backtrace from the point you called resume_unwind() back to the start of your application?

Not quite sure what you mean by this.

This method can be called on any thread. This method may block briefly if it needs to wake the poll.

Not sure if that means "any thread, as long as it's the thread that calls _poll*().

I presume this means you can submit file descriptors to a looper from any thread, but the polling will happen on the thread that the looper is actually running on.

However, there are APIs like AndroidBitmap_compress() which call a callback (I think, not specified in the docs!) only for the duration of the function. This could simply be a Rust closure that stores the panic reason externally so that we can propagate it when the FFI function returns.

This does seem like an ideal use case for resume_unwind.

@MarijnS95
Copy link
Member

Apologies for the late reply, it's been a while since I could find time.

Not quite sure what you mean by this.

resume_unwind() intentionally avoids the panic handler, which is the mechanism that prints panic traces to the log. When being used, the panic trace from a panic!() up to catch_unwind() is being printed, but the trace from resume_unwind() up to the entry point is not. For that we pretty much need to panic!()/unwrap() the returned boxed error, which may or may not be pretty/desired?

I presume this means you can submit file descriptors to a looper from any thread, but the polling will happen on the thread that the looper is actually running on.

Likely, but there's no wording excluding that. We could test to find out but no guarantees... I'll keep this in mind regardless as this API needs to be revisited anyway.

This does seem like an ideal use case for resume_unwind.

It is, hoping to submit the PR for this soon including mappings for DataSpace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

aaudio callbacks are unsound
2 participants