Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc(feature): 0063 SDK Crash Monitoring #63

Merged
merged 20 commits into from
Jan 27, 2023
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,3 +27,4 @@ This repository contains RFCs and DACIs. Lost?
- [0042-gocd-succeeds-freight-as-our-cd-solution](text/0042-gocd-succeeds-freight-as-our-cd-solution.md): Plan to replace freight with GoCD
- [0047-introduce-profile-context](text/0047-introduce-profile-context.md): Add Profile Context
- [0048-move-replayid-out-of-tags](text/0048-move-replayid-out-of-tags.md): Plan to replace freight with GoCD
- [0063-sdk-crash-monitoring](text/0063-sdk-crash-monitoring.md): SDK Crash Monitoring
81 changes: 81 additions & 0 deletions text/0063-sdk-crash-monitoring.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
# SDK Crash Monitoring

- Start Date: 2023-01-18
- RFC Type: feature
- RFC PR: [#33](https://github.com/getsentry/rfcs/pull/33)
philipphofmann marked this conversation as resolved.
Show resolved Hide resolved
- RFC Status: draft
- RFC Driver: [Philipp Hofmann](https://github.com/philipphofmann)
- RFC Approver: tbd

## Summary

This RFC aims to detect crashes caused by our SDKs to improve reliability.

## Motivation

As an APM company, the reliability of our SDKs is one of our most essential quality goals. If our SDK breaks the customer, we fail. Our SDK philosophy refers to this as [degrade gracefully](https://develop.sentry.dev/sdk/philosophy/#degrade-gracefully).

For some SDKs, like mobile SDKs, we primarily rely on users to reach out when our SDKs cause crashes cause we don't operate them in production. If users don't report them, we are unaware. Instead, we should detect crashes caused by our SDKs when they happen so we can proactively fix them.

## Background

The Google Play SDK Console provides insights into crashes for SDK maintainers. We regularly use it for the Android/Java SDK. While it would be great also to build something similar for SDK maintainers within Sentry, it's a bit complicated cause of PII and such. Narrowing down the scope to only Sentry SDKs makes the problem easier to solve.

## Options Considered

- [Option 1: SDKs report FIOMT as errors](#option-1)
- [Option 2: SDKs report FIOMT as performance issues](#option-2)
- [Option 3: Ingest reports FIOMT as performance issues](#option-3)
philipphofmann marked this conversation as resolved.
Show resolved Hide resolved

### Option 1: Detect during event processing <a name="option-1"></a>

During event processing, after processing the stacktrace, the server detects if a crash stems from any of our SDKs by looking at the top frames of the stacktrace. If the server detects that it does, it could duplicate the event and store it in a special-cased sentry org where each SDK gets its project.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maintain data locality with this approach (ie. the hybrid cloud project)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent point, @mdtro. That's a problem to solve. @jjbayer, can you maybe answer that question ⬆️ ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it could be solved by actually posting HTTP requests to sentry.sentry.io to create the new issues (i.e. all derived errors go to the same region), but that might be problematic for performance. We can always heavily sample the number of derived errors we send though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdtr @jjbayer, why do we need data locality?


A good candidate to add this functionality is the `event_manager`. Similarly, where we call [`_detect_performance_problems`](https://github.com/getsentry/sentry/blob/4525f70a1fb521445bbb4c9250b2e15e05b059c3/src/sentry/event_manager.py#L2461), we could add an extra function called `detect_sdk_crashes`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option 1 seems to be the only viable option if we want this to evolve into a general by-product for library / SDK maintainers (see "Background"). However, if all we care about in the foreseeable future are sentry SDK crashes, then Option 2 is probably easier to implement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to evolve into a general by-product for library / SDK maintainers

Yes, one of the main ideas would be that Sentry SDK maintainers get it for free.


As we’d only make this visible to Sentry employees, we might not have to strip out any data cause of data privacy reasons, as employees could view the original events anyways. Of course, employees would need to go through the _admin portal and form.

### Pros <a name="option-1-pros"></a>

1. No need for per-SDK rollout.
2. No extra events or data sent to Sentry.

### Cons <a name="option-1-cons"></a>

1. Please add your cons.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some SDKs remove the sentry frames, such as Dart, Java and likely more.
The reason is, our frames are always part of the stack trace since its responsible for capturing and parsing it, we could start sending them and hiding them on the product tho.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also do something similar on Cocoa, but not for unhandled errors. Do you do that as well for unhandled errors, @marandaneto?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, Sentry frames are never there.


### Option 2: Detect in SDKs <a name="option-2"></a>

When the SDK sends a crash event to Sentry, it checks the stacktrace and checks if the crash stems from the SDK itself by looking at the top frames of the stacktrace. If it does, the SDK duplicates the event and sends it to a special-cased sentry org where each SDK gets its project.

### Cons <a name="option-2-cons"></a>
philipphofmann marked this conversation as resolved.
Show resolved Hide resolved

1. Opposite of [Pro 1 of option 1](#option-1-pro).
2. Extra events sent from SDKs to Sentry.
3. Changing the DSN needs an SDK update.

### Option 3: Client Reports <a name="option-3"></a>
philipphofmann marked this conversation as resolved.
Show resolved Hide resolved

Similar to option 2, we use client reports instead of sending events. We would need the entire event to get enough context to fix a crash. So basically, we would add the crash event to client reports.

### Pros

1. Opposite of [Con 2 of option 2](#option-2-cons).

### Cons

1. Opposite of [Pro 1 of option 1](#option-1-pro).
2. [Con 3 of option 2](#option-2-cons).
3. Extend the protocol of client reports.
4. `The client reports feature doesn't expect 100 percent correct numbers, and it is acceptable for the SDKs to lose a small number of client reports`, [develop docs](https://develop.sentry.dev/sdk/client-reports/#sdk-side-recommendations). So the SDK might drop some crashes, but maybe that's acceptable.
5. `Bugs in our SDKs are out of scope for client reports and are not being tracked using client reports at the moment`, [develop-docs](https://develop.sentry.dev/sdk/client-reports/#basic-operation).
philipphofmann marked this conversation as resolved.
Show resolved Hide resolved


## Drawbacks

Why should we not do this? What are the drawbacks of this RFC or a particular option if
multiple options are presented.

## Unresolved questions

- Do we need to strip sensitive values from the events because of PII or is going through the _admin portal enough?
philipphofmann marked this conversation as resolved.
Show resolved Hide resolved