Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc(feature): 0063 SDK Crash Monitoring #63

Merged
merged 20 commits into from
Jan 27, 2023

Conversation

philipphofmann
Copy link
Member

@philipphofmann philipphofmann commented Jan 18, 2023

Proposal for detecting crashes caused by our SDKs to improve reliability.

Rendered RFC

Proposal for detecting crashes caused by our SDKs to improve
reliability.

### Cons <a name="option-1-cons"></a>

1. Please add your cons.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some SDKs remove the sentry frames, such as Dart, Java and likely more.
The reason is, our frames are always part of the stack trace since its responsible for capturing and parsing it, we could start sending them and hiding them on the product tho.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also do something similar on Cocoa, but not for unhandled errors. Do you do that as well for unhandled errors, @marandaneto?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, Sentry frames are never there.

@philipphofmann philipphofmann marked this pull request as draft January 18, 2023 10:15

### Option 1: Detect during event processing <a name="option-1"></a>

During event processing, after processing the stacktrace, the server detects if a crash stems from any of our SDKs by looking at the top frames of the stacktrace. If the server detects that it does, it could duplicate the event and store it in a special-cased sentry org where each SDK gets its project.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maintain data locality with this approach (ie. the hybrid cloud project)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent point, @mdtro. That's a problem to solve. @jjbayer, can you maybe answer that question ⬆️ ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it could be solved by actually posting HTTP requests to sentry.sentry.io to create the new issues (i.e. all derived errors go to the same region), but that might be problematic for performance. We can always heavily sample the number of derived errors we send though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdtr @jjbayer, why do we need data locality?

@philipphofmann philipphofmann marked this pull request as ready for review January 19, 2023 08:57
text/0063-sdk-crash-monitoring.md Show resolved Hide resolved
text/0063-sdk-crash-monitoring.md Outdated Show resolved Hide resolved

### Option 2: Detect in SDKs <a name="option-2"></a>

When the SDK sends a crash event to Sentry, it checks the stacktrace and checks if the crash stems from the SDK itself by looking at the top frames of the stacktrace. If it does, the SDK also sends the event to a special-cased sentry org, where each SDK gets its project.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option 2 seems the most straight forward to implement, but it would require a public DSN for every SDK project, which might get polluted by third-party forks of SDKs.

text/0063-sdk-crash-monitoring.md Show resolved Hide resolved

During event processing, after processing the stacktrace, the server detects if a crash stems from any of our SDKs by looking at the top frames of the stacktrace. If the server detects that it does, it duplicates the event and stores it in a special-cased sentry org where each SDK gets its project.

A good candidate to add this functionality is the `event_manager`. Similarly, where we call [`_detect_performance_problems`](https://github.com/getsentry/sentry/blob/4525f70a1fb521445bbb4c9250b2e15e05b059c3/src/sentry/event_manager.py#L2461), we could add an extra function called `detect_sdk_crashes`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option 1 seems to be the only viable option if we want this to evolve into a general by-product for library / SDK maintainers (see "Background"). However, if all we care about in the foreseeable future are sentry SDK crashes, then Option 2 is probably easier to implement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to evolve into a general by-product for library / SDK maintainers

Yes, one of the main ideas would be that Sentry SDK maintainers get it for free.

@HazAT
Copy link
Member

HazAT commented Jan 19, 2023

My 2 cents are, I vote for option 1.

Both options 2 & 3 produce a lot of busy work, that yet again is very hard to align across all SDKs and even then will not get it right a 100%.

I think one con to add to Option 1 is: no 100% guarantee - but one can argue this is the case for all options.

The way this could work is as simple as having a check:

if sdk == sentry.javascript.browser 
and
frames.contain "@sentry"
return true

@mdtro
Copy link
Member

mdtro commented Jan 19, 2023

It's necessary for us to make sure we have consent from the client to opt-in to sending these reports to us. My first thought was a toggle on the project settings in the Sentry dashboard. Thoughts?

@HazAT
Copy link
Member

HazAT commented Jan 19, 2023

re @mdtro do we really tho?
We are already processing their stack traces, we are just increasing a counter if the top frame is from our library.
I don't see why we need to ask anyone for permission.

or not just a counter - but we only look and care about our code.

@kahest
Copy link
Member

kahest commented Jan 23, 2023

Looking at this strictly from the SDK monitoring PoV, a big pro of option 1 IMO is that detection isn't performed on the SDK side. Large part of the rationale for this initiative is to find out if SDKs do something wrong which we are not aware about, and adding code in the SDK to solve this seems a bit counter intuitive and could have a lot of limitations and drawbacks.

@marandaneto
Copy link
Contributor

One drawback that is not written yet is that this would only work for errors that the SDK is still able to operate/send the event to Sentry.
When the SDK has a major bug that is in the transport layer or something, the SDK won't be able to even send the event.
Just trying to understand if this is really beneficial and would help a lot or just for minor issues.

@philipphofmann
Copy link
Member Author

philipphofmann commented Jan 25, 2023

One drawback that is not written yet is that this would only work for errors that the SDK is still able to operate/send the event to Sentry. When the SDK has a major bug that is in the transport layer or something, the SDK won't be able to even send the event. Just trying to understand if this is really beneficial and would help a lot or just for minor issues.

@marandaneto, I think our automated tests must surface such severe bugs. The goal of this proposal is definitely not to help to surface such severe bugs. This solution will be useful for edge case crashes hard to detect during CI. I clarified this with 9eeb549.

Copy link
Member

@kahest kahest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great outcome, thanks to all participants!

@philipphofmann philipphofmann merged commit 6b50854 into main Jan 27, 2023
@philipphofmann philipphofmann deleted the rfc/sdk-crash-monitoring branch January 27, 2023 10:57
@mitsuhiko mitsuhiko restored the rfc/sdk-crash-monitoring branch February 3, 2023 08:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants