-
-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfc(feature): 0063 SDK Crash Monitoring #63
Conversation
Proposal for detecting crashes caused by our SDKs to improve reliability.
text/0063-sdk-crash-monitoring.md
Outdated
|
||
### Cons <a name="option-1-cons"></a> | ||
|
||
1. Please add your cons. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some SDKs remove the sentry frames, such as Dart, Java and likely more.
The reason is, our frames are always part of the stack trace since its responsible for capturing and parsing it, we could start sending them and hiding them on the product tho.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also do something similar on Cocoa, but not for unhandled errors. Do you do that as well for unhandled errors, @marandaneto?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, Sentry frames are never there.
text/0063-sdk-crash-monitoring.md
Outdated
|
||
### Option 1: Detect during event processing <a name="option-1"></a> | ||
|
||
During event processing, after processing the stacktrace, the server detects if a crash stems from any of our SDKs by looking at the top frames of the stacktrace. If the server detects that it does, it could duplicate the event and store it in a special-cased sentry org where each SDK gets its project. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we maintain data locality with this approach (ie. the hybrid cloud project)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it could be solved by actually posting HTTP requests to sentry.sentry.io
to create the new issues (i.e. all derived errors go to the same region), but that might be problematic for performance. We can always heavily sample the number of derived errors we send though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
### Option 2: Detect in SDKs <a name="option-2"></a> | ||
|
||
When the SDK sends a crash event to Sentry, it checks the stacktrace and checks if the crash stems from the SDK itself by looking at the top frames of the stacktrace. If it does, the SDK also sends the event to a special-cased sentry org, where each SDK gets its project. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Option 2 seems the most straight forward to implement, but it would require a public DSN for every SDK project, which might get polluted by third-party forks of SDKs.
|
||
During event processing, after processing the stacktrace, the server detects if a crash stems from any of our SDKs by looking at the top frames of the stacktrace. If the server detects that it does, it duplicates the event and stores it in a special-cased sentry org where each SDK gets its project. | ||
|
||
A good candidate to add this functionality is the `event_manager`. Similarly, where we call [`_detect_performance_problems`](https://github.com/getsentry/sentry/blob/4525f70a1fb521445bbb4c9250b2e15e05b059c3/src/sentry/event_manager.py#L2461), we could add an extra function called `detect_sdk_crashes`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Option 1 seems to be the only viable option if we want this to evolve into a general by-product for library / SDK maintainers (see "Background"). However, if all we care about in the foreseeable future are sentry SDK crashes, then Option 2 is probably easier to implement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to evolve into a general by-product for library / SDK maintainers
Yes, one of the main ideas would be that Sentry SDK maintainers get it for free.
My 2 cents are, I vote for option 1. Both options 2 & 3 produce a lot of busy work, that yet again is very hard to align across all SDKs and even then will not get it right a 100%. I think one con to add to Option 1 is: no 100% guarantee - but one can argue this is the case for all options. The way this could work is as simple as having a check:
|
It's necessary for us to make sure we have consent from the client to opt-in to sending these reports to us. My first thought was a toggle on the project settings in the Sentry dashboard. Thoughts? |
re @mdtro do we really tho? or not just a counter - but we only look and care about our code. |
Looking at this strictly from the SDK monitoring PoV, a big pro of option 1 IMO is that detection isn't performed on the SDK side. Large part of the rationale for this initiative is to find out if SDKs do something wrong which we are not aware about, and adding code in the SDK to solve this seems a bit counter intuitive and could have a lot of limitations and drawbacks. |
One drawback that is not written yet is that this would only work for errors that the SDK is still able to operate/send the event to Sentry. |
Co-authored-by: Karl Heinz Struggl <kahest@users.noreply.github.com>
@marandaneto, I think our automated tests must surface such severe bugs. The goal of this proposal is definitely not to help to surface such severe bugs. This solution will be useful for edge case crashes hard to detect during CI. I clarified this with 9eeb549. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great outcome, thanks to all participants!
Proposal for detecting crashes caused by our SDKs to improve reliability.
Rendered RFC