rfc(feature): 0063 SDK Crash Monitoring #63

philipphofmann · 2023-01-18T10:06:35Z

Proposal for detecting crashes caused by our SDKs to improve reliability.

marandaneto · 2023-01-18T10:12:02Z

text/0063-sdk-crash-monitoring.md

+
+### Cons <a name="option-1-cons"></a>
+
+1. Please add your cons.


Some SDKs, such as Dart and Java (and likely more), remove the Sentry frames.
The reason is that our frames are always part of the stack trace, since the SDK is responsible for capturing and parsing it. We could start sending them and hide them in the product, though.

We also do something similar on Cocoa, but not for unhandled errors. Do you do that as well for unhandled errors, @marandaneto?

Nope, Sentry frames are never there.

mdtro · 2023-01-18T19:01:17Z

text/0063-sdk-crash-monitoring.md

+
+### Option 1: Detect during event processing <a name="option-1"></a>
+
+During event processing, after processing the stacktrace, the server detects if a crash stems from any of our SDKs by looking at the top frames of the stacktrace. If the server detects that it does, it could duplicate the event and store it in a special-cased sentry org where each SDK gets its project.


Can we maintain data locality with this approach (i.e., the hybrid cloud project)?

Excellent point, @mdtro. That's a problem to solve. @jjbayer, can you maybe answer that question ⬆️ ?

I guess it could be solved by actually posting HTTP requests to sentry.sentry.io to create the new issues (i.e. all derived errors go to the same region), but that might be problematic for performance. We can always heavily sample the number of derived errors we send though.
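The heavy sampling mentioned above could look roughly like the following sketch. The function name `should_forward_derived_error` and its `sample_rate` parameter are made up for illustration and are not part of any Sentry API.

```python
import random
from typing import Optional


def should_forward_derived_error(
    sample_rate: float, rng: Optional[random.Random] = None
) -> bool:
    """Decide whether to forward one derived SDK-crash event.

    Hypothetical helper: keeps roughly `sample_rate` (0.0 to 1.0) of events.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    # random() returns a float in [0.0, 1.0), so a rate of 0.0 never forwards
    # and a rate of 1.0 always forwards.
    roll = rng.random() if rng is not None else random.random()
    return roll < sample_rate
```

A low rate such as `0.01` would keep the extra cross-region traffic small while still surfacing recurring SDK crashes.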

@mdtro @jjbayer, why do we need data locality?

jjbayer · 2023-01-19T11:01:21Z

text/0063-sdk-crash-monitoring.md

+
+### Option 2: Detect in SDKs <a name="option-2"></a>
+
+When the SDK sends a crash event to Sentry, it checks the stacktrace and checks if the crash stems from the SDK itself by looking at the top frames of the stacktrace. If it does, the SDK also sends the event to a special-cased sentry org, where each SDK gets its project.


Option 2 seems the most straightforward to implement, but it would require a public DSN for every SDK project, which might get polluted by third-party forks of SDKs.

jjbayer · 2023-01-19T11:10:37Z

text/0063-sdk-crash-monitoring.md

+
+During event processing, after processing the stacktrace, the server detects if a crash stems from any of our SDKs by looking at the top frames of the stacktrace. If the server detects that it does, it duplicates the event and stores it in a special-cased sentry org where each SDK gets its project.
+
+A good candidate to add this functionality is the `event_manager`. Similarly, where we call [`_detect_performance_problems`](https://github.com/getsentry/sentry/blob/4525f70a1fb521445bbb4c9250b2e15e05b059c3/src/sentry/event_manager.py#L2461), we could add an extra function called `detect_sdk_crashes`.
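As a rough illustration of what such a `detect_sdk_crashes` hook might do, here is a minimal sketch. It is not the actual Sentry implementation: the simplified event shape, the `SDK_CRASH_CONFIG` mapping, the project name, and the `top_n` parameter are all assumptions, and frames are assumed to be ordered from oldest to newest call so that the top frames sit at the end of the list.

```python
from copy import deepcopy
from typing import Any, Dict, Optional

# Hypothetical configuration: which frame module prefixes belong to an SDK and
# which internal project collects its crashes. None of these names are real.
SDK_CRASH_CONFIG: Dict[str, Dict[str, Any]] = {
    "sentry.cocoa": {"module_prefixes": ("Sentry",), "project": "sdk-crashes-cocoa"},
}


def detect_sdk_crashes(
    event: Dict[str, Any], top_n: int = 3
) -> Optional[Dict[str, Any]]:
    """Return a copy of `event` routed to the SDK's project, or None."""
    sdk_name = (event.get("sdk") or {}).get("name")
    config = SDK_CRASH_CONFIG.get(sdk_name)
    if config is None:
        return None  # unknown SDK, nothing to detect

    frames = (event.get("stacktrace") or {}).get("frames") or []
    # Assumption: frames run oldest to newest, so the top frames are last.
    top_frames = frames[-top_n:]
    prefixes = config["module_prefixes"]
    if not any((f.get("module") or "").startswith(prefixes) for f in top_frames):
        return None  # crash did not originate in the SDK

    # Duplicate the event so the customer's original stays untouched.
    duplicate = deepcopy(event)
    duplicate["project"] = config["project"]
    return duplicate
```

Returning a copy rather than mutating the event keeps the customer's event processing independent of the SDK crash path.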


Option 1 seems to be the only viable option if we want this to evolve into a general by-product for library / SDK maintainers (see "Background"). However, if all we care about in the foreseeable future are sentry SDK crashes, then Option 2 is probably easier to implement.

> to evolve into a general by-product for library / SDK maintainers

Yes, one of the main ideas would be that Sentry SDK maintainers get it for free.

HazAT · 2023-01-19T15:59:59Z

My 2 cents: I vote for option 1.

Both options 2 and 3 produce a lot of busywork that, yet again, is very hard to align across all SDKs, and even then they will not get it 100% right.

I think one con to add to Option 1 is that there is no 100% guarantee, but one can argue this is the case for all options.

The way this could work is as simple as having a check:

if sdk == "sentry.javascript.browser" and frames.contain("@sentry")
    return true
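Spelled out in Python, the check above might look like this sketch. The assumption that each frame is a dict with a `filename` key is mine and only serves the illustration; the SDK name and the `@sentry` marker come from the pseudocode.

```python
from typing import Any, Dict, List


def is_sdk_crash(sdk_name: str, frames: List[Dict[str, Any]]) -> bool:
    """Return True if a browser JS event has any frame from an @sentry package."""
    if sdk_name != "sentry.javascript.browser":
        return False
    # Assumption: the package path shows up in the frame's filename.
    return any("@sentry" in (frame.get("filename") or "") for frame in frames)
```

For example, `is_sdk_crash("sentry.javascript.browser", [{"filename": "node_modules/@sentry/browser/esm/sdk.js"}])` returns `True`, while the same frames reported by a different SDK return `False`.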

mdtro · 2023-01-19T16:09:44Z

We need to make sure we have consent from the client to opt in to sending these reports to us. My first thought was a toggle in the project settings in the Sentry dashboard. Thoughts?

HazAT · 2023-01-19T23:01:36Z

Re @mdtro: do we really, though?
We are already processing their stack traces; we are just incrementing a counter if the top frame is from our library.
I don't see why we need to ask anyone for permission.

Or not just a counter, but we only look at and care about our code.
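To make the "counter" and "only our code" ideas concrete, here is a small sketch. Everything in it is hypothetical: the `@sentry` filename marker, the `crash_counts` store, and the function names are illustrative, and frames are assumed to be ordered oldest to newest, so the top frame is the last one.

```python
from collections import Counter
from typing import Any, Dict, List

SDK_MARKER = "@sentry"  # assumption: marks frames that belong to our SDK

# Hypothetical in-memory counter of SDK crashes, keyed by SDK name.
crash_counts: Counter = Counter()


def is_sdk_frame(frame: Dict[str, Any]) -> bool:
    return SDK_MARKER in (frame.get("filename") or "")


def record_if_sdk_crash(
    sdk_name: str, frames: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Count the crash if the top frame is ours and return only our frames."""
    if not frames or not is_sdk_frame(frames[-1]):
        return []
    crash_counts[sdk_name] += 1
    # Drop every customer frame so that only SDK code is retained.
    return [frame for frame in frames if is_sdk_frame(frame)]
```

Keeping only the SDK's own frames means no customer code would need to be stored for this purpose.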

kahest · 2023-01-23T16:21:33Z

Looking at this strictly from the SDK monitoring PoV, a big pro of option 1 IMO is that detection isn't performed on the SDK side. A large part of the rationale for this initiative is to find out whether SDKs do something wrong that we are not aware of, and adding code in the SDK to solve this seems a bit counterintuitive and could have a lot of limitations and drawbacks.

marandaneto · 2023-01-23T16:47:21Z

One drawback that is not written down yet is that this would only work for errors where the SDK is still able to operate and send the event to Sentry.
When the SDK has a major bug in the transport layer or similar, it won't even be able to send the event.
I'm just trying to understand whether this is really beneficial and would help a lot, or only with minor issues.

philipphofmann · 2023-01-25T08:16:16Z

> One drawback that is not written down yet is that this would only work for errors where the SDK is still able to operate and send the event to Sentry. When the SDK has a major bug in the transport layer or similar, it won't even be able to send the event. I'm just trying to understand whether this is really beneficial and would help a lot, or only with minor issues.

@marandaneto, I think our automated tests must surface such severe bugs. The goal of this proposal is definitely not to help surface them. This solution will be useful for edge-case crashes that are hard to detect during CI. I clarified this with 9eeb549.

kahest

Great outcome, thanks to all participants!

philipphofmann added 2 commits January 18, 2023 11:04

rfc(feature): 0063 SDK Crash Monitoring

2eaec31

Proposal for detecting crashes caused by our SDKs to improve reliability.

cleanup

f2738e1

marandaneto reviewed Jan 18, 2023

fix background

59f9ab7

philipphofmann marked this pull request as draft January 18, 2023 10:15

silent1mezzo reviewed Jan 18, 2023

mdtro reviewed Jan 18, 2023

update list of options

3f5955a

mdtro reviewed Jan 18, 2023

philipphofmann added 4 commits January 19, 2023 09:22

review feedback

07b159c

code review

1545d84

cleanup

ce536f2

change status to active

09871ac

philipphofmann marked this pull request as ready for review January 19, 2023 08:57

SDK crash health

9594427

antonpirker reviewed Jan 19, 2023

jjbayer reviewed Jan 19, 2023

philipphofmann added 3 commits January 20, 2023 10:02

Include feedback

274c5af

Merge branch 'main' into rfc/sdk-crash-monitoring

4ea09b8

one more pro for option 1

99104e2

kahest reviewed Jan 23, 2023

philipphofmann and others added 3 commits January 25, 2023 08:55

Update text/0063-sdk-crash-monitoring.md

46b06fb

Co-authored-by: Karl Heinz Struggl <kahest@users.noreply.github.com>

Merge branch 'main' into rfc/sdk-crash-monitoring

53e1a13

Add legal perspective

d3a9cb3

Add Manoels drawback

213f4be

philipphofmann added 4 commits January 25, 2023 09:21

severe bugs

9eeb549

Clarified that not signed off by legal yet

c5ac335

decision

1ef1a86

Add approver

fcd8603

kahest approved these changes Jan 27, 2023

philipphofmann merged commit 6b50854 into main Jan 27, 2023

philipphofmann deleted the rfc/sdk-crash-monitoring branch January 27, 2023 10:57

mitsuhiko restored the rfc/sdk-crash-monitoring branch February 3, 2023 08:32

philipphofmann mentioned this pull request Feb 9, 2023

Sentry SDK Crash Detection for Cocoa getsentry/sentry#44342

Closed



