fix: catch uncaught exceptions & gc handles request aborts #102

SgtPooki · 2024-03-29T22:04:13Z

Title

fix: catch uncaught exceptions & gc handles request aborts

Description

A few changes here from #18 (comment)

catch uncaughtExceptions and handle if allowlisted
catch unhandledRejections and handle if allowlisted
handle empty strings in FILE_DATASTORE_PATH & FILE_BLOCKSTORE_PATH
pull out logic for creating a request aware signal
helia.gc is given a request aware signal
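
For the empty-string item above, a minimal sketch of the idea, not the code in this PR. The helper name `optionalPath` and the literal `env` object are assumptions for illustration; only the two variable names come from the description.

```typescript
// Hypothetical helper: treat unset, empty, or whitespace-only values as "not configured".
function optionalPath (value: string | undefined): string | undefined {
  const trimmed = value?.trim()
  return trimmed == null || trimmed === '' ? undefined : trimmed
}

// A literal object stands in for process.env so the sketch is self-contained.
const env: Record<string, string | undefined> = {
  FILE_DATASTORE_PATH: '',
  FILE_BLOCKSTORE_PATH: '/data/blocks'
}

// An empty string becomes undefined, so callers can fall back to an in-memory store.
const datastorePath = optionalPath(env.FILE_DATASTORE_PATH)
const blockstorePath = optionalPath(env.FILE_BLOCKSTORE_PATH)
```

With this shape, a check such as `datastorePath != null` is enough to decide whether a file-backed store should be created.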

Notes & open questions

A lot of the fixes here come from discoveries made while investigating #18.

Change checklist

I have performed a self-review of my own code
I have made corresponding changes to the documentation if necessary (this includes comments as well)
I have added tests that prove my fix is effective or that my feature works

src/helia-server.ts

achingbrain · 2024-04-04T15:36:38Z

src/index.ts

+const uncaughtHandler = (error: any): void => {
+  log.error('Uncaught Exception:', error)
+  if (ALLOW_UNHANDLED_ERROR_RECOVERY && (RECOVERABLE_ERRORS === 'all' || RECOVERABLE_ERRORS.includes(error?.code) || RECOVERABLE_ERRORS.includes(error?.name))) {
+    log.trace('Ignoring error')


This isn't considered best practice - https://nodejs.org/api/process.html#warning-using-uncaughtexception-correctly

I understand, but we need some kind of error recovery instead of allowing things to just die like they were in Tiros. We could probably default this to FALSE and add a warning in the readme. Tiros needs this and probably some restarting of the server (to follow best practices).
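
A rough sketch of what "default this to FALSE" could look like. `ALLOW_UNHANDLED_ERROR_RECOVERY` and `RECOVERABLE_ERRORS` are the names from the diff above; how they are parsed (a `'true'` string and a comma-separated list) and the helper names `parseRecoveryOptions` and `shouldIgnore` are assumptions, not what the PR does.

```typescript
type RecoverableErrors = 'all' | string[]

interface RecoveryOptions {
  allowRecovery: boolean
  recoverable: RecoverableErrors
}

// Assumed parsing: recovery is off unless explicitly set to 'true'.
function parseRecoveryOptions (env: Record<string, string | undefined>): RecoveryOptions {
  const allowRecovery = env.ALLOW_UNHANDLED_ERROR_RECOVERY === 'true'
  const raw = (env.RECOVERABLE_ERRORS ?? '').trim()
  const recoverable: RecoverableErrors = raw === 'all'
    ? 'all'
    : raw.split(',').map(s => s.trim()).filter(s => s !== '')

  return { allowRecovery, recoverable }
}

// Returns true only when recovery is enabled and the error is allowlisted by code or name.
function shouldIgnore (error: unknown, opts: RecoveryOptions): boolean {
  if (!opts.allowRecovery) {
    return false
  }

  if (opts.recoverable === 'all') {
    return true
  }

  const { code, name } = (error ?? {}) as { code?: string, name?: string }

  return (code != null && opts.recoverable.includes(code)) ||
    (name != null && opts.recoverable.includes(name))
}
```

Matching against an array of exact codes or names avoids accidental substring matches, and an error with no `code` never matches.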

Unhandled exceptions inherently mean that an application is in an undefined state.

I started to just listen for ERR_STREAM_PREMATURE_CLOSE which we know is a recoverable state. "in an undefined state" is not true in this instance.

This change allows us to recover from anything, which could cause problems in the future:

There is a case where some error in libp2p/helia/fastify/helia-server.ts could happen that is unrecoverable, and this does an infinite loop of "On error resume next" and we don't find out until money has been eaten up running a dead service.

However, we still need to block this server from dying in instances where we know we can safely recover, and unblocking Tiros was foremost on my mind.

The point of the linked warning is that you can't tell if you can safely recover so the only safe thing you can do is exit the process and restart.

If we have unhandled exceptions being thrown, these are bugs that should be fixed.

I agree they're bugs that should be fixed, but should we allow a server to die because of bugs in underlying libraries if they're recoverable? Given the expectation of helia-http-server, I don't think so. It's supposed to fetch content using helia/libp2p and return a result. If they die when fetching, we should certainly return a 500 error instead, and recover, right?

BTW, the only other place I saw ERR_STREAM_PREMATURE_CLOSE was in https://github.com/ChainSafe/js-libp2p-yamux, which is listening for those errors, so I don't think the uncaught exception is coming from there. It's likely coming from somewhere else in the libp2p stack.

Edit: or the ERR_STREAM_PREMATURE_CLOSE error is coming from the fastify req/resp.

I just realized we should add a listener on the req and resp stream for ERR_STREAM_PREMATURE_CLOSE...
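
A minimal sketch of that idea, assuming the error surfaces as an `'error'` event on the request or response stream (this thread does not confirm that). The helper name `ignorePrematureClose` and the `onOther` callback are hypothetical.

```typescript
import { EventEmitter } from 'node:events'

// Hypothetical helper: swallow ERR_STREAM_PREMATURE_CLOSE on a stream and
// forward every other error to the caller-supplied handler.
function ignorePrematureClose (stream: EventEmitter, onOther: (err: Error) => void): void {
  stream.on('error', (err: Error & { code?: string }) => {
    if (err.code === 'ERR_STREAM_PREMATURE_CLOSE') {
      // The client went away mid-request; nothing else to clean up here.
      return
    }

    onOther(err)
  })
}
```

In a fastify handler this would presumably be attached to both `request.raw` and `reply.raw`, which are Node.js streams and therefore event emitters.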

updated #112

should we allow a server to die given underlying libraries bugs if they're recoverable?

Again, the point of the linked warning is that you can't tell if you can safely recover.

Consider something like this:

    process.on('uncaughtException', (err) => {
      if (err.message === 'recoverable') {
        // it's ok, it's recoverable
        return
      }

      console.error(err)
      process.exit(1)
    })

    const fd = await fs.open('/path/to/file.txt')

    // something that causes an error
    throw new Error('recoverable')

    fd.close()

It looks recoverable, but the file descriptor is never closed so it leaks memory.

Basically if you're in an uncaught exception handler all bets are off.
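
For contrast, a sketch of handling the error where the resource is owned, so cleanup does not depend on a process-level handler. The `withFile` helper is hypothetical and only illustrates the `try`/`finally` shape.

```typescript
import { open } from 'node:fs/promises'
import type { FileHandle } from 'node:fs/promises'

// Hypothetical helper: open a file, run `use`, and always close the handle,
// whether `use` resolves or throws.
async function withFile<T> (path: string, use: (fd: FileHandle) => Promise<T>): Promise<T> {
  const fd = await open(path, 'r')

  try {
    return await use(fd)
  } finally {
    // Runs on success and on error, so the descriptor is not leaked.
    await fd.close()
  }
}
```

The error still propagates to the caller, which can decide to return a 500 for that one request, but the file descriptor is released either way.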

achingbrain · 2024-04-04T15:37:48Z

src/helia-server.ts

@@ -272,8 +266,26 @@ export class HeliaServer {
     */
    request.raw.on('close', cleanupFn)

+    if (timeout != null) {
+      setTimeout(() => {


Timeouts like this use resources and can keep the process running. Better to use AbortSignal.timeout(ms) and combine signals with any-signal.
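
A sketch of that suggestion. To stay dependency-free it uses the built-in `AbortSignal.any` (available in recent Node.js releases) in place of any-signal; that swap, the function name `createRequestAwareSignal`, and its parameters are assumptions.

```typescript
// Hypothetical helper: combine a request's abort signal with an optional timeout.
// AbortSignal.any is used here as a stand-in for the any-signal package.
function createRequestAwareSignal (requestSignal: AbortSignal, timeoutMs?: number): AbortSignal {
  if (timeoutMs == null) {
    return requestSignal
  }

  // No manual setTimeout/clearTimeout bookkeeping is needed.
  return AbortSignal.any([requestSignal, AbortSignal.timeout(timeoutMs)])
}
```

The returned signal aborts when the client disconnects or the timeout elapses, whichever comes first, and can be passed to calls such as `helia.gc` that accept a signal.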

I should have cleared the timeout here, good catch, thanks. any-signal does not prevent duplicate handlers from being added to the same signal, so I prefer not to use it.

I will open an issue to update this to AbortSignal.timeout(50) and addEventListener.
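
A small sketch of the duplicate-handler point, assuming the plan is to attach listeners directly. Per the DOM `EventTarget` rules, adding the same function reference for the same event type a second time is a no-op. The helper name `onAbort` is hypothetical.

```typescript
// Hypothetical helper: register an abort listener that fires at most once.
function onAbort (signal: AbortSignal, listener: () => void): void {
  signal.addEventListener('abort', listener, { once: true })
}

const demoController = new AbortController()
let calls = 0
const cleanup = (): void => { calls++ }

// Registering the same function reference twice only adds it once.
onAbort(demoController.signal, cleanup)
onAbort(demoController.signal, cleanup)

demoController.abort()
```

Note that this only deduplicates identical function references; two separate closures with the same body are still treated as two listeners.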

any-signal does not prevent duplicate handlers from being added to the same signal, so I prefer not to use it.

Are you passing the same signal to anySignal multiple times? Sounds like a bug if so.

I'm not, but I know libraries that do :)

If you are not, then it should be okay to use it.

If you know of libraries that do, can you please open issues or better yet, PRs?

SgtPooki added 2 commits March 29, 2024 15:01

fix: catch uncaught exceptions & gc handles request aborts

ede443f

feat: allow configurable recovery errors

75d0d01

SgtPooki commented Mar 29, 2024


chore: remove duplicate gc call

49dcd6e

SgtPooki merged commit e70742f into main Mar 29, 2024
4 checks passed

SgtPooki deleted the feat/signal-handling branch March 29, 2024 22:45

achingbrain reviewed Apr 4, 2024


This was referenced Apr 4, 2024

bug: replace setTimeout with AbortSignal.timeout #111

Open

fix: uncaughtException needs more best practices #112

Closed
