Report retrieval network errors #43

Closed
Tracked by #22 ...
bajtos opened this issue Nov 29, 2023 · 7 comments
Comments

@bajtos
Member

bajtos commented Nov 29, 2023

ETA: 2024-07-16

We are seeing measurements where start_at and finished_at are filled correctly, but:

  • first_byte_at and end_at are set to new Date(0)
  • status_code is null
  • timeout is false

We think these measurements are created when the request fetch('ipfs://...') fails before receiving response headers.

It would be great to get more insights into this case.
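A minimal sketch of the suspected code path, using illustrative names (the stats object shape and the placeholder URL are assumptions, not the actual Spark source):

// Hypothetical sketch of the suspected flow - not the actual Spark source.
const stats = {
  startAt: new Date(),
  firstByteAt: new Date(0), // only updated once response headers arrive
  endAt: new Date(0),       // only updated once the body has been read
  statusCode: null,
  timeout: false,
}

try {
  const res = await fetch('ipfs://bafy...') // placeholder URL
  stats.statusCode = res.status
  stats.firstByteAt = new Date()
  await res.arrayBuffer()
  stats.endAt = new Date()
} catch (err) {
  // A failure before any headers arrive leaves the defaults above untouched,
  // which would produce exactly the measurements described in this issue.
  console.error(err)
}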

@bajtos
Member Author

bajtos commented Nov 29, 2023

Currently, further error details are printed to the operator's console, via:

console.error(err)

Looking at https://developer.mozilla.org/en-US/docs/Web/API/fetch, the only error condition that makes sense here is "If fetch throws a network error." (unless we somehow provided invalid arguments to fetch).

It's not clear whether such a network error can be caused by the client, the server, or both. MDN says the following, which still leaves both possibilities open:

A fetch() promise only rejects when a network error is encountered (which is usually when there's a permissions issue or similar)

Here are a few examples from Node.js:

> await fetch('http://127.0.0.1:123/').catch(err => console.log(err.cause?.message ?? err.message))
bad port

> await fetch('http://invalid.example.com/').catch(err => console.log(err.cause?.message ?? err.message))
getaddrinfo ENOTFOUND invalid.example.com

> await fetch('http://127.0.0.1:9999/').catch(err => console.log(err.cause?.message ?? err.message))
connect ECONNREFUSED 127.0.0.1:9999

Let's detect these well-known causes and report them in a machine-readable form in the measurements. For example, we can invent new HTTP status codes for this:

  • 900 - cannot resolve hostname (getaddrinfo ENOTFOUND)
  • 901 - connection refused (connect ECONNREFUSED)

Everything else will be reported via networkError to allow us to discover more error cases later; see the sketch below.
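A sketch of what such a mapping could look like; the mapNetworkError helper and the exact string matching are illustrative, not a final implementation:

// Illustrative mapping of well-known fetch failure causes to synthetic status codes.
function mapNetworkError (err) {
  const cause = err.cause?.message ?? err.message ?? ''
  if (cause.includes('ENOTFOUND')) return 900    // cannot resolve hostname
  if (cause.includes('ECONNREFUSED')) return 901 // connection refused
  return undefined // unknown cause - report the raw message instead
}

const stats = { statusCode: null, networkError: null }
try {
  await fetch('http://invalid.example.com/')
} catch (err) {
  stats.statusCode = mapNetworkError(err) ?? null
  if (stats.statusCode === null) {
    stats.networkError = err.cause?.message ?? err.message
  }
}
console.log(stats) // { statusCode: 900, networkError: null }

Codes in the 9xx range cannot collide with real HTTP status codes (which stop at 599), so downstream dashboards can tell synthetic error codes apart from genuine server responses.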

However: Spark makes fetch requests with ipfs:// URLs; these requests are routed to the Lassie HTTP daemon running inside Zinnia. It's unclear under which conditions such a request would fail with a network error?!

Slack discussion: https://filecoinproject.slack.com/archives/C03PVLW721L/p1701174267028939

@bajtos
Member Author

bajtos commented Nov 29, 2023

Hmm, this is more complex than we thought. AFAICT, Spark always reports end_at, see here:

stats.endAt = new Date()

And yet we have measurements where it’s set to Date(0) (see the screenshot):
[Screenshot: 2023-11-29 at 16 32 50 - measurements with end_at set to Date(0)]
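One possible explanation, stated as an assumption rather than a confirmed finding: the module may actually submit endAt: null (see the next comment), and a later new Date(value) conversion on the ingestion side would silently turn that null into the Unix epoch:

// JavaScript coerces null to 0 in the Date constructor,
// so a missing end_at can silently become the Unix epoch downstream.
console.log(new Date(null).toISOString()) // 1970-01-01T00:00:00.000Z
console.log(new Date(0).toISOString())    // 1970-01-01T00:00:00.000Z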

bajtos changed the title from "Report network errors" to "Report retrieval network errors" on Nov 29, 2023
@bajtos
Member Author

bajtos commented Nov 29, 2023

In my Station module logs, I see that my SPARK checker is reporting measurements with end_at set to null.

[2023-11-27T17:24:47Z INFO  module:spark/main] Fetching ipfs://bafybeigkolhanv5xbu3pwmpfuzvu6pia4qtmtesahti32uzecg2oemqu3a?protocols=graphsync&providers=%2Fip4%2F138.113.9.192%2Ftcp%2F34561%2Fp2p%2F12D3KooWBiq7qeeN6D7D8iu8vomPjZEPmUD9deKn8UkHb8Kedcv8...

[2023-11-27T17:25:47Z ERROR module:spark/main] Failed to fetch ipfs://bafybeigkolhanv5xbu3pwmpfuzvu6pia4qtmtesahti32uzecg2oemqu3a?protocols=graphsync&providers=%2Fip4%2F138.113.9.192%2Ftcp%2F34561%2Fp2p%2F12D3KooWBiq7qeeN6D7D8iu8vomPjZEPmUD9deKn8UkHb8Kedcv8
[2023-11-27T17:25:47Z ERROR module:spark/main] DOMException: The signal has been aborted
[2023-11-27T17:25:47Z INFO  module:spark/main] Submitting measurement...
[2023-11-27T17:25:47Z INFO  module:spark/main] {
      sparkVersion: "1.6.0",
      zinniaVersion: "0.15.0",
      cid: "bafybeigkolhanv5xbu3pwmpfuzvu6pia4qtmtesahti32uzecg2oemqu3a",
      providerAddress: "/ip4/138.113.9.192/tcp/34561/p2p/12D3KooWBiq7qeeN6D7D8iu8vomPjZEPmUD9deKn8UkHb8Kedcv8",
      protocol: "graphsync",
      timeout: true,
      startAt: 2023-11-27T17:24:47.388Z,
      firstByteAt: null,
      endAt: null,
      carTooLarge: false,
      byteLength: 0,
      carChecksum: null,
      statusCode: null,
      participantAddress: "f410..."
    }

[2023-11-27T17:25:48Z INFO  module:spark/main] Measurement submitted (id: 85714174)
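A minimal sketch (not the actual Spark source) of how a timed-out retrieval can end up with timeout: true while firstByteAt and endAt stay null: if the abort fires before any response arrives, those fields are never assigned. The 60-second budget and the placeholder URL are assumptions for illustration.

// Hypothetical sketch - not the actual Spark source.
const stats = { startAt: new Date(), firstByteAt: null, endAt: null, timeout: false }

const controller = new AbortController()
const timer = setTimeout(() => {
  stats.timeout = true
  controller.abort() // the pending fetch rejects with an abort DOMException
}, 60_000)

try {
  const res = await fetch('ipfs://bafy...', { signal: controller.signal }) // placeholder URL
  stats.firstByteAt = new Date()
  await res.arrayBuffer()
  stats.endAt = new Date()
} catch (err) {
  // Abort before any response: firstByteAt and endAt were never assigned,
  // so the measurement is submitted with nulls, as in the log above.
  console.error(err)
} finally {
  clearTimeout(timer)
}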

@bajtos
Member Author

bajtos commented Nov 30, 2023

Over 40% of retrievals fail with an unknown error. This is possibly because fetch('ipfs://…') threw an error, which would mean there was a network error between the Zinnia module and the built-in Lassie HTTP server. Maybe rusty-lassie crashes on some systems and zinniad does not handle this properly?

@bajtos
Member Author

bajtos commented Jul 10, 2024

@juliangruber
Member

Thank you for the report here. Let's see how things change with the new Spark version?

And indeed, it's a bit tricky to infer from the data which error happened exactly.

@bajtos
Member Author

bajtos commented Jul 10, 2024

All changes are live, and the Retrieval Status chart shows the new descriptive result codes now. I am closing this issue as done.

[Screenshot: 2024-07-10 at 12 10 20 - Retrieval Status chart with the new descriptive result codes]
