
Request hedging #52

Open
hannahhoward opened this issue Jan 12, 2024 · 2 comments

Comments

@hannahhoward
Contributor

What

Currently our JS client offers two kinds of smart fetches:

  1. We fetch through DNS, and then if the request completely fails (after 5 seconds) we retry as a fallback using a node returned from the orchestrator.

  2. We immediately race both DNS AND multiple nodes returned from the orchestrator, and take the request that returns a first byte first.
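For illustration, the second (racing) strategy could be sketched with `Promise.any` and `AbortController`. The names here (`raceAll`, the shape of the request functions) are hypothetical, not the client's actual API:

```javascript
// Sketch of the racing strategy: fire every request at once and keep
// whichever produces a result first, cancelling the rest.
// Each entry in requestFns is assumed to be a function that takes an
// AbortSignal and resolves once the request returns a first byte.
function raceAll(requestFns) {
  const controllers = requestFns.map(() => new AbortController());
  return Promise.any(
    requestFns.map((fn, i) => fn(controllers[i].signal).then((r) => ({ r, i })))
  ).then(({ r, i }) => {
    // Cancel all losers once a winner has responded.
    controllers.forEach((ctl, j) => {
      if (j !== i) ctl.abort();
    });
    return r;
  });
}
```

This is exactly why the duplicate-traffic problem arises: every request is actually sent, and losers are only aborted after the winner responds.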

We propose a third "request hedging" approach:

  1. Initiate a request with DNS
  2. If a period equal to Saturn's P90 TTFB passes without receiving a first byte, start a second request to an orchestrator node, and take whichever request returns a first byte first (cancelling the other)
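The two steps above could be sketched roughly as follows. This is a minimal illustration, not the client's real implementation: `hedgedRequest`, the request-function shape (a function taking an `AbortSignal` that resolves on first byte), and the parameter names are all assumptions, and error handling is omitted for brevity:

```javascript
// Sketch of the proposed hedging strategy: start the primary (DNS) request,
// and only if it has not produced a first byte within hedgeDelayMs
// (e.g. Saturn's P90 TTFB) start a second request to an orchestrator node.
// Whichever returns first wins; the loser is aborted.
async function hedgedRequest(primary, fallback, hedgeDelayMs) {
  const primaryCtl = new AbortController();
  const fallbackCtl = new AbortController();

  const primaryReq = primary(primaryCtl.signal);

  // Give the primary request a budget of hedgeDelayMs before hedging.
  const timer = new Promise((resolve) =>
    setTimeout(() => resolve("timeout"), hedgeDelayMs)
  );
  const first = await Promise.race([primaryReq.then((r) => ({ r })), timer]);
  if (first !== "timeout") return first.r; // primary won within the budget

  // Primary is slow: start the hedge request and take whichever request
  // returns a first byte first, cancelling the other.
  const fallbackReq = fallback(fallbackCtl.signal);
  const winner = await Promise.race([
    primaryReq.then((r) => ({ r, loser: fallbackCtl })),
    fallbackReq.then((r) => ({ r, loser: primaryCtl })),
  ]);
  winner.loser.abort();
  return winner.r;
}
```

Note that in the common case (primary responds within the P90 budget) only one request is ever sent, which is how this approach avoids the duplicate traffic generated by racing everything up front.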

Why

Our first approach provides a poor experience -- while the fallback prevents a complete failure, it kicks in only after such a long delay (5 seconds) that the user experience is terrible.

We have found that our second approach generates a large amount of duplicate traffic -- we've even overloaded the log ingestor a couple of times this way.

This approach essentially aims to improve the fallback experience of our first approach without incurring the problems associated with the second approach.

Cost

To get this done we would need to:

  • implement the third "request hedging" approach in the JS client, available as an option on the request
  • for the P90 TTFB, for now we might simply hard-code our current value
    • a future improvement might retrieve this value from the orchestrator along with the list of proximate nodes
  • deploy and test as an experiment on the arc network
  • deploy as the primary strategy in ARC and the service worker
@reidlw
Copy link

reidlw commented Jan 16, 2024

I'm going to push back on something like this: we shouldn't start on it until we have defined and implemented the metric(s) that tell us whether there is a problem, how bad it is, and whether what we do fixes it.

If there's a reasonable hypothesis that this is, in fact, a problem with the service that we should prioritize work on, then maybe we start this work with a task to gather the data to build the case.

At the end of the day this feature (likely) improves tail TTFBs. Do we think that's a high priority investment area right now?

@prodalex

I understand this was suggested as a countermeasure from a post-mortem on node operator error rates.
Can we therefore clarify how much this proposed item addresses reliability and production improvements vs. tail performance? I see that the following production problem is being addressed: "duplicate traffic -- we've even overloaded the log ingestor a couple times this way."
The best way to evaluate this would be to assume 30 customers using the service worker (for example, if we open the portal in the near term). Would that cause more production problems and overload the log ingestor even further?
