Make graph-node more robust in the face of shards being down or restarting #2815

Closed
lutter wants to merge 27 commits into master from lutter/startup-failed-primary

Conversation

@lutter (Collaborator) commented Sep 23, 2021

This PR is a continuation of PR #2727 and rounds out the story of making graph-node handle shards being down during startup and during normal operation more gracefully, including coming back to a sane state when a failed shard comes back up. Even so, it's probably a good idea to restart graph-node after a shard restarted.

The basic ideas behind these changes are:

  • queries should fail fast if they involve a failed shard
  • queries should only fail if they involve a failed shard
  • when the primary fails, queries against other shards should still work (achieved with home-grown table replication)
  • indexing operations in a failed shard retry and block indefinitely (a minimal sketch of this retry pattern follows this list)
  • indexing operations in other shards are not affected by a shard failure, even when the primary failed
  • indexing will not make progress if the failed shard contains the block store for the subgraph's network, but will resume when the block store becomes available again
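
As a rough illustration of the retry behavior described in the indexing bullets above, here is a minimal Rust sketch of "retry and block indefinitely". The error type, function name, and backoff values are assumptions made for this sketch, not graph-node's actual code:

```rust
use std::time::Duration;

/// Stand-in for graph-node's store error type (an assumption for this sketch).
#[derive(Debug)]
enum StoreError {
    ConnectionLost(String),
    Other(String),
}

/// Retry `op` for as long as the failure looks like a lost connection,
/// with capped exponential backoff; any other error is propagated.
fn retry_while_unavailable<T>(
    mut op: impl FnMut() -> Result<T, StoreError>,
) -> Result<T, StoreError> {
    let mut delay = Duration::from_secs(1);
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(StoreError::ConnectionLost(msg)) => {
                // The shard is unreachable: wait and try again indefinitely
                // instead of failing the subgraph.
                eprintln!("shard unreachable ({msg}); retrying in {delay:?}");
                std::thread::sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(30));
            }
            Err(other) => return Err(other),
        }
    }
}
```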

Testing these things is unfortunately a very manual process. Testing was done under very little load, and it's possible that this did not reveal issues that only appear under high loads. Here are my notes on what I tested:

Setup

  • three shards: primary, sharda, and shardb
  • block store also in primary
  • one simple subgraph (dice2win) indexing in each shard

Notes

  • indexing service API returns errors when any shard is down, if that shard is needed to form the response
  • not clear how all this works under high load

Scenarios

sharda is down on startup

  • graph-node starts up
  • queries against primary and shardb work
  • queries against sharda fail fast
  • indexing in primary and shardb continues
  • queries against sharda work after it comes back up
  • indexing in sharda resumes when sharda comes back up

sharda goes down during operation

  • queries against primary and shardb work
  • queries against sharda fail fast
  • indexing in primary and shardb continues
  • queries against sharda work after it comes back up
  • indexing in sharda resumes when sharda comes back up

primary is down on startup

  • graph-node starts up
  • queries against sharda and shardb work
  • queries against primary fail fast
  • queries against primary work after it comes back up
  • indexing resumes when primary comes back up

primary goes down during operation

  • queries against sharda and shardb work
  • queries against primary fail fast
  • queries against primary work after it comes back up
  • indexing resumes when primary comes back up

@lutter force-pushed the lutter/startup-failed-primary branch from 5e4d5f9 to c073e3a on September 24, 2021 15:29
@evaporei (Contributor) left a comment

First of all, great PR 👏 It took me some time to review haha.

Well, I've posted multiple questions and suggestions, some are more style related though.

Some questions I have about the PR description are:

Even so, it's probably a good idea to restart graph-node after a shard restarted.

I'm curious, why is that?

queries should only fail if they involve a failed shard

What part of this scenario is not covered today by #2727? Also, after reading through your code, it seems that the behavior is to try to get a connection from a shard and if that fails, we try on the next one. How would a failed shard get a query invoked? 🤔 (maybe I just misunderstood your statement above)

indexing operations in a failed shard retry and block indefinitely

What happens today on the indexing part? I think I'm more aware on the query side.

Awesome test cases by the way, that must've been a lot of work 😅

Inline review comments (resolved) on: store/postgres/src/notification_listener.rs, store/postgres/src/primary.rs, store/postgres/src/connection_pool.rs, store/postgres/src/subgraph_store.rs
@lutter (Collaborator, Author) commented Sep 30, 2021

First of all, great PR 👏 It took me some time to review haha.

Thanks!

Some questions I have about the PR description are:

Even so, it's probably a good idea to restart graph-node after a shard restarted.

I'm curious, why is that?

There are at least two things that I am worried about: (1) I am not 100% confident that this catches all ways in which indexing could fail when a shard goes down (especially when the shard with the block store goes down at the wrong moment) and (2) because listeners reconnect after some delay, it's possible that they miss assignment events, and, e.g., a subgraph deployed during that time doesn't start syncing.

queries should only fail if they involve a failed shard

What part of this scenario is not covered today by #2727? Also, after reading through your code, it seems that the behavior is to try to get a connection from a shard and if that fails, we try on the next one. How would a failed shard get a query invoked? 🤔 (maybe I just misunderstood your statement above)

#2727 does not cover the scenario where the primary fails - without this PR, when the primary fails, most queries will fail, since the primary is needed to determine where a subgraph is stored. There's a bit of caching in memory of this data, but it's with a 120s TTL and doesn't cover the resolution of subgraph names to hashes.

The only data that we try getting from different shards is whatever is mentioned in primary::Mirror, but that's the crucial stuff that helps in resolving subgraph names to deployments, and finding which shard has the data for a deployment.
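
A minimal sketch of that fallback idea, assuming a simple closure-based read interface rather than the actual primary::Mirror API: a read of mirrored data tries the primary first and then each shard that holds a copy.

```rust
/// Try to read mirrored primary data from each shard in turn: the primary
/// comes first, then the shards that hold copies of the mirrored tables.
/// (Sketch only; the shard names and the `read` closure are assumptions.)
fn read_with_fallback<T>(
    shards: &[&str],
    read: impl Fn(&str) -> Result<T, String>,
) -> Result<T, String> {
    let mut last_err = String::from("no shards configured");
    for shard in shards {
        match read(shard) {
            Ok(value) => return Ok(value),
            Err(e) => {
                // This copy (e.g. the primary itself) is down; fall back to the next one.
                last_err = format!("shard {shard} failed: {e}");
            }
        }
    }
    Err(last_err)
}

// e.g. read_with_fallback(&["primary", "sharda", "shardb"], |shard| lookup_deployment(shard, "dice2win"))
// where `lookup_deployment` is a hypothetical helper that runs the actual query against a shard.
```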

indexing operations in a failed shard retry and block indefinitely

What happens today on the indexing part? I think I'm more aware on the query side.

The subgraph will just fail and needs to be restarted.

@lutter (Collaborator, Author) commented Sep 30, 2021

Addressed all review comments (I think)

@evaporei (Contributor) left a comment

I'm approving because it seems correct and I think I understood what most of your code does. However, I'm not sure if you want anyone else to take a look just to be more certain that none of the new code can have unexpected behavior in the database connection/query handling/execution.

@lutter force-pushed the lutter/startup-failed-primary branch from 08e5bf8 to 78cb1a8 on October 2, 2021 22:22
@lutter (Collaborator, Author) commented Oct 2, 2021

Rebased to latest master

@@ -81,6 +82,11 @@ impl NotificationListener {
/// channel.
///
/// Must call `.start()` to begin receiving notifications on the returned receiver.
//

Contributor review comment:
Small thing, but I think there should be another / here.

@evaporei (Contributor) commented Oct 3, 2021

Also, one thing I forgot to ask on the first review is about the Cows.
I read your commit description on it and it makes sense, but I didn't actually find where this is used in the retry.
I'm probably looking at the wrong place. Can you give me a hint on that please?

Commit messages:

  • Instead of crashing the process, try to reconnect indefinitely if we lose the connection to the database.
  • Manually insert/delete rows from the primary's deployment_schemas and chains tables. We can't use logical replication because we need to support Postgres 9.6.
  • Add a job that refreshes the mirror every 5 minutes (sketched below).
  • Use that facility for the places where the BlockStore reads from the primary.
  • We read the assignments for a node early during startup; with this change, a node will start up even when the primary is down.
  • This is in preparation of retrying operations: if we pass owned data, we need to clone before every attempt. Instead of forcing a clone in the common case of the first try succeeding, work with references. This also requires that we use Cow<Entity> because in rare cases we have to modify the entity to add fulltext search fields; in the common case where there is no fulltext search, we do not want to clone.
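
A hypothetical sketch of the mirror refresh job described in these commit messages; the `copy_table` closure, the table list, and the plain loop-plus-sleep scheduling are illustrative assumptions, not the real job in graph-node:

```rust
use std::{thread, time::Duration};

const REFRESH_INTERVAL: Duration = Duration::from_secs(5 * 60);
const MIRRORED_TABLES: &[&str] = &["deployment_schemas", "chains"];

/// Periodically re-copy the small primary tables into each shard. Plain
/// delete-and-insert is used because logical replication is not available
/// on Postgres 9.6. `copy_table` stands in for the code that reads a table
/// from the primary and rewrites the copy in a shard.
fn run_mirror_refresh(copy_table: impl Fn(&str) -> Result<(), String>) {
    loop {
        for table in MIRRORED_TABLES {
            if let Err(e) = copy_table(table) {
                // A failed refresh (e.g. because the primary is currently down)
                // is logged and retried on the next cycle.
                eprintln!("mirror refresh of {table} failed: {e}");
            }
        }
        thread::sleep(REFRESH_INTERVAL);
    }
}
```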
@lutter lutter force-pushed the lutter/startup-failed-primary branch from 78cb1a8 to 18cb404 Compare October 4, 2021 17:04
Commit messages:

  • When we transact block operations, or revert a block, ignore errors when sending store events. Since store events sent for these operations are only used to update subscriptions, it is better to carry on than to fail the operation and with it the subgraph.
  • Getting a connection can block indefinitely and therefore lead to a lot of work queueing up on query nodes; rather than that, operations should fail when we cannot get a connection (sketched below).
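
A minimal sketch of that fail-fast behavior, assuming a hypothetical non-blocking `try_get` on the connection pool (this is not the actual pool API): the wait for a connection is bounded, and the operation fails once the deadline passes instead of queueing up behind a dead shard.

```rust
use std::time::{Duration, Instant};

const CHECKOUT_TIMEOUT: Duration = Duration::from_secs(5);

/// Try to check out a connection, but give up after CHECKOUT_TIMEOUT so that
/// query work does not pile up behind an unreachable shard.
fn get_conn_fail_fast<C>(try_get: impl Fn() -> Option<C>) -> Result<C, String> {
    let deadline = Instant::now() + CHECKOUT_TIMEOUT;
    loop {
        if let Some(conn) = try_get() {
            return Ok(conn);
        }
        if Instant::now() >= deadline {
            // Fail the operation instead of blocking indefinitely.
            return Err("shard unavailable: could not get a connection".to_string());
        }
        std::thread::sleep(Duration::from_millis(100));
    }
}
```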
@lutter force-pushed the lutter/startup-failed-primary branch from 18cb404 to 393ce06 on October 4, 2021 17:08
@lutter (Collaborator, Author) commented Oct 4, 2021

Also, one thing I forgot to ask on the first review is about the Cows. I read your commit description on it and it makes sense, but I didn't actually find where this is used in the retry. I'm probably looking at the wrong place. Can you give me a hint on that please?

What changed with retrying is that we need two ways to access the entities we are about to write to the store: one for forming the actual relational_queries::InsertQuery and one if the insert fails and we need to retry. Without retrying, we just passed ownership of the entities into the InsertQuery, where the entities can get mutated if there are fulltext search fields. Because we still need access to the entities for a possible retry, we can't pass ownership to InsertQuery any more. At the same time, mutating the entities in InsertQuery is pretty rare, so we don't want to just clone entities on every insert. That's where the CoW comes into play since it allows us to pass references into the InsertQuery: in the common case, the insert will just use references, and only when there are fulltext search fields will a clone happen.
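
A much simplified sketch of that pattern; the Entity type and the fulltext handling here are stand-ins, not the real relational_queries::InsertQuery. The query holds Cow<Entity> values, so the caller keeps ownership of the entities and can construct a fresh query for a retry, and a clone only happens when a fulltext field has to be added:

```rust
use std::borrow::Cow;
use std::collections::BTreeMap;

/// Stand-in for graph-node's Entity type (an assumption for this sketch).
type Entity = BTreeMap<String, String>;

struct InsertQuery<'a> {
    entities: Vec<Cow<'a, Entity>>,
}

impl<'a> InsertQuery<'a> {
    fn new(entities: &'a [Entity], fulltext_field: Option<&str>) -> Self {
        let entities = entities
            .iter()
            .map(|e| match fulltext_field {
                // Rare case: a derived fulltext column has to be added, so clone.
                Some(field) => {
                    let mut owned = e.clone();
                    owned.insert(field.to_string(), "<tsvector>".to_string());
                    Cow::Owned(owned)
                }
                // Common case: no fulltext search, just borrow.
                None => Cow::Borrowed(e),
            })
            .collect();
        InsertQuery { entities }
    }
}
```

Because the caller still owns the slice, a failed insert can simply build another InsertQuery from the same entities for the retry, without any clone in the common case.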

@lutter closed this Oct 4, 2021
@lutter deleted the lutter/startup-failed-primary branch on October 4, 2021 17:59
@lutter restored the lutter/startup-failed-primary branch on October 4, 2021 18:19
@lutter (Collaborator, Author) commented Oct 4, 2021

I made a mistake when merging this (not pushing the updated master). This is fixed now, and the PR has been merged at commit 042cf0a.
