Make graph-node more robust in the face of shards being down or restarting #2815

Closed
lutter wants to merge 27 commits into master from lutter/startup-failed-primary

Conversation

@lutter (Collaborator) commented Sep 23, 2021

This PR is a continuation of PR #2727 and rounds out the story of making graph-node handle shards being down during startup and during normal operation more gracefully, including coming back to a sane state when a failed shard comes back up. Even so, it's probably a good idea to restart graph-node after a shard restarted.

The basic ideas behind these changes are:

  • queries should fail fast if they involve a failed shard
  • queries should only fail if they involve a failed shard
  • when the primary fails, queries against other shards should still work (achieved with home-grown table replication)
  • indexing operations in a failed shard retry and block indefinitely (a minimal sketch of this retry pattern follows this list)
  • indexing operations in other shards are not affected by a shard failure, even when the primary failed
  • indexing will not make progress if the failed shard contains the block store for the subgraph's network, but will resume when the block store becomes available again
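
As a rough illustration of the retry behavior described in the indexing bullets above, here is a minimal Rust sketch of "retry and block indefinitely". The error type, function name, and backoff values are assumptions made for this sketch, not graph-node's actual code:

```rust
use std::time::Duration;

/// Stand-in for graph-node's store error type (an assumption for this sketch).
#[derive(Debug)]
enum StoreError {
    ConnectionLost(String),
    Other(String),
}

/// Retry `op` for as long as the failure looks like a lost connection,
/// with capped exponential backoff; any other error is propagated.
fn retry_while_unavailable<T>(
    mut op: impl FnMut() -> Result<T, StoreError>,
) -> Result<T, StoreError> {
    let mut delay = Duration::from_secs(1);
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(StoreError::ConnectionLost(msg)) => {
                // The shard is unreachable: wait and try again indefinitely
                // instead of failing the subgraph.
                eprintln!("shard unreachable ({msg}); retrying in {delay:?}");
                std::thread::sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(30));
            }
            Err(other) => return Err(other),
        }
    }
}
```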

Testing these things is unfortunately a very manual process. Testing was done under very little load, and it's possible that this did not reveal issues that only appear under high loads. Here are my notes on what I tested:

Setup

  • three shards: primary, sharda, and shardb
  • block store also in primary
  • one simple subgraph (dice2win) indexing in each shard

Notes

  • indexing service API returns errors when any shard is down, if that shard is needed to form the response
  • not clear how all this works under high load

Scenarios

sharda is down on startup

  • graph-node starts up
  • queries against primary and shardb work
  • queries against sharda fail fast
  • indexing in primary and shardb continues
  • queries against sharda work after it comes back up
  • indexing in sharda resumes when sharda comes back up

sharda goes down during operation

  • queries against primary and shardb work
  • queries against sharda fail fast
  • indexing in primary and shardb continues
  • queries against sharda work after it comes back up
  • indexing in sharda resumes when sharda comes back up

primary is down on startup

  • graph-node starts up
  • queries against sharda and shardb work
  • queries against primary fail fast
  • queries against primary work after it comes back up
  • indexing resumes when primary comes back up

primary goes down during operation

  • queries against sharda and shardb work
  • queries against primary fail fast
  • queries against primary work after it comes back up
  • indexing resumes when primary comes back up

@lutter force-pushed the lutter/startup-failed-primary branch from 5e4d5f9 to c073e3a on September 24, 2021 15:29
@evaporei (Contributor) left a comment

First of all, great PR 👏 It took me some time to review haha.

Well, I've posted multiple questions and suggestions, some are more style related though.

Some questions I have about the PR description are:

Even so, it's probably a good idea to restart graph-node after a shard restarted.

I'm curious, why is that?

queries should only fail if they involve a failed shard

What part of this scenario is not covered today by #2727? Also, after reading through your code, it seems that the behavior is to try to get a connection from a shard and if that fails, we try on the next one. How would a failed shard get a query invoked? 🤔 (maybe I just misunderstood your statement above)

indexing operations in a failed shard retry and block indefinitely

What happens today on the indexing part? I think I'm more aware on the query side.

Awesome test cases by the way, that must've been a lot of work 😅

Inline review comments (resolved) on: store/postgres/src/notification_listener.rs, store/postgres/src/primary.rs, store/postgres/src/connection_pool.rs, store/postgres/src/subgraph_store.rs
@lutter (Collaborator, Author) commented Sep 30, 2021

First of all, great PR 👏 It took me some time to review haha.

Thanks!

Some questions I have about the PR description are:

Even so, it's probably a good idea to restart graph-node after a shard restarted.

I'm curious, why is that?

There are at least two things that I am worried about: (1) I am not 100% confident that this catches all ways in which indexing could fail when a shard goes down (especially when the shard with the block store goes down at the wrong moment) and (2) because listeners reconnect after some delay, it's possible that they miss assignment events, and, e.g., a subgraph deployed during that time doesn't start syncing.

queries should only fail if they involve a failed shard

What part of this scenario is not covered today by #2727? Also, after reading through your code, it seems that the behavior is to try to get a connection from a shard and if that fails, we try on the next one. How would a failed shard get a query invoked? 🤔 (maybe I just misunderstood your statement above)

#2727 does not cover the scenario where the primary fails - without this PR, when the primary fails, most queries will fail, since the primary is needed to determine where a subgraph is stored. There's a bit of caching in memory of this data, but it's with a 120s TTL and doesn't cover the resolution of subgraph names to hashes.

The only data that we try getting from different shards is whatever is mentioned in primary::Mirror, but that's the crucial stuff that helps in resolving subgraph names to deployments, and finding which shard has the data for a deployment.
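
A minimal sketch of that fallback idea, assuming a simple closure-based read interface rather than the actual primary::Mirror API: a read of mirrored data tries the primary first and then each shard that holds a copy.

```rust
/// Try to read mirrored primary data from each shard in turn: the primary
/// comes first, then the shards that hold copies of the mirrored tables.
/// (Sketch only; the shard names and the `read` closure are assumptions.)
fn read_with_fallback<T>(
    shards: &[&str],
    read: impl Fn(&str) -> Result<T, String>,
) -> Result<T, String> {
    let mut last_err = String::from("no shards configured");
    for shard in shards {
        match read(shard) {
            Ok(value) => return Ok(value),
            Err(e) => {
                // This copy (e.g. the primary itself) is down; fall back to the next one.
                last_err = format!("shard {shard} failed: {e}");
            }
        }
    }
    Err(last_err)
}

// e.g. read_with_fallback(&["primary", "sharda", "shardb"], |shard| lookup_deployment(shard, "dice2win"))
// where `lookup_deployment` is a hypothetical helper that runs the actual query against a shard.
```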

indexing operations in a failed shard retry and block indefinitely

What happens today on the indexing part? I think I'm more aware on the query side.

The subgraph will just fail and needs to be restarted.

@lutter (Collaborator, Author) commented Sep 30, 2021

Addressed all review comments (I think)

@evaporei (Contributor) left a comment

I'm approving because it seems correct and I think I understood what most of your code does. However, I'm not sure if you want anyone else to take a look just to be more certain that none of the new code can have unexpected behavior in the database connection/query handling/execution.

@lutter force-pushed the lutter/startup-failed-primary branch from 08e5bf8 to 78cb1a8 on October 2, 2021 22:22
@lutter (Collaborator, Author) commented Oct 2, 2021

Rebased to latest master

@@ -81,6 +82,11 @@ impl NotificationListener {
/// channel.
///
/// Must call `.start()` to begin receiving notifications on the returned receiver.
//

Contributor review comment:
Small thing, but I think there should be another / here.

@evaporei (Contributor) commented Oct 3, 2021

Also, one thing I forgot to ask on the first review is about the Cows.
I read your commit description on it and it makes sense, but I didn't actually find where this is used in the retry.
I'm probably looking at the wrong place. Can you give me a hint on that please?

Commit messages:

  • Instead of crashing the process, try to reconnect indefinitely if we lose the connection to the database.
  • Manually insert/delete rows from the primary's deployment_schemas and chains tables. We can't use logical replication because we need to support Postgres 9.6.
  • Add a job that refreshes the mirror every 5 minutes (sketched below).
  • Use that facility for the places where the BlockStore reads from the primary.
  • We read the assignments for a node early during startup; with this change, a node will start up even when the primary is down.
  • This is in preparation of retrying operations: if we pass owned data, we need to clone before every attempt. Instead of forcing a clone in the common case of the first try succeeding, work with references. This also requires that we use Cow<Entity> because in rare cases we have to modify the entity to add fulltext search fields; in the common case where there is no fulltext search, we do not want to clone.
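
A hypothetical sketch of the mirror refresh job described in these commit messages; the `copy_table` closure, the table list, and the plain loop-plus-sleep scheduling are illustrative assumptions, not the real job in graph-node:

```rust
use std::{thread, time::Duration};

const REFRESH_INTERVAL: Duration = Duration::from_secs(5 * 60);
const MIRRORED_TABLES: &[&str] = &["deployment_schemas", "chains"];

/// Periodically re-copy the small primary tables into each shard. Plain
/// delete-and-insert is used because logical replication is not available
/// on Postgres 9.6. `copy_table` stands in for the code that reads a table
/// from the primary and rewrites the copy in a shard.
fn run_mirror_refresh(copy_table: impl Fn(&str) -> Result<(), String>) {
    loop {
        for table in MIRRORED_TABLES {
            if let Err(e) = copy_table(table) {
                // A failed refresh (e.g. because the primary is currently down)
                // is logged and retried on the next cycle.
                eprintln!("mirror refresh of {table} failed: {e}");
            }
        }
        thread::sleep(REFRESH_INTERVAL);
    }
}
```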
@lutter lutter force-pushed the lutter/startup-failed-primary branch from 78cb1a8 to 18cb404 Compare October 4, 2021 17:04
Commit messages:

  • When we transact block operations, or revert a block, ignore errors when sending store events. Since store events sent for these operations are only used to update subscriptions, it is better to carry on than to fail the operation and with it the subgraph.
  • Getting a connection can block indefinitely and therefore lead to a lot of work queueing up on query nodes; rather than that, operations should fail when we cannot get a connection (sketched below).
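
A minimal sketch of that fail-fast behavior, assuming a hypothetical non-blocking `try_get` on the connection pool (this is not the actual pool API): the wait for a connection is bounded, and the operation fails once the deadline passes instead of queueing up behind a dead shard.

```rust
use std::time::{Duration, Instant};

const CHECKOUT_TIMEOUT: Duration = Duration::from_secs(5);

/// Try to check out a connection, but give up after CHECKOUT_TIMEOUT so that
/// query work does not pile up behind an unreachable shard.
fn get_conn_fail_fast<C>(try_get: impl Fn() -> Option<C>) -> Result<C, String> {
    let deadline = Instant::now() + CHECKOUT_TIMEOUT;
    loop {
        if let Some(conn) = try_get() {
            return Ok(conn);
        }
        if Instant::now() >= deadline {
            // Fail the operation instead of blocking indefinitely.
            return Err("shard unavailable: could not get a connection".to_string());
        }
        std::thread::sleep(Duration::from_millis(100));
    }
}
```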
@lutter force-pushed the lutter/startup-failed-primary branch from 18cb404 to 393ce06 on October 4, 2021 17:08
@lutter (Collaborator, Author) commented Oct 4, 2021

Also, one thing I forgot to ask on the first review is about the Cows. I read your commit description on it and it makes sense, but I didn't actually find where this is used in the retry. I'm probably looking at the wrong place. Can you give me a hint on that please?

What changed with retrying is that we need two ways to access the entities we are about to write to the store: one for forming the actual relational_queries::InsertQuery and one if the insert fails and we need to retry. Without retrying, we just passed ownership of the entities into the InsertQuery, where the entities can get mutated if there are fulltext search fields. Because we still need access to the entities for a possible retry, we can't pass ownership to InsertQuery any more. At the same time, mutating the entities in InsertQuery is pretty rare, so we don't want to just clone entities on every insert. That's where the CoW comes into play since it allows us to pass references into the InsertQuery: in the common case, the insert will just use references, and only when there are fulltext search fields will a clone happen.
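
A much simplified sketch of that pattern; the Entity type and the fulltext handling here are stand-ins, not the real relational_queries::InsertQuery. The query holds Cow<Entity> values, so the caller keeps ownership of the entities and can construct a fresh query for a retry, and a clone only happens when a fulltext field has to be added:

```rust
use std::borrow::Cow;
use std::collections::BTreeMap;

/// Stand-in for graph-node's Entity type (an assumption for this sketch).
type Entity = BTreeMap<String, String>;

struct InsertQuery<'a> {
    entities: Vec<Cow<'a, Entity>>,
}

impl<'a> InsertQuery<'a> {
    fn new(entities: &'a [Entity], fulltext_field: Option<&str>) -> Self {
        let entities = entities
            .iter()
            .map(|e| match fulltext_field {
                // Rare case: a derived fulltext column has to be added, so clone.
                Some(field) => {
                    let mut owned = e.clone();
                    owned.insert(field.to_string(), "<tsvector>".to_string());
                    Cow::Owned(owned)
                }
                // Common case: no fulltext search, just borrow.
                None => Cow::Borrowed(e),
            })
            .collect();
        InsertQuery { entities }
    }
}
```

Because the caller still owns the slice, a failed insert can simply build another InsertQuery from the same entities for the retry, without any clone in the common case.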

@lutter closed this Oct 4, 2021
@lutter deleted the lutter/startup-failed-primary branch on October 4, 2021 17:59
@lutter restored the lutter/startup-failed-primary branch on October 4, 2021 18:19
@lutter (Collaborator, Author) commented Oct 4, 2021

I made a mistake when merging this (not pushing the updated master). This is fixed now, and the PR has been merged at commit 042cf0a.
