Tolerate a non-primary shard being down during startup #2727

Merged · 8 commits merged into master from lutter/startup-failed-shard on Aug 27, 2021

Conversation

@lutter (Collaborator) commented on Aug 20, 2021

These changes make it so that if a non-primary shard is not available (down) when graph-node starts up, graph-node will still start and respond to queries against subgraphs in the other shards. Furthermore, while the shard is down, graph-node now responds to queries against the failed shard with an error much more quickly (roughly within GRAPH_STORE_CONNECTION_TIMEOUT), whereas before it held on to the query indefinitely.

Scenarios I tested when a non-primary shard is down:

  • graph-node starts up successfully
  • queries against other shards work
  • queries against the failed shard return an error (there's a bit too much internal info in the error message though)
  • queries against the failed shard work properly once that shard becomes available

@lutter lutter force-pushed the lutter/startup-failed-shard branch from 7708109 to 2c9f35e Compare August 24, 2021 00:07
@tilacog tilacog self-requested a review August 27, 2021 14:19

Unfortunately, we can't just slap a "set port 'NNNN'" into the 'alter
server' statement since we have installations that have a server without a
port, and for those we first need to "add port 'NNNN'".
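For illustration, the two forms of the statement differ roughly as follows; the server name and port below are placeholders, not values from this PR:

```sql
-- Server created without a port option: the option must be added first.
ALTER SERVER shard_b OPTIONS (ADD port '5433');

-- Server that already has a port option: it can only be changed with SET.
ALTER SERVER shard_b OPTIONS (SET port '5433');
```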
When we initialize a connection pool, we should only assume that the
database for that pool is up and running. Previously, we used `import
foreign schema` to create fdw tables locally, but that requires that
Postgres access the remote server, and will therefore not work when that
server is not up.

We now create these tables by assuming the corresponding table on the
remote server has the same schema as our local table, which avoids
accessing the remote server. It also removes a subtle race where we could
have pulled in the schema from a remote server before that ran its latest
migrations.
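As a rough sketch of the difference (the server, schema, table, and column names here are illustrative, not graph-node's actual schema):

```sql
-- Previously: pulls table definitions from the remote server, which
-- requires that server to be reachable during setup.
IMPORT FOREIGN SCHEMA subgraphs
  FROM SERVER shard_b INTO shard_b_subgraphs;

-- Now: declare the foreign table from the locally known schema, so the
-- remote server does not need to be contacted.
CREATE FOREIGN TABLE shard_b_subgraphs.deployment (
    id     text    NOT NULL,
    failed boolean NOT NULL,
    synced boolean NOT NULL
) SERVER shard_b
  OPTIONS (schema_name 'subgraphs', table_name 'deployment');
```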
This helps make it clearer how the underlying Diesel pool is accessed.

When one of the configured shards is not available, defer setting it
up (e.g., running migrations) until it becomes available.
@lutter lutter force-pushed the lutter/startup-failed-shard branch from 2c9f35e to 6775ad9 Compare August 27, 2021 21:14
@lutter lutter merged commit 6775ad9 into master Aug 27, 2021
@lutter lutter deleted the lutter/startup-failed-shard branch August 27, 2021 21:14