Tolerate a non-primary shard being down during startup #2727

Merged · 8 commits merged into master from lutter/startup-failed-shard on Aug 27, 2021

Conversation

@lutter (Collaborator) commented on Aug 20, 2021

These changes make it so that if a non-primary shard is not available (down) when graph-node starts up, graph-node will still start and respond to queries against subgraphs in the other shards. Furthermore, while the shard is down, graph-node now responds to queries against the failed shard with an error much more quickly (roughly within GRAPH_STORE_CONNECTION_TIMEOUT), whereas before it held on to the query indefinitely.

Scenarios I tested when a non-primary shard is down:

  • graph-node starts up successfully
  • queries against other shards work
  • queries against the failed shard return an error (there's a bit too much internal info in the error message though)
  • queries against the failed shard work properly once that shard becomes available

@lutter lutter force-pushed the lutter/startup-failed-shard branch from 7708109 to 2c9f35e Compare August 24, 2021 00:07
@tilacog tilacog self-requested a review August 27, 2021 14:19

Unfortunately, we can't just slap a "set port 'NNNN'" into the 'alter
server' statement since we have installations that have a server without a
port, and for those we first need to "add port 'NNNN'".
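For illustration, the two forms of the statement differ roughly as follows; the server name and port below are placeholders, not values from this PR:

```sql
-- Server created without a port option: the option must be added first.
ALTER SERVER shard_b OPTIONS (ADD port '5433');

-- Server that already has a port option: it can only be changed with SET.
ALTER SERVER shard_b OPTIONS (SET port '5433');
```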
When we initialize a connection pool, we should only assume that the
database for that pool is up and running. Previously, we used `import
foreign schema` to create fdw tables locally, but that requires that
Postgres access the remote server, and will therefore not work when that
server is not up.

We now create these tables by assuming the corresponding table on the
remote server has the same schema as our local table, which avoids
accessing the remote server. It also removes a subtle race where we could
have pulled in the schema from a remote server before that ran its latest
migrations.
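As a rough sketch of the difference (the server, schema, table, and column names here are illustrative, not graph-node's actual schema):

```sql
-- Previously: pulls table definitions from the remote server, which
-- requires that server to be reachable during setup.
IMPORT FOREIGN SCHEMA subgraphs
  FROM SERVER shard_b INTO shard_b_subgraphs;

-- Now: declare the foreign table from the locally known schema, so the
-- remote server does not need to be contacted.
CREATE FOREIGN TABLE shard_b_subgraphs.deployment (
    id     text    NOT NULL,
    failed boolean NOT NULL,
    synced boolean NOT NULL
) SERVER shard_b
  OPTIONS (schema_name 'subgraphs', table_name 'deployment');
```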
This helps make it clearer how the underlying Diesel pool is accessed.

When one of the configured shards is not available, defer setting it
up (e.g., running migrations) until it becomes available.
@lutter lutter force-pushed the lutter/startup-failed-shard branch from 2c9f35e to 6775ad9 Compare August 27, 2021 21:14
@lutter lutter merged commit 6775ad9 into master Aug 27, 2021
@lutter lutter deleted the lutter/startup-failed-shard branch August 27, 2021 21:14