Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix tenant config & tenant relocation integration #1555

Closed
lubennikovaav opened this issue Apr 21, 2022 · 0 comments · Fixed by #4255
Closed

Fix tenant config & tenant relocation integration #1555

lubennikovaav opened this issue Apr 21, 2022 · 0 comments · Fixed by #4255
Assignees
Labels
c/storage/pageserver Component: storage: pageserver
Milestone

Comments

@lubennikovaav
Copy link
Contributor

Now tenant-specific config doesn't survive the relocation.
We need to persist the config (in console database?) and pass it to tenant attach command.
Alternatively, we may save-restore config file via S3.

@LizardWizzard LizardWizzard self-assigned this Apr 21, 2022
@neondatabase-bot neondatabase-bot bot added this to the 0.7 Towards Tech Prev milestone Apr 21, 2022
@stepashka stepashka added the c/storage/pageserver Component: storage: pageserver label Apr 21, 2022
@neondatabase-bot neondatabase-bot bot modified the milestones: 2022/06, 2022/08 Jul 25, 2022
@neondatabase-bot neondatabase-bot bot modified the milestones: 2022/08, 2022/09 Sep 5, 2022
@neondatabase-bot neondatabase-bot bot modified the milestones: 2022/09, 2022/10 Oct 17, 2022
@neondatabase-bot neondatabase-bot bot removed this from the 2022/10 milestone Nov 9, 2022
@neondatabase-bot neondatabase-bot bot added this to the 2023/01 milestone Dec 20, 2022
@neondatabase-bot neondatabase-bot bot modified the milestones: 2023/01, 2023/02 Jan 9, 2023
@neondatabase-bot neondatabase-bot bot modified the milestones: 2023/02, 2023/03 Feb 19, 2023
@problame problame assigned problame and unassigned LizardWizzard Apr 24, 2023
problame added a commit that referenced this issue May 4, 2023
We currently return 202 as soon as the tenant is allocated in memory
before we've written out the marker file. So, the /attach API currently
does not have a transactional character. For example, it can happen that
we respond with a 202 and then crash before writing out the marker
file. In such a case, it is important that the client

1. observes the lost attach (by polling tenant status and observing 404)
2. and consequently retries the attach.

It has to do it in this loop until it observes the tenant as "Active"
in the tenant status.  If the client doesn't follow this protocol and instead
goes to another pageserver to attach the tenant, we risk a split-brain situation
where both the first and second pageserver write to the tenant's S3 state.

The improved description highlights the consequences of this
behavior for clients that use the /attach endpoint.

My understanding is that the tenant relocation logic being worked on in
cloud#4740 implements retries, but only until it gets a 202. It will need to
implement the polling after the 202 to be correct.

The motivation for this write-up is that, in a future PR, we'll extend
the /attach endpoint with an option to provide the tenant config.
If we decide to leave the non-transactional behavior of /attach
unmodified, we will be able to avoid persisting the tenant config.
Conversely, if we decide that the /attach API should become
transactional, we'll need to persist the tenant config in the
attach-marker-file before acknowledging receipt of the /attach operation.

refs neondatabase/cloud#4740
refs #2238
refs #1555
problame added a commit that referenced this issue May 5, 2023
We currently return 202 as soon as the tenant is allocated in memory
before we've written out the marker file. So, the /attach API currently
does not have a transactional character. For example, it can happen that
we respond with a 202 and then crash before writing out the marker file.
In such a case, it is important that the client

1. observes the lost attach (by polling tenant status and observing 404)
2. and consequently retries the attach.

It has to do it in this loop until it observes the tenant as "Active" in
the tenant status. If the client doesn't follow this protocol and
instead goes to another pageserver to attach the tenant, we risk a
split-brain situation where both the first and second pageserver write
to the tenant's S3 state.

The improved description highlights the consequences of this behavior
for clients that use the /attach endpoint.

The tenant relocation that is currently being implemented in cloud#4740
implements retries of Attach and it does poll afterwards, but, it polls
`has_in_progress_downloads`.
That is incorrect, as described in the patch body.

The motivation for this write-up is that, in a future PR, we'll extend
the /attach endpoint with an option to provide the tenant config. If we
decide to leave the non-transactional behavior of /attach unmodified, we
will be able to avoid persisting the tenant config. Conversely, if we
decide that the /attach API should become transactional, we'll need to
persist the tenant config in the attach-marker-file before acknowledging
receipt of the /attach operation.

refs neondatabase/cloud#4740
refs #2238
refs #1555
problame added a commit that referenced this issue May 16, 2023
…cessing (#4235)

With this patch, the attach handler now follows the same pattern as
tenant create with regards to instantiation of the new tenant:

1. Prepare on-disk state using `create_tenant_files`.
2. Use the same code path as pageserver startup to load it into memory
and start background loops (`schedule_local_tenant_processing`).

It's a bit sad we can't use the
`PageServerConfig::tenant_attaching_mark_file_path` method inside
`create_tenant_files` because it operates in a temporary directory.
However, it's a small price to pay for the gained simplicity.

During implementation, I noticed that we don't handle failures post
`create_tenant_files` well. I left TODO comments in the code linking to
the issue that I created for this [^1].

Also, I'll dedupe the spawn_load and spawn_attach code in a future
commit.

refs #1555
part of #886 (Tenant
Relocation)

[^1]: #4233
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants