Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: two concurrent timeline creations do not coalesce #7208

Closed
koivunej opened this issue Mar 22, 2024 · 3 comments · Fixed by #7225
Closed

pageserver: two concurrent timeline creations do not coalesce #7208

koivunej opened this issue Mar 22, 2024 · 3 comments · Fixed by #7225
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug

Comments

@koivunej
Copy link
Member

Before #6139, two identical timeline creation requests started in a retrying-until-complete situation:

  1. the 1st request spawns off the operation, which takes more than the client is willing to wait
  2. the 2nd request before the spawned-off operation completes

Before #6139, the uploads were awaited, guaranteeing that the original operation would have been completed.

Currently, the 2nd request is responded to with 409 immediately instead of waiting for the operation to be completed.

The fix would be to wait for the operation's completion before returning at all.

Context: https://neondb.slack.com/archives/C06K38EB05D/p1711114590911899?thread_ts=1711093964.514849&cid=C06K38EB05D

Returning of AlreadyCreating:

neon/pageserver/src/tenant.rs

Lines 1443 to 1446 in 62b318c

// Creation is in progress, we cannot create it again, and we cannot
// check if this request matches the existing one, so caller must try
// again later.
return Err(CreateTimelineError::AlreadyCreating);

AlreadyCreating is converted to 409:

Err(
e @ tenant::CreateTimelineError::Conflict
| e @ tenant::CreateTimelineError::AlreadyCreating,
) => json_response(StatusCode::CONFLICT, HttpErrorBody::from_msg(e.to_string())),

@koivunej koivunej added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels Mar 22, 2024
@jcsp
Copy link
Collaborator

jcsp commented Mar 22, 2024

Before #6139, the uploads were awaited, guaranteeing that the original operation would have been completed.

That part didn't change. If we receive a creation request and the timeline already exists, we still wait for its remote uploads.

The only change is in the case that the uninitialized guard exists: we used to return 500 (which was a bug), and now we return 409.

In general, I don't consider it a bug that we don't coalesce identical repeated requests. As long as repeated requests eventually succeed, that's okay. We can tweak the response code if necessary, although I don't want to use 503 for this: 503 is used for "I can't handle this request right now", but this situation is more like "You asked me to do two things at the same time that can't happen at the same time"

@koivunej
Copy link
Member Author

I agree, but I failed to see this last week; the behavior is unchanged for a very short period of locking.

@jcsp
Copy link
Collaborator

jcsp commented Mar 25, 2024

We can make this a bit nicer for clients by distinguishing true conflicts from creations in progress, by reporting the latter as 429:

jcsp added a commit that referenced this issue Mar 26, 2024
## Problem

Currently, we return 409 (Conflict) in two cases:
- Temporary: Timeline creation cannot proceed because another timeline
with the same ID is being created
- Permanent: Timeline creation cannot proceed because another timeline
exists with different parameters but the same ID.

Callers which time out a request and retry should be able to distinguish
these cases.

Closes: #7208 

## Summary of changes

- Expose `AlreadyCreating` errors as 429 instead of 409
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants