Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage controller: send startup compute notifications in background #7495

Merged
merged 4 commits into from
Apr 29, 2024

Conversation

jcsp
Copy link
Collaborator

@jcsp jcsp commented Apr 24, 2024

Problem

Previously, we try to send compute notifications in startup_reconcile before completing that function, with a time limit. Any notifications that don't happen within the time limit result in tenants having their pending_compute_notification flag set, which causes them to spawn a Reconciler next time the background reconciler loop runs.

This causes two problems:

When the code was first written, ComputeHook didn't have internal ordering logic to ensure that notifications for a shard were sent in the right order. Since that was added in #7088, we can use it to avoid waiting for notifications to complete in startup_reconcile.

Related to: #7460

Summary of changes

  • Add a notify_background method to ComputeHook.
  • Call this from startup_reconcile instead of doing notifications inline
  • Process completions from notify_background in process_results, and if a notification failed then set the pending_compute_notification flag on the shard.

The result is that we will only spawn lots of Reconcilers if the compute notifications fail, not just because they take some significant amount of time.

Test coverage for this case is in #7475

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added t/bug Issue Type: Bug c/storage/controller Component: Storage Controller labels Apr 24, 2024
@jcsp jcsp requested a review from a team as a code owner April 24, 2024 09:08
@jcsp jcsp requested a review from petuhovskiy April 24, 2024 09:08
Copy link

github-actions bot commented Apr 24, 2024

2820 tests run: 2699 passed, 0 failed, 121 skipped (full report)


Flaky tests (1)

Postgres 14

  • test_vm_bit_clear_on_heap_lock: debug

Code coverage* (full report)

  • functions: 28.2% (6551 of 23257 functions)
  • lines: 46.8% (46240 of 98781 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
2119af7 at 2024-04-29T09:06:27.174Z :recycle:

Copy link
Member

@petuhovskiy petuhovskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Looked through the storage-controller code, now more or less understand how it works

@jcsp jcsp enabled auto-merge (squash) April 29, 2024 08:28
@petuhovskiy
Copy link
Member

Giving it another thought, I think ComputeHook service can just have a single unbounded channel for accepting requests and it doesn't need to have notify(), notify_background() or any other pub functions. For notifying about errors, it can have a single outgoing channel that will be processed in process_results.

@jcsp jcsp merged commit 8491443 into main Apr 29, 2024
49 checks passed
@jcsp jcsp deleted the jcsp/storcon-background-notify branch April 29, 2024 08:59
@jcsp
Copy link
Collaborator Author

jcsp commented Apr 29, 2024

Giving it another thought, I think ComputeHook service can just have a single unbounded channel for accepting requests and it doesn't need to have notify(), notify_background() or any other pub functions. For notifying about errors, it can have a single outgoing channel that will be processed in process_results.

The important characteristics of notify() are:

  • is that it shouldn't complete until the notification has been processed -- when we do something like a migration, we must not detach the old location until we know the compute has been notified of the new location.
  • one tenant's notification does not block another tenants (e.g. if we had a global channel for notifications, we couldn't pause processing it while we wait for one tenant's notifications to complete).

Those behaviors could indeed still be implemented on top of channels though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/controller Component: Storage Controller t/bug Issue Type: Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants