
pageserver: do image layer creation after timeline creation (or remove the code) #7197

Status: Open
jcsp opened this issue Mar 21, 2024 · 1 comment
Labels: a/tech_debt (Area: related to tech debt) · c/storage/pageserver (Component: storage: pageserver)

jcsp (Collaborator) commented Mar 21, 2024

Background

See: #7182 (comment)

In flush_frozen_layer we do this:

        // As a special case, when we have just imported an image into the repository,
        // instead of writing out a L0 delta layer, we directly write out image layer
        // files instead. This is possible as long as *all* the data imported into the
        // repository have the same LSN.
        let lsn_range = frozen_layer.get_lsn_range();
        let (layers_to_upload, delta_layer_to_add) =
            if lsn_range.start == self.initdb_lsn && lsn_range.end == Lsn(self.initdb_lsn.0 + 1) {

This code path isn't taken for normal timeline creations, because although we call freeze_and_flush right after creation, there is a small WAL ingest between ingesting initdb and freezing the layer.
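To make the failure mode concrete, here is a minimal sketch (with a stand-in `Lsn` type, not the pageserver's actual one) of the LSN-range check quoted above, showing why any WAL ingested between the initdb import and the freeze widens the range and disables the image-layer path:

```rust
// Hedged sketch: `Lsn` and `takes_image_layer_path` are simplified
// stand-ins for the real pageserver types, used only to illustrate
// the condition in flush_frozen_layer.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Lsn(u64);

// The special case only fires when the frozen layer covers exactly
// initdb_lsn..initdb_lsn+1, i.e. all data in it was imported at one LSN.
fn takes_image_layer_path(lsn_start: Lsn, lsn_end: Lsn, initdb_lsn: Lsn) -> bool {
    lsn_start == initdb_lsn && lsn_end == Lsn(initdb_lsn.0 + 1)
}

fn main() {
    let initdb_lsn = Lsn(0x100);

    // Pure initdb import: the special case fires, image layers are written.
    assert!(takes_image_layer_path(Lsn(0x100), Lsn(0x101), initdb_lsn));

    // Normal timeline creation: a small WAL ingest before the freeze
    // widens the LSN range, so we fall back to an L0 delta layer.
    assert!(!takes_image_layer_path(Lsn(0x100), Lsn(0x180), initdb_lsn));
}
```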

Skipping this image layer generation is mostly harmless, because an L1 layer full of page values is no less efficient than an image layer holding the same values. However, if we implement compression of image layers (#5913) before we attempt compression of image values in delta layers, there is a benefit to writing an image layer for newly created tenants: it reduces the physical size.

Action

We should do one of these two things:

  1. Make it so that we take this image layer generation path after normal timeline creations. This will require updating some tests, especially those that configure a tiny layer count and then make assertions about layer counts.
  2. Or, just remove this dead code, and plan on implementing compression of image values in delta layers, such that the benefit to writing an image layer is almost nil.
@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels Mar 21, 2024
@jcsp jcsp changed the title pageserver: reinstate image layer creation after timeline creation pageserver: do image layer creation after timeline creation Mar 21, 2024
@jcsp jcsp changed the title pageserver: do image layer creation after timeline creation pageserver: do image layer creation after timeline creation (or remove the code) Jun 3, 2024
koivunej (Member) commented Jun 4, 2024

Encountered an s3-recovery-related problem in #7927: if we solve this issue by simply flushing more often (as happens when checkpoint_distance is smaller than the initdb size), we will produce two index_part.json updates very close to one another. This means that s3_recovery will not work, and the test case hangs waiting for the WAL part of initdb to arrive for the root timeline.

This failure mode was obscured by a number of things, but mock_s3 and real_s3 both exhibit this behaviour together with stable sort.

It of course only applies to timelines that have never had a compute started up against them. However, the first uploaded index_part.json version is meaningless and inconsistent: we can never recover to that Lsn using safekeepers, because the pageserver is the only party that had the WAL (uploaded as initdb.tar.zst).

For importing really large backups, I don't think we can use the normal flush loop at all; we will need to build the image layers directly somehow. I don't know how to do that in a streaming fashion, because we'd essentially need random-access I/O over the whole fullbackup tar to do the repartitioning and splitting into image layers. An acceptable workaround might be to create arbitrary image layers just before the imported lsn so that we can fit the fullbackup and produce "L0 deltas" (which are actually image layers, but this way they get the compaction treatment).
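The repartitioning idea above can be sketched as a key-range split. This is not the pageserver's API; `split_into_image_layers` and the plain `u64` key bounds are invented here purely to illustrate cutting a full-backup key space into fixed-size image layers:

```rust
// Hedged sketch (invented names, not the pageserver API): split a
// full-backup key range into image layers of at most `keys_per_layer`
// keys each, all to be stamped just below the imported LSN.
fn split_into_image_layers(key_start: u64, key_end: u64, keys_per_layer: u64) -> Vec<(u64, u64)> {
    let mut layers = Vec::new();
    let mut start = key_start;
    while start < key_end {
        // The last layer may be shorter than keys_per_layer.
        let end = (start + keys_per_layer).min(key_end);
        layers.push((start, end));
        start = end;
    }
    layers
}

fn main() {
    // 10 keys split into layers of up to 4 keys each.
    let layers = split_into_image_layers(0, 10, 4);
    assert_eq!(layers, vec![(0, 4), (4, 8), (8, 10)]);
}
```

The real repartitioning would split by on-disk layer size rather than key count, which is exactly why random access over the fullbackup tar would be needed.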

koivunej added a commit that referenced this issue Jun 10, 2024
As seen with the pgvector 0.7.0 index builds, we can receive large
batches of images, leading to very large L0 layers in the range of 1GB.
These large layers are produced because we are only able to roll the
layer after we have witnessed two different Lsns in a single
`DataDirModification::commit`. As the single Lsn batches of images can
span over multiple `DataDirModification` lifespans, we will rarely get
to write two different Lsns in a single `put_batch` currently.

The solution is to remember the TimelineWriterState instead of eagerly
forgetting it until we really open the next layer or someone else
flushes (while holding the write_guard).

Additional changes are test fixes that avoid the "initdb image layer
optimization" or ignore initdb layers in assertions.

Cc: #7197 because small `checkpoint_distance` will now trigger the
"initdb image layer optimization"
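The "remember the TimelineWriterState" fix in the commit above can be sketched roughly as follows. All types and thresholds here are simplified stand-ins (the real `TimelineWriterState` tracks more than this); the point is only that state surviving across batches lets the layer roll at the next distinct LSN instead of growing unbounded:

```rust
// Hedged sketch, not the actual pageserver implementation.
struct TimelineWriterState {
    first_lsn: Option<u64>, // first LSN absorbed by the open layer
    bytes_written: u64,     // bytes written into the open layer so far
}

impl TimelineWriterState {
    fn new() -> Self {
        Self { first_lsn: None, bytes_written: 0 }
    }

    /// Roll (freeze) the open layer before this batch once the size
    /// threshold is crossed AND the batch is at a later LSN than the
    /// first one the layer absorbed: we never split a single LSN.
    fn should_roll(&self, batch_lsn: u64, threshold: u64) -> bool {
        match self.first_lsn {
            Some(first) => self.bytes_written >= threshold && batch_lsn > first,
            None => false,
        }
    }

    fn record_batch(&mut self, batch_lsn: u64, bytes: u64) {
        self.first_lsn.get_or_insert(batch_lsn);
        self.bytes_written += bytes;
    }
}

fn main() {
    let mut state = TimelineWriterState::new();
    let threshold = 100;

    // Single-LSN image batches keep accumulating: never roll mid-LSN.
    state.record_batch(10, 80);
    assert!(!state.should_roll(10, threshold));
    state.record_batch(10, 80);

    // Because the state survived across batches, the first batch at a
    // later LSN can trigger the roll instead of growing a huge L0.
    assert!(state.should_roll(11, threshold));
}
```

Before the fix, the state was effectively forgotten between `put_batch` calls, so the "two distinct LSNs" condition was rarely observed and layers grew to ~1GB.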