
pageserver: do image layer creation after timeline creation (or remove the code) #7197

Status: Open
jcsp opened this issue Mar 21, 2024 · 1 comment
Labels: a/tech_debt (Area: related to tech debt) · c/storage/pageserver (Component: storage: pageserver)

jcsp (Collaborator) commented Mar 21, 2024

Background

See: #7182 (comment)

In flush_frozen_layer we do this:

        // As a special case, when we have just imported an image into the repository,
        // instead of writing out a L0 delta layer, we directly write out image layer
        // files instead. This is possible as long as *all* the data imported into the
        // repository have the same LSN.
        let lsn_range = frozen_layer.get_lsn_range();
        let (layers_to_upload, delta_layer_to_add) =
            if lsn_range.start == self.initdb_lsn && lsn_range.end == Lsn(self.initdb_lsn.0 + 1) {

This code path isn't taken for normal timeline creations, because although we call freeze_and_flush right after creation, there is a small WAL ingest between ingesting initdb and freezing the layer.
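To make the failure mode concrete, here is a minimal sketch (with a stand-in `Lsn` type, not the pageserver's actual one) of the LSN-range check quoted above, showing why any WAL ingested between the initdb import and the freeze widens the range and disables the image-layer path:

```rust
// Hedged sketch: `Lsn` and `takes_image_layer_path` are simplified
// stand-ins for the real pageserver types, used only to illustrate
// the condition in flush_frozen_layer.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Lsn(u64);

// The special case only fires when the frozen layer covers exactly
// initdb_lsn..initdb_lsn+1, i.e. all data in it was imported at one LSN.
fn takes_image_layer_path(lsn_start: Lsn, lsn_end: Lsn, initdb_lsn: Lsn) -> bool {
    lsn_start == initdb_lsn && lsn_end == Lsn(initdb_lsn.0 + 1)
}

fn main() {
    let initdb_lsn = Lsn(0x100);

    // Pure initdb import: the special case fires, image layers are written.
    assert!(takes_image_layer_path(Lsn(0x100), Lsn(0x101), initdb_lsn));

    // Normal timeline creation: a small WAL ingest before the freeze
    // widens the LSN range, so we fall back to an L0 delta layer.
    assert!(!takes_image_layer_path(Lsn(0x100), Lsn(0x180), initdb_lsn));
}
```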

Skipping this image layer generation is mostly harmless, because an L1 layer full of page values is no less efficient than an image layer holding the same values. However, if we implement compression of image layers (#5913) before we attempt compression of image values in delta layers, there is a benefit to writing an image layer for newly created tenants: it reduces the physical size.

Action

We should do one of these two things:

  1. Make it so that we take this image layer generation path after normal timeline creations. This will require updating some tests, especially those that configure a tiny layer count and then make assertions about layer counts.
  2. Or, just remove this dead code, and plan on implementing compression of image values in delta layers, such that the benefit to writing an image layer is almost nil.
@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels Mar 21, 2024
@jcsp jcsp changed the title pageserver: reinstate image layer creation after timeline creation pageserver: do image layer creation after timeline creation Mar 21, 2024
@jcsp jcsp changed the title pageserver: do image layer creation after timeline creation pageserver: do image layer creation after timeline creation (or remove the code) Jun 3, 2024
koivunej (Member) commented Jun 4, 2024

Encountered an s3-recovery-related problem in #7927: if we solve this issue by simply flushing more often (as happens when checkpoint_distance is smaller than the initdb size), we will produce two index_part.json updates very close to one another. This means that s3_recovery will not work, and the test case hangs waiting for the WAL part of initdb to arrive for the root timeline.

This failure mode was obscured by a number of things, but mock_s3 and real_s3 both exhibit this behaviour together with stable sort.

It of course only applies to timelines that have never had a compute started up against them. However, the first uploaded index_part.json version is meaningless and inconsistent: we can never recover to that Lsn using safekeepers, because the pageserver is the only party that had the WAL (uploaded as initdb.tar.zst).

For importing really large backups, I don't think we can use the normal flush loop at all; we will need to build the image layers directly somehow. I don't know how to do that in a streaming fashion, because we'd essentially need random-access I/O over the whole fullbackup tar to do the repartitioning and splitting into image layers. An acceptable workaround might be to create arbitrary image layers just before the imported lsn so that we can fit the fullbackup and produce "L0 deltas" (which are actually image layers, but this way they get the compaction treatment).
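The repartitioning idea above can be sketched as a key-range split. This is not the pageserver's API; `split_into_image_layers` and the plain `u64` key bounds are invented here purely to illustrate cutting a full-backup key space into fixed-size image layers:

```rust
// Hedged sketch (invented names, not the pageserver API): split a
// full-backup key range into image layers of at most `keys_per_layer`
// keys each, all to be stamped just below the imported LSN.
fn split_into_image_layers(key_start: u64, key_end: u64, keys_per_layer: u64) -> Vec<(u64, u64)> {
    let mut layers = Vec::new();
    let mut start = key_start;
    while start < key_end {
        // The last layer may be shorter than keys_per_layer.
        let end = (start + keys_per_layer).min(key_end);
        layers.push((start, end));
        start = end;
    }
    layers
}

fn main() {
    // 10 keys split into layers of up to 4 keys each.
    let layers = split_into_image_layers(0, 10, 4);
    assert_eq!(layers, vec![(0, 4), (4, 8), (8, 10)]);
}
```

The real repartitioning would split by on-disk layer size rather than key count, which is exactly why random access over the fullbackup tar would be needed.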

koivunej added a commit that referenced this issue Jun 10, 2024
As seen with the pgvector 0.7.0 index builds, we can receive large
batches of images, leading to very large L0 layers in the range of 1GB.
These large layers are produced because we are only able to roll the
layer after we have witnessed two different Lsns in a single
`DataDirModification::commit`. As the single Lsn batches of images can
span over multiple `DataDirModification` lifespans, we will rarely get
to write two different Lsns in a single `put_batch` currently.

The solution is to remember the TimelineWriterState instead of eagerly
forgetting it until we really open the next layer or someone else
flushes (while holding the write_guard).

Additional changes are test fixes that avoid the "initdb image layer
optimization" or ignore initdb layers in assertions.

Cc: #7197 because small `checkpoint_distance` will now trigger the
"initdb image layer optimization"
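The "remember the TimelineWriterState" fix in the commit above can be sketched roughly as follows. All types and thresholds here are simplified stand-ins (the real `TimelineWriterState` tracks more than this); the point is only that state surviving across batches lets the layer roll at the next distinct LSN instead of growing unbounded:

```rust
// Hedged sketch, not the actual pageserver implementation.
struct TimelineWriterState {
    first_lsn: Option<u64>, // first LSN absorbed by the open layer
    bytes_written: u64,     // bytes written into the open layer so far
}

impl TimelineWriterState {
    fn new() -> Self {
        Self { first_lsn: None, bytes_written: 0 }
    }

    /// Roll (freeze) the open layer before this batch once the size
    /// threshold is crossed AND the batch is at a later LSN than the
    /// first one the layer absorbed: we never split a single LSN.
    fn should_roll(&self, batch_lsn: u64, threshold: u64) -> bool {
        match self.first_lsn {
            Some(first) => self.bytes_written >= threshold && batch_lsn > first,
            None => false,
        }
    }

    fn record_batch(&mut self, batch_lsn: u64, bytes: u64) {
        self.first_lsn.get_or_insert(batch_lsn);
        self.bytes_written += bytes;
    }
}

fn main() {
    let mut state = TimelineWriterState::new();
    let threshold = 100;

    // Single-LSN image batches keep accumulating: never roll mid-LSN.
    state.record_batch(10, 80);
    assert!(!state.should_roll(10, threshold));
    state.record_batch(10, 80);

    // Because the state survived across batches, the first batch at a
    // later LSN can trigger the roll instead of growing a huge L0.
    assert!(state.should_roll(11, threshold));
}
```

Before the fix, the state was effectively forgotten between `put_batch` calls, so the "two distinct LSNs" condition was rarely observed and layers grew to ~1GB.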