Epic: pageserver image layer compression #5431

jcsp · 2023-10-02T08:30:49Z

Background

We may substantially decrease the capacity & bandwidth footprint of tenants by compressing data in their image layers.

There are many possible implementations, from compressing whole layer files as streams, to introducing a chunked format and decompressing a chunk at a time, to simply compressing individual pages.

Compressing individual pages in image layers is by far the simplest thing to do, and should have a high payoff as:

image layers are often the majority of a tenant's storage footprint.
image layers provide 8 KiB pages that should be large enough to meaningfully compress.

Compressing deltas is a harder problem (individual deltas are likely too small to usefully compress), and is left as a possible future change.

Implementation

There is a preliminary version here: #7091, which demonstrates that per-page compression in image layers may be added as a relatively lightweight code change.

To get this ready for production, there is more work to do:

Evaluate compression algorithms on realistic datasets. We should analyze:
- zstd
- LZ4
- zstd/LZ4 plus dictionaries: we could craft a dictionary-per-layer to get better compression of each page in the layer.
- Pay particular attention to read performance: this is the part that will be in the hot path for getpage latency.
Revise the page header format to enable stashing compression flags: we currently have a four-byte header which is gratuitously large, and we should be able to store compression info in there without adding more header bytes (discussed at Compress image layer #7091 (comment))
Handle compressed user data efficiently: if the user's data is already compressed, we should detect that and avoid re-compressing it on the pageserver (discussed at Compress image layer #7091 (comment))
Define a phased roll-out approach: there may be significantly more CPU load once compression is in use.

PRs/issues

Rollout

We'd like to reserve some bits in the length field of image layers for future use (compression). This PR is based on the assumption that no blob needs more than 28 bits (3 bytes + 4 bits) to store its length. As preparation, before turning this into an error, we first emit warnings in case the assumption is wrong, since warnings are less disruptive than errors. A metric would be even less disruptive: log messages are slower, so if there were a LOT of such large blobs, printing them would take a lot of time. At the same time, blobs of 256 MiB or more will likely occupy an entire layer file, as they are larger than our target size. We already log something for each layer file, so the overhead shouldn't increase much. Part of #5431
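The 28-bit assumption can be made concrete: 28 bits cover lengths below 2^28 bytes (256 MiB), which leaves the top 4 bits of a 32-bit length field free for flags. The following is a minimal sketch under that assumption only. The 4-byte big-endian layout and the names `LEN_BITS`, `MAX_BLOB_LEN`, `pack_header`, and `unpack_header` are hypothetical and are not taken from the pageserver code.

```rust
/// Number of bits assumed to hold the blob length (3 bytes + 4 bits).
const LEN_BITS: u32 = 28;
/// Largest length that fits in 28 bits: 2^28 - 1 bytes, just under 256 MiB.
const MAX_BLOB_LEN: u32 = (1 << LEN_BITS) - 1;

/// Pack a length and 4 flag bits into a hypothetical 4-byte big-endian header.
/// Returns `None` if the length needs more than 28 bits or the flags more than 4 bits.
fn pack_header(len: u32, flags: u8) -> Option<[u8; 4]> {
    if len > MAX_BLOB_LEN || flags > 0x0f {
        return None;
    }
    let word = (u32::from(flags) << LEN_BITS) | len;
    Some(word.to_be_bytes())
}

/// Split a header back into (length, flags).
fn unpack_header(header: [u8; 4]) -> (u32, u8) {
    let word = u32::from_be_bytes(header);
    (word & MAX_BLOB_LEN, (word >> LEN_BITS) as u8)
}
```

In this sketch a blob of exactly 256 MiB is already rejected by `pack_header`, which is why any existing blob at or above that size would conflict with reusing the upper bits.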

problame · 2024-05-27T13:37:00Z

Last week:

arpad wrote a tool to compress image layers Add a pagectl tool to recompress image layers #7879

This week:

identify interesting / representative tenants / layers
determine achievable space savings by running the tool against the identified layers

koivunej · 2024-06-10T13:44:32Z

This week:

implement decompression
compare decompression speed
have a meeting with Konstantin, Stas, and John later this week
- which algorithm is chosen right now

Add support for reading and writing zstd-compressed blobs for use in image layer generation, but maybe one day useful also for delta layers. The reading of them is unconditional while the writing is controlled by the `image_compression` config variable allowing for experiments. For the on-disk format, we re-use some of the bitpatterns we currently keep reserved for blobs larger than 256 MiB. This assumes that we have never ever written any such large blobs to image layers. After the preparation in #7852, we now are unable to read blobs with a size larger than 256 MiB (or write them). A non-goal of this PR is to come up with good heuristics of when to compress a bitpattern. This is left for future work. Parts of the PR were inspired by #7091. cc #7879 Part of #5431

…8238) PR #8106 was created with the assumption that no blob is larger than `256 MiB`. Due to #7852 we have checking for *writes* of blobs larger than that limit, but we didn't have checking for *reads* of such large blobs: in theory, we could be reading these blobs every day but we just don't happen to write the blobs for some reason. Therefore, we now add a warning for *reads* of such large blobs as well. To make deploying compression less dangerous, we therefore only assume a blob is compressed if the compression setting is present in the config. This also means that we can't back out of compression once we enabled it. Part of #5431

As per @koivunej's request in #8238 (comment), use a runtime param instead of monomorphizing the function based on the value. Part of #5431

Adds a find-large-objects subcommand to the scrubber to allow listing layer objects larger than a specific size. To be used like:

```
AWS_PROFILE=dev REGION=us-east-2 BUCKET=neon-dev-storage-us-east-2 cargo run -p storage_scrubber -- find-large-objects --min-size 250000000 --ignore-deltas
```

Part of #5431

This flattens the compression algorithm setting, removing the `Option<_>` wrapping layer and making the setting easier to handle. It also adds a specific setting for *disabled* compression that keeps the ability to read compressed data, giving us the option to more easily back out of a compression rollout should the need arise, which was one of the limitations of #8238. Implements my suggestion from #8238 (comment), inspired by Christian's review in #8238 (review). Part of #5431

Improve parsing of the `ImageCompressionAlgorithm` enum to allow level customization like `zstd(1)`, as strum only takes `Default::default()`, i.e. `None` as the level. Part of #5431
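To illustrate the kind of parsing described here, below is a minimal sketch that accepts `disabled`, `zstd`, and `zstd(<level>)`. It is not the pageserver's implementation: the enum `ImageCompressionSetting`, its variants, and the error strings are hypothetical stand-ins, and only the standard library is used rather than strum.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for the setting described above. The level is
/// optional so that a bare `zstd` means "no explicit level given".
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ImageCompressionSetting {
    Disabled,
    Zstd { level: Option<i8> },
}

impl FromStr for ImageCompressionSetting {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let s = s.trim();
        // Split `name(arg)` into its name and optional argument.
        let (name, arg) = match s.split_once('(') {
            Some((name, rest)) => {
                let arg = rest
                    .strip_suffix(')')
                    .ok_or_else(|| format!("missing closing parenthesis in '{}'", s))?;
                (name, Some(arg))
            }
            None => (s, None),
        };
        match (name, arg) {
            ("disabled", None) => Ok(Self::Disabled),
            ("zstd", None) => Ok(Self::Zstd { level: None }),
            ("zstd", Some(level)) => {
                let parsed = level
                    .trim()
                    .parse::<i8>()
                    .map_err(|e| format!("invalid zstd level '{}': {}", level, e))?;
                Ok(Self::Zstd { level: Some(parsed) })
            }
            _ => Err(format!("unknown compression setting '{}'", s)),
        }
    }
}
```

With this sketch, `"zstd(1)".parse::<ImageCompressionSetting>()` yields a `Zstd` variant carrying `Some(1)`, while a bare `"zstd"` keeps the level as `None`.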

The find-large-objects scrubber subcommand is quite fast if you run it in an environment with low latency to the S3 bucket (say an EC2 instance in the same region). However, the higher the latency gets, the slower the command becomes. Therefore, add a concurrency param and parallelize the command. This doesn't change that general relationship, but at least lets us issue multiple requests in parallel and therefore hopefully finish faster.

Running with concurrency of 64 (default):

```
2024-07-05T17:30:22.882959Z INFO lazy_load_identity [...]
[...]
2024-07-05T17:30:28.289853Z INFO Scanned 500 shards. [...]
```

With concurrency of 1, simulating state before this PR:

```
2024-07-05T17:31:43.375153Z INFO lazy_load_identity [...]
[...]
2024-07-05T17:33:51.987092Z INFO Scanned 500 shards. [...]
```

In other words, the time to list 500 shards drops from 2:08 minutes to about 6 seconds. Follow-up of #8257, part of #5431
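The quoted durations follow directly from the log timestamps. As a small worked check, the sketch below computes the elapsed time for each run from the time-of-day part of the timestamps; the helper names `seconds_of_day` and `elapsed` are made up for this illustration and assume both timestamps fall on the same day.

```rust
/// Convert the `HH:MM:SS.ffffff` time-of-day part of a log timestamp to seconds.
fn seconds_of_day(hms: &str) -> Option<f64> {
    let mut parts = hms.split(':');
    let h: f64 = parts.next()?.parse().ok()?;
    let m: f64 = parts.next()?.parse().ok()?;
    let s: f64 = parts.next()?.parse().ok()?;
    Some(h * 3600.0 + m * 60.0 + s)
}

/// Elapsed seconds between two timestamps on the same day.
fn elapsed(start: &str, end: &str) -> Option<f64> {
    Some(seconds_of_day(end)? - seconds_of_day(start)?)
}
```

Applied to the two runs, `elapsed("17:30:22.882959", "17:30:28.289853")` gives about 5.4 seconds and `elapsed("17:31:43.375153", "17:33:51.987092")` gives about 128.6 seconds, so the parallel run is roughly 24 times faster in this measurement.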

Removes the `ImageCompressionAlgorithm::DisabledNoDecompress` variant. We now assume any blob with the specific bits set is actually a compressed blob. The `ImageCompressionAlgorithm::Disabled` variant still remains and is the new default. Reverts large parts of #8238, as originally intended in that PR. Part of #5431

We need to pass on the configured compression param during image layer generation. This was an oversight of #8106, and the likely reason why #8288 didn't surface any interesting regressions. Part of #5431

Implement decompression of images for vectored reads. This doesn't implement support for still treating blobs as uncompressed when they have the bits we reserved for compression set, as we have removed that functionality in #8300 anyway. Part of #5431

arpad-m · 2024-07-12T16:22:20Z

Week Jul 1-5:

big implementation week, filed many PRs:
we support configuring compression now via the image_compression config param.
wrote scrubber subcommand to look for image layers > 250M bytes.
ran this scrubber subcommand against prod. there were no such image layers, which implies that there is no blob above that limit either. this allows us to reuse the bits that encode such large blobs for other purposes.

Week Jul 8-12:

deployed release now forbids writing blobs >=256MiB, both to image and delta layers.
ran scrubber again after the release, to ensure no blob >=256MiB was added in the window between the first scrubber run and the release, according to Christian's plan: Only support compressed reads if the compression setting is present #8238 (comment)
big testing week. debugged in:
- Enable zstd in tests #8288
- Make vectored read_blobs function not fill buffer correctly #8324
got PRs from last week merged:
- Remove ImageCompressionAlgorithm::DisabledNoDecompress #8300
- Implement decompression for vectored reads #8302
testing found settings passing oversight. filed Only support compressed reads if the compression setting is present #8238 for it and got it merged.
New testing PR: Enable zstd in tests #8368

Successor of #8288, just enable zstd in tests. Also adds a test that creates easily compressible data. Part of #5431

Co-authored-by: John Spray <john@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>

If compression is enabled, we currently try compressing each image larger than a specific size. If the compressed version is smaller, we write that one, otherwise we write the uncompressed image. However, this can be a wasteful process if a substantial share of the images don't compress well.

The compression metrics added in #8420, `pageserver_compression_image_in_bytes_total` and `pageserver_compression_image_out_bytes_total`, are well suited to answering how space efficient the total compression process is end-to-end, which helps one decide whether to enable it or not.

To answer the question of how much is wasted on trial compression, i.e. CPU time, we add two metrics:

* one about the images that have been trial-compressed (considered), and
* one about the images where the compressed image has actually been written (chosen).

There are different ways of weighting them: for example, one could look at the count, or at the compressed data. But the main contributor to compression CPU usage is the amount of data processed, so we weight the images by their *uncompressed* size. In other words, the two metrics are:

* `pageserver_compression_image_in_bytes_considered`
* `pageserver_compression_image_in_bytes_chosen`

Part of #5431
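The trial-compression flow and the two counters can be sketched as follows. This is only an illustration under stated assumptions: `toy_compress` is a toy run-length encoder standing in for zstd, and `CompressionStats`, `choose_representation`, and the `min_size` parameter are hypothetical names rather than the pageserver's API. Both counters are weighted by the uncompressed input size, as described above.

```rust
/// Counters weighted by *uncompressed* input size, mirroring the two metrics above.
#[derive(Debug, Default, PartialEq, Eq)]
struct CompressionStats {
    in_bytes_considered: u64,
    in_bytes_chosen: u64,
}

/// Toy run-length encoder used only as a stand-in for zstd:
/// the output is a sequence of (run length, byte) pairs.
fn toy_compress(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut iter = data.iter().copied().peekable();
    while let Some(byte) = iter.next() {
        let mut run: u8 = 1;
        while run < u8::MAX && iter.peek() == Some(&byte) {
            iter.next();
            run += 1;
        }
        out.push(run);
        out.push(byte);
    }
    out
}

/// Trial-compress `image`; return the bytes to write and whether they are compressed.
fn choose_representation(
    image: &[u8],
    min_size: usize,
    stats: &mut CompressionStats,
) -> (Vec<u8>, bool) {
    if image.len() <= min_size {
        // Not larger than the threshold: no trial, so not counted as considered.
        return (image.to_vec(), false);
    }
    stats.in_bytes_considered += image.len() as u64;
    let compressed = toy_compress(image);
    if compressed.len() < image.len() {
        stats.in_bytes_chosen += image.len() as u64;
        (compressed, true)
    } else {
        // Compression did not help; the CPU time spent on the trial was wasted.
        (image.to_vec(), false)
    }
}
```

In this sketch, the ratio of `in_bytes_chosen` to `in_bytes_considered` indicates what share of the trial-compressed input actually ended up being written in compressed form.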

After the rollout has succeeded, we now set the default image compression to be enabled. We also remove its explicit mention from `neon_fixtures.py` added in #8368 as it is now the default (and we switch to `zstd(1)` which is a bit nicer on CPU time). Part of #5431

koivunej · 2024-08-26T13:34:41Z

From @Bodobolero's benchmarks: add lz4 support for comparison.

arpad-m · 2024-08-26T15:13:06Z

we talked about this in the call and agreed that we will not spend developer time on this unless further investigation identifies compression as the culprit.

arpad-m · 2024-09-02T12:10:50Z

I think this can be closed now.

jcsp added the t/feature and c/storage/pageserver labels Oct 2, 2023

MMeent mentioned this issue Jan 19, 2024

Epic: Safekeeper S3 (WAL segment) compression #6409

Open

jcsp mentioned this issue Mar 28, 2024

Data compression in the pageserver #548

Closed

jcsp changed the title ~~Epic: pageserver compression~~ Epic: pageserver image layer compression Apr 15, 2024

jcsp assigned arpad-m May 20, 2024

arpad-m mentioned this issue May 22, 2024

Warn if a blob in an image is larger than 256 MiB #7852

Merged

arpad-m mentioned this issue May 24, 2024

Add a pagectl tool to recompress image layers #7879

Closed

arpad-m mentioned this issue Jun 19, 2024

Add support for reading and writing compressed blobs #8106

Merged

arpad-m mentioned this issue Jul 2, 2024

Only support compressed reads if the compression setting is present #8238

Merged

This was referenced Jul 3, 2024

Use bool param for round_trip_test_compressed #8252

Merged

Add find-large-objects subcommand to scrubber #8257

Merged

arpad-m added a commit that referenced this issue Jul 4, 2024

Use bool param for round_trip_test_compressed (#8252)

a004d27

arpad-m mentioned this issue Jul 4, 2024

Flatten compression algorithm setting #8265

Merged

This was referenced Jul 5, 2024

Improve parsing of ImageCompressionAlgorithm #8281

Merged

Enable zstd in tests #8288

Closed

Add concurrency to the find-large-objects scrubber subcommand #8291

Merged

This was referenced Jul 6, 2024

Remove ImageCompressionAlgorithm::DisabledNoDecompress #8300

Merged

Implement decompression for vectored reads #8302

Merged

VladLazar pushed a commit that referenced this issue Jul 8, 2024

Use bool param for round_trip_test_compressed (#8252)

0a63bc4

arpad-m mentioned this issue Jul 11, 2024

Pass configured compression param to image generation #8363

Merged

arpad-m mentioned this issue Jul 12, 2024

Enable zstd in tests #8368

Merged

arpad-m mentioned this issue Jul 26, 2024

Add metrics for input data considered and taken for compression #8522

Merged

arpad-m mentioned this issue Aug 9, 2024

Default image compression to zstd at level 1 #8677

Merged

arpad-m closed this as completed Sep 2, 2024
