# re-assess severity of duplicate layers: nowadays it cannot happen & we should panic/abort() early if they do #7790
fixes #7790 (duplicating most of the issue description here for posterity)

# Background

From the time before the always-authoritative `index_part.json`, we had to handle duplicate layers. See the RFC for an illustration of how duplicate layers could happen:

https://github.com/neondatabase/neon/blob/a8e6d259cb49d1bf156dfc2215b92c04d1e8a08f/docs/rfcs/027-crash-consistent-layer-map-through-index-part.md?plain=1#L41-L50

As of #5198, we should not be exposed to that problem anymore.

# Problem 1

We still have

1. [code in Pageserver](https://github.com/neondatabase/neon/blob/82960b2175211c0f666b91b5258c5e2253a245c7/pageserver/src/tenant/timeline.rs#L4502-L4521) that handles duplicate layers
2. [tests in the test suite](https://github.com/neondatabase/neon/blob/d9dcbffac37ccd3331ec9adcd12fd20ce0ea31aa/test_runner/regress/test_duplicate_layers.py#L15) that demonstrate the problem using a failpoint

However, the test in the test suite doesn't use the failpoint to induce a crash that could legitimately happen in production. What it does instead is return early with an `Ok()`, so that the Pageserver code that handles duplicate layers (item 1) actually gets exercised. That "return early" would be a bug in the routine if it happened in production. So the tests in the test suite are tests for their own sake, but don't serve to regress-test any production behavior. A toy sketch of such an early-return failpoint follows.
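For illustration only, here is a minimal sketch of that kind of early-return failpoint using the `fail` crate; the function body and the failpoint name are made up, not Neon's actual code:

```rust
// Sketch only: `compact_level0` and the failpoint name are hypothetical.
// Cargo.toml: fail = { version = "0.5", features = ["failpoints"] }
fn compact_level0() -> std::io::Result<()> {
    // ... at this point, the new layer files have been written to disk ...

    // When activated via fail::cfg("compact-before-layer-map-update", "return"),
    // the routine returns Ok(()) here, *before* the layer map is updated.
    // This early exit is what the old test exercised; it is not a failure
    // mode that production can hit.
    fail::fail_point!("compact-before-layer-map-update", |_| Ok(()));

    // ... update the layer map ...
    Ok(())
}
```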
# Problem 2

Further, if production code _did_ (nowadays it doesn't!) create a duplicate layer, the code in Pageserver that handles the condition (item 1 above) is too little and too late:

* the code handles it by discarding the newer `struct Layer`; that's good.
* however, on disk, we have already overwritten the old layer file with the new one
* the fact that we do so atomically doesn't matter, because ...
* if the new layer file is not bit-identical, then we have a cache coherency problem:
  * the PS `PageCache` block cache still caches the old bit pattern
  * `blob_io` offsets stored in variables are based on the pre-overwrite bit pattern / offsets
  * => reading at these offsets from the new file might yield different data than before (illustrated by the sketch below)
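A self-contained toy program (plain `std` Rust with made-up file contents, not pageserver code) showing how a remembered blob offset goes stale when the file is atomically replaced:

```rust
use std::fs::{self, File};
use std::io::{Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    fs::write("layer", b"AAAA|old-blob")?;
    let blob_offset = 5; // remembered in a variable, like blob_io offsets

    // Atomic replacement, as the old duplicate-layer path effectively did:
    fs::write("layer.tmp", b"AAAAAAAA|new-blob")?;
    fs::rename("layer.tmp", "layer")?; // atomic, but not coherency-preserving

    // Re-reading at the remembered offset no longer yields the old blob.
    let mut f = File::open("layer")?;
    f.seek(SeekFrom::Start(blob_offset))?;
    let mut buf = [0u8; 8];
    f.read_exact(&mut buf)?;
    assert_eq!(&buf, b"AAA|new-"); // not the b"old-blob" the offset referred to
    Ok(())
}
```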
# Solution

- Remove the test suite code pertaining to Problem 1.
- Move & rename the test suite code that actually tests the RFC-27 crash-consistent layer map.
- Remove the Pageserver code that handles duplicate layers too late (Problem 1).
- Use `RENAME_NOREPLACE` to prevent the rename inside `.finish()` from overwriting an existing layer file, and bail with an error if that happens (Problem 2). The `RENAME_NOREPLACE` idea was originally raised in #7707 (comment). A syscall-level sketch follows this list.
  - This bailing prevents the caller from even trying to insert into the layer map, as they never get a `struct Layer` at hand.
- Add `abort`s in the place where we hold the layer map lock and check for duplicates (Problem 2); a sketch of that check also follows this list.
  - Note again: we can't actually reach that code, because we bail from `.finish()` much earlier.
- Share the logic to clean up after a failed `.finish()` between image layers and delta layers (drive-by cleanup).
  - This exposed that the test `image_layer_rewrite` was overwriting layer files in place. Fix the test.
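A minimal sketch of a no-replace rename via the raw `renameat2` wrapper in the `libc` crate (Linux-only, glibc ≥ 2.28; the helper name is made up and this is not Neon's actual implementation):

```rust
use std::ffi::CString;
use std::io;

// Hypothetical helper: rename src -> dst, failing with EEXIST instead of
// silently replacing an existing destination file.
fn rename_noreplace(src: &str, dst: &str) -> io::Result<()> {
    let src = CString::new(src)?;
    let dst = CString::new(dst)?;
    // SAFETY: both pointers come from valid, NUL-terminated CStrings.
    let rc = unsafe {
        libc::renameat2(
            libc::AT_FDCWD,
            src.as_ptr(),
            libc::AT_FDCWD,
            dst.as_ptr(),
            libc::RENAME_NOREPLACE, // fail with EEXIST instead of replacing
        )
    };
    if rc == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error()) // EEXIST here means: duplicate layer file
    }
}
```

Surfacing that `EEXIST` as an error from `.finish()` means the caller never obtains a `struct Layer` for the duplicate, so nothing can be inserted into the layer map.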
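And a rough sketch of the defense-in-depth abort (the map type, `LayerName`, `Layer`, and `insert_layer` are stand-ins, not the real pageserver types):

```rust
use std::collections::BTreeMap;

// Stand-in types; the real layer map and key types live in the pageserver.
type LayerName = String;
struct Layer;

// Hypothetical helper: called while holding the layer map lock.
fn insert_layer(map: &mut BTreeMap<LayerName, Layer>, name: LayerName, layer: Layer) {
    if map.contains_key(&name) {
        // `.finish()` already fails duplicates via RENAME_NOREPLACE, so this
        // branch should be unreachable; if we get here it is a logic bug,
        // and crashing loudly beats serving possibly incoherent data.
        eprintln!("duplicate layer in layer map: {name}");
        std::process::abort();
    }
    map.insert(name, layer);
}
```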
# Future Work

This PR adds a new failure scenario that was previously "papered over" by the overwriting of layers:

1. Start a compaction that will produce 3 layers: A, B, C.
2. Layer A is `finish()`ed successfully.
3. Layer B fails mid-way at some `put_value()`.
4. Compaction bails out, sleeps 20s.
5. Some disk space gets freed in the meantime.
6. Compaction wakes from sleep; another iteration starts and attempts to write layer A again, but the `.finish()` **fails because A already exists on disk**.
The failure in step 6 is new with this PR, and it **causes the compaction to get stuck**. Before, it would silently overwrite the file and "successfully" complete the second iteration. The mitigation is to `/reset` the tenant.