Skip to content

Commit

Permalink
fix(l0_flush): drops permit before fsync, potential cause for OOMs (#…
Browse files Browse the repository at this point in the history
…8327)

## Problem

Slack thread:
https://neondb.slack.com/archives/C033RQ5SPDH/p1720511577862519

We're seeing OOMs in staging on a pageserver that has
l0_flush.mode=Direct enabled.

There's a strong correlation between jumps in `maxrss_kb` and
`pageserver_timeline_ephemeral_bytes`, so, it's quite likely that
l0_flush.mode=Direct is the culprit.

Notably, the expected max memory usage on that staging server by the
l0_flush.mode=Direct is ~2GiB but we're seeing as much as 24GiB max RSS
before the OOM kill.

One hypothesis is that we're dropping the semaphore permit before all
the dirtied pages have been flushed to disk. (The flushing to disk
likely happens in the fsync inside the `.finish()` call, because we're
using ext4 in data=ordered mode).

## Summary of changes

Hold the permit until after we're done with `.finish()`.
  • Loading branch information
problame committed Jul 9, 2024
1 parent 3f7aebb commit 9bb16c8
Showing 1 changed file with 11 additions and 5 deletions.
16 changes: 11 additions & 5 deletions pageserver/src/tenant/storage_layer/inmemory_layer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -715,16 +715,22 @@ impl InMemoryLayer {
res?;
}
}

// Hold the permit until the IO is done; if we didn't, one could drop this future,
// thereby releasing the permit, but the Vec<u8> remains allocated until the IO completes.
// => we'd have more concurrenct Vec<u8> than allowed as per the semaphore.
drop(_concurrency_permit);
}
}

// MAX is used here because we identify L0 layers by full key range
let delta_layer = delta_layer_writer.finish(Key::MAX, timeline, ctx).await?;

// Hold the permit until all the IO is done, including the fsync in `delta_layer_writer.finish()``.
//
// If we didn't and our caller drops this future, tokio-epoll-uring would extend the lifetime of
// the `file_contents: Vec<u8>` until the IO is done, but not the permit's lifetime.
// Thus, we'd have more concurrenct `Vec<u8>` in existence than the semaphore allows.
//
// We hold across the fsync so that on ext4 mounted with data=ordered, all the kernel page cache pages
// we dirtied when writing to the filesystem have been flushed and marked !dirty.
drop(_concurrency_permit);

Ok(Some(delta_layer))
}
}

1 comment on commit 9bb16c8

@github-actions
Copy link

@github-actions github-actions bot commented on 9bb16c8 Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3129 tests run: 3002 passed, 0 failed, 127 skipped (full report)


Flaky tests (2)

Postgres 14

  • test_basebackup_with_high_slru_count[github-actions-selfhosted-sequential-10-13-30]: release
  • test_basebackup_with_high_slru_count[github-actions-selfhosted-vectored-10-13-30]: release

Code coverage* (full report)

  • functions: 32.6% (6941 of 21292 functions)
  • lines: 50.0% (54547 of 109094 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
9bb16c8 at 2024-07-09T23:05:29.914Z :recycle:

Please sign in to comment.