Buffer Manager Eviction Rework #2137

Closed
ray6080 opened this issue Oct 3, 2023 · 0 comments · Fixed by #3620

ray6080 commented Oct 3, 2023

Overview

Currently we adopt a queue-based eviction policy: a concurrent queue keeps track of all marked pages as eviction candidates. This adds overhead to unpin, as each call needs to switch the unlocked page to marked and add it to the eviction queue.

The alternative solution, described in Virtual-Memory Assisted Buffer Management1, is to store all cached page identifiers in a fixed-size open-addressing hash table. The size of the hash table equals the number of pages that fit in DRAM (bounded by the buffer manager memory limit). Unlike the eviction queue, which is touched on every unpin, the hash table is not accessed during cache hits or unpin, but only during page faults and eviction.
Moreover, the hash table makes it easier to loop over cached pages and clean up their states from the BM, which we sometimes need to do in removeFilePagesFromFrames.

This issue proposes to replace the concurrent queue with a fixed-size hash table.
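For illustration, here is a minimal sketch of such a fixed-size table, assuming a (file index, page index) key, linear probing, and a capacity equal to the number of frames; CachedPageId, EvictionHashTable, and the other names are illustrative, not the actual BM types:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical identifier of a cached page: (file index, page index within file).
struct CachedPageId {
    uint32_t fileIdx;
    uint64_t pageIdx;
    bool operator==(const CachedPageId& o) const {
        return fileIdx == o.fileIdx && pageIdx == o.pageIdx;
    }
};

// Fixed-size open-addressing table with linear probing. Its capacity equals the
// number of frames the BM can hold, so it never grows and can be scanned slot by
// slot, which is what a CLOCK-style eviction relies on.
class EvictionHashTable {
public:
    explicit EvictionHashTable(size_t numFrames) : slots(numFrames) {}

    // Insert on page fault (when the page is brought into a frame).
    bool insert(CachedPageId id) {
        for (size_t i = 0, pos = hash(id); i < slots.size(); i++, pos = next(pos)) {
            if (!slots[pos].has_value()) {
                slots[pos] = id;
                return true;
            }
        }
        return false; // table full: the BM is at its memory limit
    }

    // Remove on eviction. (A production version needs tombstones or backward-shift
    // deletion to keep probe chains intact for lookups.)
    void remove(CachedPageId id) {
        for (size_t i = 0, pos = hash(id); i < slots.size(); i++, pos = next(pos)) {
            if (slots[pos].has_value() && *slots[pos] == id) {
                slots[pos].reset();
                return;
            }
        }
    }

    // Advance a clock hand over the slots and return the next cached page, if any.
    std::optional<CachedPageId> scanNext(size_t& clockHand) const {
        for (size_t i = 0; i < slots.size(); i++) {
            clockHand = (clockHand + 1) % slots.size();
            if (slots[clockHand].has_value()) {
                return slots[clockHand];
            }
        }
        return std::nullopt;
    }

private:
    size_t next(size_t pos) const { return (pos + 1) % slots.size(); }
    size_t hash(CachedPageId id) const {
        return std::hash<uint64_t>{}((uint64_t(id.fileIdx) << 40) ^ id.pageIdx) % slots.size();
    }
    std::vector<std::optional<CachedPageId>> slots;
};
```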

Page State and Eviction

The design of our buffer manager follows vmcache. See Section 3 of Virtual-Memory Assisted Buffer Management1 for a detailed description.

Page State

Each page is associated with a 64-bit state entry that encodes both a state and a version counter: 56 bits are used for the version counter, and the remaining 8 bits are used for the state, which is one of LOCKED, UNLOCKED, MARKED, or EVICTED. Any page must be in one of these four states. The version counter is used for optimistic reads. See our code comment here for more details.
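A minimal sketch of such a state entry, assuming the 8 state bits sit above the 56-bit version counter; the constants and method names are illustrative, not the actual definitions:

```cpp
#include <atomic>
#include <cstdint>

// 64-bit state entry: the top 8 bits hold the state (LOCKED / UNLOCKED / MARKED /
// EVICTED), the lower 56 bits hold the version counter used for optimistic reads.
class PageState {
public:
    static constexpr uint64_t UNLOCKED = 0, LOCKED = 1, MARKED = 2, EVICTED = 3;
    static constexpr uint64_t STATE_SHIFT = 56;
    static constexpr uint64_t VERSION_MASK = (1ull << STATE_SHIFT) - 1;

    static uint64_t getState(uint64_t entry) { return entry >> STATE_SHIFT; }
    static uint64_t getVersion(uint64_t entry) { return entry & VERSION_MASK; }

    // UNLOCKED -> LOCKED via CAS; fails if the page is not UNLOCKED or if the
    // entry changed concurrently.
    bool tryLock() {
        auto expected = stateAndVersion.load();
        if (getState(expected) != UNLOCKED) {
            return false;
        }
        auto desired = (LOCKED << STATE_SHIFT) | getVersion(expected);
        return stateAndVersion.compare_exchange_strong(expected, desired);
    }

    // LOCKED -> UNLOCKED; bumping the version lets optimistic readers detect that
    // the page content may have changed while they were reading it.
    void unlock() {
        auto current = stateAndVersion.load();
        auto newVersion = (getVersion(current) + 1) & VERSION_MASK;
        stateAndVersion.store((UNLOCKED << STATE_SHIFT) | newVersion);
    }

    // Optimistic read validation: check that the page is not locked and that the
    // version is unchanged compared to the entry read before accessing the frame.
    bool validateOptimisticRead(uint64_t entryBeforeRead) const {
        auto current = stateAndVersion.load();
        return getState(current) != LOCKED &&
               getVersion(current) == getVersion(entryBeforeRead);
    }

    // Pages start out EVICTED (not cached).
    std::atomic<uint64_t> stateAndVersion{EVICTED << STATE_SHIFT};
};
```

In this sketch, a reader that wants to access a page without latching it would load the entry, read the frame, and then call validateOptimisticRead, retrying from scratch if validation fails.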

Eviction Algorithm

This issue proposes to implement a hash table-based CLOCK algorithm.
The key idea is to find an old page to evict, not necessarily the oldest: we loop through cached pages in the hash table until we find one that hasn't been referenced since the last time we checked it.

Eviction happens in batches of 64 pages at a time and is performed in five steps (see the sketch below):

  1. Loop over the hash table and collect a batch of marked candidates.
  2. Write dirty pages (optionally, this can be done with async IO).
  3. Try to lock the clean candidates.
  4. Remove the locked pages from the page table using madvise.
  5. Remove the locked pages from the eviction hash table and unlock them.

Note: currently we already ensure dirty pages are locked, so there is no need to lock and upgrade dirty pages as described in vmcache.

See code here for implementation details in vmcache.
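As a rough companion to the vmcache code, here is a minimal sketch of the five steps above, building on the hash table and page-state sketches from the earlier sections; flushDirtyPage, releaseFrame, and the other helpers are illustrative stand-ins, not the actual BM calls:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Illustrative stand-ins for the real BM interfaces.
struct CachedPageId { // mirrors the key type from the Overview sketch
    uint32_t fileIdx;
    uint64_t pageIdx;
};
class EvictionHashTable; // fixed-size open-addressing table from the Overview sketch
class PageState;         // 64-bit state + version entry from the Page State sketch

PageState& getPageState(const CachedPageId& id);
bool isMarked(const PageState& state);
void tryMark(PageState& state);              // UNLOCKED -> MARKED ("second chance")
bool tryLock(PageState& state);              // MARKED -> LOCKED, fails if the page got re-pinned
void setEvictedAndUnlock(PageState& state);  // LOCKED -> EVICTED, bump the version
bool isDirty(const CachedPageId& id);
void flushDirtyPage(const CachedPageId& id);  // write back; could be batched with async IO
void releaseFrame(const CachedPageId& id);    // madvise(MADV_DONTNEED) on the frame's memory
std::optional<CachedPageId> scanNext(EvictionHashTable& table, size_t& clockHand);
void removeFromTable(EvictionHashTable& table, const CachedPageId& id);

static constexpr size_t EVICTION_BATCH_SIZE = 64;

void evictBatch(EvictionHashTable& table, size_t& clockHand) {
    // Step 1: scan from the clock hand and collect up to 64 MARKED candidates.
    // UNLOCKED pages seen on the way are downgraded to MARKED so they get one more
    // round before becoming candidates; the scan is bounded so we do not spin when
    // most pages are pinned.
    std::vector<CachedPageId> candidates;
    for (size_t scanned = 0;
         scanned < 16 * EVICTION_BATCH_SIZE && candidates.size() < EVICTION_BATCH_SIZE; scanned++) {
        auto id = scanNext(table, clockHand);
        if (!id) {
            break;
        }
        auto& state = getPageState(*id);
        if (isMarked(state)) {
            candidates.push_back(*id);
        } else {
            tryMark(state);
        }
    }

    // Steps 2 and 3: write dirty candidates back to disk (optionally with async IO)
    // and try to lock the clean ones. Dirty pages are already LOCKED in the current
    // BM (see the note above), so they need no lock upgrade; clean candidates that
    // got re-referenced in the meantime simply stay cached.
    std::vector<CachedPageId> toEvict;
    for (const auto& id : candidates) {
        if (isDirty(id)) {
            flushDirtyPage(id);
            toEvict.push_back(id);
        } else if (tryLock(getPageState(id))) {
            toEvict.push_back(id);
        }
    }

    // Step 4: drop the physical memory backing the locked frames, e.g. via
    // madvise(MADV_DONTNEED) as in vmcache.
    for (const auto& id : toEvict) {
        releaseFrame(id);
    }

    // Step 5: remove the evicted pages from the eviction hash table and flip their
    // state entries to EVICTED, which also releases the lock.
    for (const auto& id : toEvict) {
        removeFromTable(table, id);
        setEvictedAndUnlock(getPageState(id));
    }
}
```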

Changes to BM

  1. Remove addToEvictionQueue from unpin.
  2. Rework claimAFrame to be built on top of the hash table-based batch eviction (see the sketch below).
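A rough sketch of the shape these two paths could take, assuming the evictBatch sketch above; the helpers (reserveMemory, mapAndReadPage, etc.) are illustrative stand-ins, not the actual BM signatures:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative stand-ins; see the earlier sketches for the key type, the eviction
// hash table, the page state entry, and evictBatch.
struct CachedPageId {
    uint32_t fileIdx;
    uint64_t pageIdx;
};
class EvictionHashTable;
class PageState;

PageState& getPageState(const CachedPageId& id);
void setUnlocked(PageState& state);           // LOCKED -> UNLOCKED, bump the version
bool reserveMemory(size_t numBytes);          // false when the BM is at its memory limit
void evictBatch(EvictionHashTable& table, size_t& clockHand);
void insertIntoTable(EvictionHashTable& table, const CachedPageId& id);
void mapAndReadPage(const CachedPageId& id);  // fault the page into its frame

// unpin no longer touches any eviction structure: it only flips the page back to
// UNLOCKED; the CLOCK scan will mark it later if it stays unreferenced.
void unpin(const CachedPageId& id) {
    setUnlocked(getPageState(id));
}

// claimAFrame reserves memory for a page about to be cached; when the reservation
// fails, it runs one batch eviction over the hash table and retries. The hash table
// is only touched here, on the page-fault path, never on cache hits or unpin.
void claimAFrame(EvictionHashTable& table, size_t& clockHand, const CachedPageId& id,
    size_t pageSize) {
    while (!reserveMemory(pageSize)) {
        evictBatch(table, clockHand); // frees up to 64 frames per call
    }
    mapAndReadPage(id);
    insertIntoTable(table, id);
}
```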

Suggested Testing

The BM is critical to the system and complicated enough that it is not easily tested. We suggest the following ways (but not limited to these) to test it when opening a PR to fix this issue. Using a logger is a good way to verify the correctness of the algorithm, and setting the eviction batch size to 1 can simplify log tracing.

Existing tests
After changes, existing tests should still be able to pass (make test).

Small BM size
Try to set the BM to a very small size to ensure the BM correctly bookkeeps memory usage. For example, the query MATCH (c:Comment) RETURN min(c.length) requires two pinned 256KB pages, and one extra 4KB page should be enough to evaluate it on the LDBC1 comment table with a single thread.

Repeated query with large BM size
Given a large enough BM buffer, run the same query twice: the first run will cache all necessary pages, and the second run should trigger no reads from files and no evictions. Alternatively, run the same query concurrently: each page should be read from file only once, and no evictions should be triggered.

Repeated query with limited BM size
Given a limited BM buffer, for example around half of the needed disk pages, run the same query twice. The first run will read all pages from file and also evict around half of them. The second run will evict pages cached by the first run and re-read pages from file. If carefully designed, it's possible to observe each page being read twice from file and evicted at least once.

Detailed CLOCK behaviour tracing
Given a small BM buffer, for example 1MB, run a simple scan query (for example, MATCH (c:Comment) RETURN min(c.length)) in a single thread, and log changes to page states (e.g., pinning a page to LOCKED and adding it to the hash table, unpinning a page to UNLOCKED, marking unlocked pages as MARKED, locking marked pages to evict them, removing evicted pages and unlocking them, etc.) together with the clock hand position in the cache hash table. The trace should be verifiable against a careful simulation of the CLOCK behaviour.
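One possible shape for such a trace line, with an illustrative logging helper (the exact format does not matter, as long as every state transition and the clock hand position show up in order):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Illustrative page states for tracing; mirrors LOCKED/UNLOCKED/MARKED/EVICTED.
enum class TracedState { LOCKED, UNLOCKED, MARKED, EVICTED };

inline const char* toString(TracedState s) {
    switch (s) {
    case TracedState::LOCKED:
        return "LOCKED";
    case TracedState::UNLOCKED:
        return "UNLOCKED";
    case TracedState::MARKED:
        return "MARKED";
    case TracedState::EVICTED:
        return "EVICTED";
    }
    return "UNKNOWN";
}

// Emits lines such as "[BM] page (file=2, page=17): UNLOCKED -> MARKED, clockHand=5",
// which can be replayed against a hand simulation of the CLOCK algorithm.
inline void tracePageTransition(uint32_t fileIdx, uint64_t pageIdx, TracedState from,
    TracedState to, size_t clockHand) {
    std::fprintf(stderr, "[BM] page (file=%u, page=%llu): %s -> %s, clockHand=%zu\n", fileIdx,
        (unsigned long long)pageIdx, toString(from), toString(to), clockHand);
}
```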

Large scale queries on LDBC100
Run our benchmark queries on the LDBC100 dataset to validate that the rework holds up on a reasonably large dataset with multiple threads.

Footnotes

  1. Leis, Viktor, et al. "Virtual-Memory Assisted Buffer Management." SIGMOD'23.
