
x86: mmu/demand_paging: supports LRU eviction algorithm #75140

Draft
wants to merge 2 commits into main

Conversation

dcpleung
Member

The LRU (Least Recently Used) eviction algorithm requires additional accounting, so add the code for that.

Fixes #75132

@dcpleung
Member Author

@npitre I actually cannot run this with LRU most of the time, as the LRU code would assert on pf_idx != 0 during boot at mark_linker_section_pinned().

@npitre
Collaborator

npitre commented Jun 28, 2024 via email

@dcpleung dcpleung force-pushed the x86/demand_paging_support_lru branch from 474a06f to f74c249 Compare August 19, 2024 19:05
npitre
npitre previously approved these changes Aug 19, 2024
nashif
nashif previously approved these changes Aug 22, 2024
This adds bits to simulate LRU in software for architectures
that do not provide a native means to figure out the least
recently used pages. Enabling this changes the eviction
selection to clear the access flag if it is set and requeue
the page frame for later consideration. Without this, the
LRU would simply devolve into picking the earliest paged-in
frame for eviction.

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
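The software-simulated selection described in this commit message amounts to a second-chance scan over the eviction queue. A minimal sketch in C (the structure and function names here are hypothetical illustrations, not the actual Zephyr code):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page-frame record; illustrative only. */
struct frame {
    bool accessed;          /* set by hardware on access, cleared by software */
    struct frame *next;
};

/* Second-chance selection: a frame whose accessed flag is set gets
 * the flag cleared and is requeued at the tail; the first frame
 * found with the flag already clear becomes the eviction victim. */
struct frame *select_victim(struct frame **head_p, struct frame **tail_p)
{
    for (;;) {
        struct frame *f = *head_p;

        if (f == NULL) {
            return NULL;            /* empty queue */
        }
        *head_p = f->next;
        f->next = NULL;
        if (*head_p == NULL) {
            *tail_p = NULL;         /* queue is now empty */
        }
        if (!f->accessed) {
            return f;               /* not recently used: evict */
        }
        /* second chance: clear the flag and requeue at the tail */
        f->accessed = false;
        if (*tail_p == NULL) {
            *head_p = f;
        } else {
            (*tail_p)->next = f;
        }
        *tail_p = f;
    }
}
```

Note the loop terminates even when every frame is marked accessed: after one full pass all flags are clear, so the (original) head is then selected.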
The LRU (Least Recently Used) eviction algorithm requires
additional accounting, so add the code for that.

Fixes zephyrproject-rtos#75132

Signed-off-by: Daniel Leung <daniel.leung@intel.com>
@dcpleung dcpleung dismissed stale reviews from nashif and npitre via 8cabd9a August 27, 2024 19:19
@dcpleung dcpleung force-pushed the x86/demand_paging_support_lru branch from f74c249 to 8cabd9a Compare August 27, 2024 19:19
@dcpleung dcpleung marked this pull request as ready for review August 27, 2024 19:19
@dcpleung
Member Author

Finally got it all working and passing twister with LRU enabled.

@npitre
Collaborator

npitre commented Aug 27, 2024

I'm wondering about CONFIG_EVICTION_LRU_SW_SIMULATED that you added.

If I understand it correctly, x86 cannot trigger an exception when a page
has its accessed bit unset, and the bit is set by the hardware on actual
access. Am I right?

@dcpleung
Member Author

Yes. The hardware only sets the bit when a page is accessed. It is up to software to clear the bit.
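On 32-bit x86, the Accessed flag is bit 5 of a page-table entry: the CPU sets it on any access to the page, and only software ever clears it. A minimal test-and-clear sketch (the helper name is hypothetical, not Zephyr's actual MMU API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 5 of an x86 PTE is the Accessed (A) flag, per the Intel SDM. */
#define PTE_ACCESSED  (1u << 5)

/* Returns whether the page was touched since software last cleared
 * the flag, clearing it again as a side effect. */
static bool pte_test_and_clear_accessed(uint32_t *pte)
{
    bool was_accessed = (*pte & PTE_ACCESSED) != 0;

    *pte &= ~PTE_ACCESSED;
    /* On real hardware the TLB entry for this page would also need
     * invalidating so the CPU re-sets the flag on the next access. */
    return was_accessed;
}
```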

@npitre
Collaborator

npitre commented Aug 27, 2024

Yes. The hardware only sets the bit when a page is accessed. It is up to
software to clear the bit.

In that case... can't you tweak things a little so that an "unaccessed" page
is made inaccessible, so an actual exception is triggered when it is
accessed? The crux of this code really depends on
k_mem_paging_eviction_accessed() being invoked each time a page changes
state, and in the proper order.

You might have to use another flag to distinguish between genuinely
inaccessible pages and those made inaccessible for the purpose of catching
the next access to them, but that's simple enough. On ARM64, a page mapping
may be inaccessible for 3 reasons: it is invalid, the page is paged out, or
the page is still unaccessed.

The problems I see with CONFIG_EVICTION_LRU_SW_SIMULATED is that you lose
the realtime usage sorting. It is no longer the least recently used page
you're getting but rather some semi-random unused one. You also lose the
O(1) property. And when all pages have been used then this more or less
degrades to the NRU behavior.

@dcpleung
Member Author

In that case... can't you tweak things a little so that an "unaccessed" page is made inaccessible, so an actual exception is triggered when it is accessed? The crux of this code really depends on k_mem_paging_eviction_accessed() being invoked each time a page changes state, and in the proper order.

You might have to use another flag to distinguish between genuinely inaccessible pages and those made inaccessible for the purpose of catching the next access to them, but that's simple enough. On ARM64, a page mapping may be inaccessible for 3 reasons: it is invalid, the page is paged out, or the page is still unaccessed.

That would require clearing the present bit so that any access would generate a page fault. However, handling page fault is expensive on x86. There is no shortcut or "fast interrupt" way of dealing with it.

The problems I see with CONFIG_EVICTION_LRU_SW_SIMULATED is that you lose the realtime usage sorting. It is no longer the least recently used page you're getting but rather some semi-random unused one. You also lose the O(1) property. And when all pages have been used then this more or less degrades to the NRU behavior.

A more complete and expensive version would be to rearrange the queue periodically. So instead of tracking usage in real time, we sample at a fixed interval. Any pages that have been accessed in this period are moved to the end of the queue, so pages that have not been used for a while gradually drift toward the front. I did it the way it is in this PR to avoid the full rearranging, as that would require locking the whole queue; instead the code does it partially each time an eviction is needed.
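The periodic variant described above could look roughly like this (a sketch with hypothetical names; as noted, the real code would need the queue lock held for the entire pass, which is the cost this PR tries to avoid):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page-frame record; illustrative only. */
struct frame {
    bool accessed;
    struct frame *next;
};

/* One sampling pass, run at a fixed interval: every frame accessed
 * during the last period moves to the tail with its flag cleared,
 * so frames that stay idle drift toward the head over successive
 * passes. */
void sampling_pass(struct frame **head_p, struct frame **tail_p)
{
    struct frame *prev = NULL;
    struct frame *f = *head_p;
    struct frame *stop = *tail_p;   /* don't revisit requeued frames */

    while (f != NULL) {
        struct frame *next = f->next;

        if (f->accessed) {
            f->accessed = false;
            if (f != *tail_p) {
                /* unlink f and append it at the tail */
                if (prev == NULL) {
                    *head_p = next;
                } else {
                    prev->next = next;
                }
                f->next = NULL;
                (*tail_p)->next = f;
                *tail_p = f;
            }
        } else {
            prev = f;
        }
        if (f == stop) {
            break;
        }
        f = next;
    }
}
```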

@npitre
Collaborator

npitre commented Aug 28, 2024

That would require clearing the present bit so that any access would
generate a page fault. However, handling page fault is expensive on x86.
There is no shortcut or "fast interrupt" way of dealing with it.

But how expensive is it really?

A more complete and expensive version would be to rearrange the queue
periodically. So instead of tracking usage in real time, we sample at a
fixed interval. Any pages that have been accessed in this period are
moved to the end of the queue, so pages that have not been used for a
while gradually drift toward the front.

And that is certainly going to be more expensive than as-needed page fault
handling, no?

@dcpleung
Member Author

dcpleung commented Aug 28, 2024

That would require clearing the present bit so that any access would
generate a page fault. However, handling page fault is expensive on x86.
There is no shortcut or "fast interrupt" way of dealing with it.

But how expensive is it really?

The entrance and exit assembly code before and after any C code is about 70 instructions (in the worst case). Most of them are push/pop instructions needing access to main memory, so latency would be even worse if that memory is not already in the L1 cache. I took a quick glance at the C page fault handler, and I think marking a page as accessed can be done early in that function and should not add too much delay. Then, of course, there is the function call into the LRU code to mark the page accessed.

With all that said, however, I believe that any memory access to pages already in physical memory should not incur any additional cost; it should act as if demand paging is not enabled.

A more complete and expensive version would be to rearrange the queue
periodically. So instead of tracking usage in real time, we sample at a
fixed interval. Any pages that have been accessed in this period are
moved to the end of the queue, so pages that have not been used for a
while gradually drift toward the front.

And that is certainly going to be more expensive than as-needed page fault handling, no?

That would depend on how often one goes through the accessed page marking.

Though, once you need to involve the backing store (i.e. paging memory in or out), the cost is already high. The worst case is needing to evict a dirty page, which requires writing data back to the backing store.

I guess in the end it is a game of where we stash the big baggage of processing.

@npitre
Collaborator

npitre commented Aug 28, 2024

I guess in the end it is a game of where we stash the big baggage of
processing.

Here's what my scenario looks like.

RAM size is 8MB. Firmware size is 16MB or more. About half of RAM is
dedicated to demand-paging of non-resident firmware text+rodata. That
represents 1024 page frames.

At the beginning, many page frames will be free, so for a while there won't
be any eviction. Pages will be paged in and they will be accessed right away,
given that it is the access that paged them in. No point making them
unaccessed/inaccessible.

When free page frames are exhausted, the eviction algorithm is invoked.
But now, with CONFIG_EVICTION_LRU_SW_SIMULATED, you have to loop over 1000
frames to find a victim, as they're all marked "accessed". The larger the
memory in this scenario, the more painful this loop will be. Here's your big
baggage of processing right there, with a huge unexpected spike on your
latency graph, as this is also done with a lock held.

Without CONFIG_EVICTION_LRU_SW_SIMULATED it is the head frame that is
returned. Initially it would correspond to the first page that was ever
queued. Same as above but without any loop. And the new head frame is made
inaccessible so k_mem_paging_eviction_accessed() can be invoked if
this page is accessed again before the next eviction request.

So maybe the page fault to invoke k_mem_paging_eviction_accessed() adds
some extra processing, but it is a "minor" fault with a bounded, constant
overhead spread over time rather than bunched in a loop, which is arguably
nicer behavior for an RTOS.

Maybe at least making this mode of operation an option on x86?

@dcpleung
Member Author

I need some clarification here... From your description, k_mem_paging_eviction_accessed() is only called once, after the page is pulled in and is accessed the first time? But the access is guaranteed here due to the page fault: a page is evicted and another page is pulled in because we need to access memory in the latter page. So there is no need to mark it inaccessible and wait for the first access.

Moreover, to fulfill the least-recently-used scheme, k_mem_paging_eviction_accessed() needs to be called frequently enough (ideally, on every access). If it is only called once when first accessed, the eviction would just be selecting the oldest page.

@npitre
Collaborator

npitre commented Aug 28, 2024

I need some clarification here... From your description,
k_mem_paging_eviction_accessed() is only called once, after the page is
pulled in and is accessed the first time? But the access is guaranteed
here due to the page fault: a page is evicted and another page is pulled
in because we need to access memory in the latter page. So there is no
need to mark it inaccessible and wait for the first access.

Well, given that the latter page is appended to the LRU queue with
k_mem_paging_eviction_add() before execution is resumed, there is no need
to make the new page inaccessible for the purpose of trapping the first
access in order to call k_mem_paging_eviction_accessed(), as the end result
would be the same. So k_mem_paging_eviction_accessed() does not have to be
called on the first access.

Moreover, to fulfill the least-recently-used scheme,
k_mem_paging_eviction_accessed() needs to be called frequently enough
(ideally, on every access). If it is only called once when first accessed,
the eviction would just be selecting the oldest page.

Of course, "every access" is unrealistic. And yes, the first time a page
eviction is required, it is the oldest page that is selected. But that's where
the actual sorting starts. When the page at the head of the queue is selected,
the new head page is then made inaccessible. If that head page is still in
use, it will fault before the next eviction selection event and then be moved
to the queue tail as "recently used" by k_mem_paging_eviction_accessed(),
and the new head is made inaccessible for it to be tested. So the next time
a page eviction is needed, the head page is likely not to have been used.

Now only the head page is made inaccessible to test whether it is
actually used before it is reclaimed. But we could have more of them. If, say,
half of the queued pages on the head side were made inaccessible, then you'd
get very close to real LRU (full LRU would imply all but one page made
inaccessible). But the page fault overhead might then become too much,
so I went for the simplest approach, i.e. only the head page, for now.
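The head-trap scheme described in this comment can be sketched as follows (hypothetical names, not the real Zephyr eviction API; the `inaccessible` flag stands in for clearing the PTE's present/valid bit):

```c
#include <stdbool.h>
#include <stddef.h>

struct frame {
    bool inaccessible;      /* mapping disabled to trap the next access */
    struct frame *next;
};

struct queue {
    struct frame *head, *tail;
};

static void arm_head_trap(struct queue *q)
{
    if (q->head != NULL) {
        q->head->inaccessible = true;   /* stand-in for a PTE update */
    }
}

/* O(1) eviction selection: the head is always the victim, and the
 * new head becomes the next frame under test. */
struct frame *evict_select(struct queue *q)
{
    struct frame *victim = q->head;

    if (victim != NULL) {
        q->head = victim->next;
        victim->next = NULL;
        if (q->head == NULL) {
            q->tail = NULL;
        }
        arm_head_trap(q);
    }
    return victim;
}

/* Called from the "minor" fault taken on a trapped head: the frame is
 * still in use, so requeue it at the tail and arm the next head. */
void eviction_accessed(struct queue *q, struct frame *f)
{
    if (f == NULL || f != q->head) {
        return;
    }
    f->inaccessible = false;
    q->head = f->next;
    f->next = NULL;
    if (q->head == NULL) {
        q->head = f;            /* single-frame queue */
    } else {
        q->tail->next = f;
    }
    q->tail = f;
    arm_head_trap(q);
}
```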

@dcpleung
Member Author

Ah... okay. Let me see what I can do.

@dcpleung dcpleung marked this pull request as draft August 28, 2024 20:57
@dcpleung
Member Author

Hm... it is getting tricky on x86. The bits in the PTE that are ignored by the CPU have already been re-purposed, so there are no available bits to encode the "not present but still valid" state such that k_mem_paging_eviction_accessed() can be invoked on that page. So it would need another data structure to keep track of this info. And since we are doing demand paging, memory is already at a premium, so that is a hard sell. Not to mention this extra data structure would also be subject to eviction, which adds more processing time.

Also, piggybacking the "clear the access bit of the queue head" step onto arch_page_info_get() may not be such a good idea after all. You can clear the access bit without the page being at the head of the queue. I will comment on this with more details on the demand paging PR.

@npitre
Collaborator

npitre commented Aug 31, 2024

Hm... It is getting tricky on x86. Bits in PTE that are ignored by the
CPU have already been re-purposed, so there are no available bits to
encode the "not present but still valid" state such that
k_mem_paging_eviction_accessed() can be invoked on that page.

How do you manage, then, to support evicted pages? Such a page must fault,
yet its PTE must still hold the backing store location cookie and the
read/write/exec permissions for that page once loaded back in.

So it would need another data structure to keep track of this info.

Nah... you must be able to do it in the PTE, otherwise it isn't worth it.

@dcpleung
Member Author

dcpleung commented Sep 3, 2024

Hm... It is getting tricky on x86. Bits in PTE that are ignored by the
CPU have already been re-purposed, so there are no available bits to
encode the "not present but still valid" state such that
k_mem_paging_eviction_accessed() can be invoked on that page.

How do you manage, then, to support evicted pages? Such a page must fault, yet its PTE must still hold the backing store location cookie and the read/write/exec permissions for that page once loaded back in.

It is not about evicted pages, but about supporting the "exception when the access bit changes from 0 to 1". The other bits of the PTE need to be preserved in this case. Only the present bit is cleared, to generate a "fake" page fault in which k_mem_paging_eviction_accessed() can be called. Then the present bit is flipped back and the PTE is valid again. There are some bits reserved in the PTE that are ignored by the CPU and are there only for software to use, even when the PTE is valid and being actively used. However, those bits have already been used for other purposes in Zephyr. There are no other bits in the PTE available which could encode the above-mentioned "fake" state.
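For illustration, here is what the "fake" not-present state would look like on an architecture that still has a spare software-available PTE bit; the point of the comment above is that 32-bit x86 Zephyr has no such bit left, so the bit position here is purely hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT     (1u << 0)
#define PTE_SW_TRAPPED  (1u << 9)   /* hypothetical spare software bit */

/* Arm the access trap: clear Present so any access faults, and mark
 * the PTE so the fault handler knows this is not a paged-out page.
 * (A TLB flush for the page would also be needed on real hardware.) */
static void pte_arm_access_trap(uint32_t *pte)
{
    *pte = (*pte & ~PTE_PRESENT) | PTE_SW_TRAPPED;
}

/* Returns true if the fault was only the access trap, in which case
 * the handler would call k_mem_paging_eviction_accessed() and resume;
 * false means a genuine demand-paging fault. */
static bool pte_handle_fault(uint32_t *pte)
{
    if (*pte & PTE_SW_TRAPPED) {
        *pte = (*pte | PTE_PRESENT) & ~PTE_SW_TRAPPED;
        return true;
    }
    return false;
}
```

All other PTE bits (physical address, permissions) pass through untouched, which is the "other bits need to be preserved" requirement from the comment above.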

@npitre
Collaborator

npitre commented Sep 3, 2024 via email

@dcpleung
Member Author

dcpleung commented Sep 4, 2024

Using bit 31 is not going to work. There is an internal project which maps virtual addresses in the upper 2GB range (more towards the end of the 4GB space).

Re-purposing the PAT bit is not going to work either. In z_x86_page_fault_handler(), inside #ifdef CONFIG_DEMAND_PAGING, there is #ifdef CONFIG_X86_KPTI. This means that demand paging can function with kernel page table isolation, so that bit cannot be re-purposed again.

@npitre
Collaborator

npitre commented Sep 4, 2024

Using bit 31 is not going to work. There is an internal project which maps
virtual addresses in the upper 2GB range (more towards the end of the 4GB space).

This would be bit 31 of a physical address, not a virtual one.

Another possibility: is there an actual legitimate case where MMU_US_ORIG
would ever be set and overlap with a user partition? That would be rather
weird.

@dcpleung
Member Author

dcpleung commented Sep 4, 2024

Yep... physical addresses too. The memory map is all over the place... :(

I'll need to look into MMU_US_ORIG to see if it can be re-purposed.

Labels
area: X86 x86 Architecture (32-bit)
Development

Successfully merging this pull request may close these issues.

x86: missing call to k_mem_paging_eviction_accessed()
5 participants