Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure

From: Gregory Price

Date: Mon May 11 2026 - 10:33:24 EST

On Mon, May 11, 2026 at 03:32:20PM +0530, Bharata B Rao wrote:
>
>
> On 06-May-26 8:52 PM, Gregory Price wrote:
> > On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote:
> >> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
> >>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The
> >>
> >> I continue to think we should not do this.
> >
> > My only pushback on the general "we should not do this" is that we need
> > something to counter-balance the demotion bit in vmscan.c, and the
> > current implementation (prot_none faults) is rather :[
>
> So you are saying pghot subsystem currently does hot page detection and
> promotion only, which is fine. But the current implementation of demotion is not
> very optimal and hence we should spend effort in fine-tuning demotion first?
>

I'm saying that because of demotion and fallbacks, we need a mechanism to
handle promotions. I'm not convinced a hotness tracker will extend to
tracking coldness - at least not any better than LRU/MGLRU does.

> In this series itself I have shown via benchmark numbers that for over-committed
> cases (involving both demotion and promotion), the workload isn't really showing
> real benefit due to demotion and promotion. Are you specifically referring to
> this problem?
>

If over-committed means an over-subscribed hot tier (more hot memory than
available top-tier memory), then yeah, that result is intuitive. I
haven't pointed to any specific issue yet; I'm still taking time to
consider some of the results.

>
> Can you provide more context about the LRU inversion problem?
>

I've been tracking some data around shrink_folio_list and
alloc_migrate_folio behavior when a low tier node is full.

The result is that we end up sending memory from the top tier straight
to swap and skipping demotion, which causes a bunch of file and anon
refaults.

Hardware: Single Socket, 768GB DRAM, 256GB CXL Expander

In this workload, swap only comes into use after the full 1TB of memory
(768GB DRAM + 256GB CXL) is utilized; past that point we spill to swap.

second_chance = second alloc attempt in alloc_migrate_folio succeeds
swap_fallback = second chance fails, we swap directly from top tier
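
To make the two counters concrete, here is a rough userspace model of the
allocation order in alloc_migrate_folio and the fallback accounting in
shrink_folio_list. It is only a sketch: struct tier_model, try_demote and
demote_batch are made-up names for illustration, the free-page fields
stand in for the real allocator, and the !sc->proactive check from the
diff below is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not kernel code. */
struct tier_model {
	unsigned long target_free;	/* free pages on the preferred target node */
	unsigned long other_free;	/* free pages on other allowed lower-tier nodes */
	unsigned long second_chance;	/* counted in pages */
	unsigned long swap_fallback;	/* counted in occurrences, not pages */
};

/* Returns true if a folio of nr_pages found a home on a lower tier. */
static bool try_demote(struct tier_model *m, unsigned long nr_pages)
{
	/* First attempt: preferred target node only (__GFP_THISNODE). */
	if (m->target_free >= nr_pages) {
		m->target_free -= nr_pages;
		return true;
	}
	/* Second attempt: any node in the allowed mask. */
	if (m->other_free >= nr_pages) {
		m->other_free -= nr_pages;
		m->second_chance += nr_pages;
		return true;
	}
	return false;
}

/* Returns the number of pages that could not be demoted in this batch. */
static unsigned long demote_batch(struct tier_model *m,
				  const unsigned long *nr_pages, size_t n)
{
	unsigned long failed = 0;
	size_t i;

	assert(m && nr_pages);
	for (i = 0; i < n; i++)
		if (!try_demote(m, nr_pages[i]))
			failed += nr_pages[i];

	/* One event per batch with leftovers, however many pages it held. */
	if (failed)
		m->swap_fallback++;
	return failed;
}
```

With target_free = 1, other_free = 2 and a batch of folios sized 1, 2, 4
and 8 pages, the first folio fits the target, the second lands via the
second attempt, and the last two fall back to normal reclaim as a single
swap_fallback event covering 12 pages.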

Sample data:

pgdemote_kswapd 333052779
pgdemote_direct 3181480482
pgdemote_second_chance 31017629
pgdemote_swap_fallback 335759535
workingset_refault_anon 30106868
workingset_refault_file 2343035341

(note here: swap fallback is the number of occurrences, while the others
are numbers of pages. As a result, the actual number of swapped pages is
likely much closer to the pgdemote_direct number)
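
If you want to pull the same counters yourself, something like the helper
below should do. It is a minimal sketch that assumes the usual
"name value" per-line layout of /proc/vmstat; vmstat_lookup is a
hypothetical name, and pgdemote_second_chance / pgdemote_swap_fallback
only exist with the debug diff at the end of this mail applied.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in /proc/vmstat-style text ("name value" per line).
 * Returns 0 and stores the value on success, -1 if the name is absent.
 */
static int vmstat_lookup(const char *text, const char *name,
			 unsigned long long *val)
{
	char key[64];
	unsigned long long v;
	int used;

	assert(text && name && val);
	/* %n records how many characters this name/value pair consumed. */
	while (sscanf(text, " %63s %llu%n", key, &v, &used) == 2) {
		if (strcmp(key, name) == 0) {
			*val = v;
			return 0;
		}
		text += used;
	}
	return -1;
}
```

Read /proc/vmstat into a buffer and call it once per counter. Keep the
units in mind when comparing the results: the fallback counter is in
events, the rest are in pages.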

As a result: LRU is just broken on CXL systems; it inverts by design.

In a sane world we would just see the second tier as an extension of the
LRU, but that doesn't necessarily mean we can glean hotness data from it
(it's still largely a coldness tracking mechanism).

I have patches I haven't RFC'd yet that try to address this, but I need
more time to test them.

I don't think this is something to address with PGHot.

---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 112983b42559..ccdd698c5937 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1043,7 +1043,10 @@ struct folio *alloc_migrate_folio(struct folio *src, unsigned long private)
 	mtc->gfp_mask &= ~__GFP_THISNODE;
 	mtc->nmask = allowed_mask;

-	return alloc_migration_target(src, (unsigned long)mtc);
+	dst = alloc_migration_target(src, (unsigned long)mtc);
+	if (dst)
+		count_vm_events(PGDEMOTE_SECOND_CHANCE, folio_nr_pages(src));
+	return dst;
 }

 /*
@@ -1616,6 +1619,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* Folios that could not be demoted are still in @demote_folios */
 	if (!list_empty(&demote_folios)) {
 		/* Folios which weren't demoted go back on @folio_list */
+		if (!sc->proactive)
+			count_vm_event(PGDEMOTE_SWAP_FALLBACK);
 		list_splice_init(&demote_folios, folio_list);

 		/*