Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF

From: Barry Song

Date: Tue Apr 28 2026 - 18:26:53 EST


On Wed, Apr 29, 2026 at 2:55 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Sat, Apr 18, 2026 at 8:03 PM Barry Song (Xiaomi) <baohua@xxxxxxxxxx> wrote:
> >
> > MGLRU gives high priority to folios mapped in page tables.
> > As a result, folio_set_active() is invoked for all folios
> > read during page faults. In practice, however, readahead
> > can bring in many folios that are never accessed via page
> > tables.
> >
> > A previous attempt by Lei Liu proposed introducing a separate
> > LRU for readahead[1] to make readahead pages easier to reclaim,
> > but that approach is likely over-engineered.
> >
> > Before commit 4d5d14a01e2c ("mm/mglru: rework workingset
> > protection"), folios with PG_active were always placed in
> > the youngest generation, leading to over-protection and
> > increased refaults. After that commit, PG_active folios
> > are placed in the second youngest generation, which is
> > still too optimistic given the presence of readahead. In
> > contrast, the classic active/inactive scheme is more
> > conservative.
> >
> > This patch switches to folio_mark_accessed(). If
> > folio_check_references() later detects referenced PTEs,
> > the folio will be promoted based on the reference flag
> > set by folio_mark_accessed().
> >
> > The following uses a simple model to demonstrate why the current
> > code is not ideal. It runs fio-3.42 in a memcg, reading a file in a
> > strided pattern—4KB every 64KB—to simulate prefaulted pages that may
> > not be accessed.
> >
> > #!/bin/bash
> >
> > CG_NAME="mglru_verify_test"
> > CG_PATH="/sys/fs/cgroup/$CG_NAME"
> > MEM_LIMIT="400M"
> > HOT_SIZE="600M"
> >
> > # 1. Environment Setup
> > sudo rmdir "$CG_PATH" 2>/dev/null
> > sudo mkdir -p "$CG_PATH"
> > sudo chown -R $USER:$USER "$CG_PATH"
> > echo "$MEM_LIMIT" > "$CG_PATH/memory.max"
> >
> > # 2. Prepare Data Files
> > dd if=/dev/urandom of=hot_data.bin bs=1M count=600 conv=notrunc 2>/dev/null
> > sync
> > echo 3 > /proc/sys/vm/drop_caches
> >
> > # 3. Start Workload (Working Set)
> > (
> > echo $BASHPID > "$CG_PATH/cgroup.procs"
> > exec ./fio-3.42 --name=hot_ws --rw=read --bs=4K --size=$HOT_SIZE --runtime=600 \
> > --zonemode=strided --zonesize=4K --zonerange=64K \
> > --time_based --direct=0 --filename=hot_data.bin --ioengine=mmap \
> > --fadvise_hint=0 --group_reporting --numjobs=1 > fio.stats
> > ) &
> > WORKLOAD_PID=$!
> >
> > # 4. Waiting for hot data to warm up
> > sleep 30
> > BASE_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
> >
> > # 5. Run the workload for 60 seconds
> > sleep 60
> >
> > # 6. Report refault and IO bandwidth
> > FINAL_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
> > FINAL_D_FILE=$((FINAL_FILE - BASE_FILE))
> > echo "File Refault Delta is $FINAL_D_FILE"
> >
> > kill $WORKLOAD_PID 2>/dev/null
> > sleep 2
> > grep -E "READ|WRITE" fio.stats \
> > | awk '{for(i=1;i<=NF;i++){if($i ~ /^bw=/) bw=$i; if($i ~ /^io=/) io=$i} print $1, bw, io}'
> > rm -f hot_data.bin fio.stats
> >
> > Without the patch, we observed 12883855 file refaults and a very low
> > bandwidth of 58.5 MiB/s, because prefaulted but unused pages occupy
> > hot positions, continuously pushing out the real working set and
> > causing incorrect reclaim. With the patch, we observed 0 refaults
> > and bandwidth increased to 5078 MiB/s.
> >
> > Note that this patch does not benefit any platform other than arm64,
> > since commit 315d09bf30c2 ("Revert "mm: make faultaround produce old
> > ptes"") reverted the change that made prefault PTEs “old”, after it
> > was identified as the cause of a ~6% regression in UnixBench on x86.
> > This was due to reports that x86 uses an internal microfault mechanism
> > for HW AF. The hardware access flag mechanism is relatively expensive
> > and can lead to a ~6% UnixBench regression when prefaulted PTEs are
> > not marked young directly in the page fault path, especially when
> > UnixBench runs without any memory pressure[2].
> >
> > Thanks to Will for raising this for arm64—“Create ‘old’ PTEs for
> > faultaround mappings on arm64 with hardware access flag” [3].
> > This is also thanks to arm64 microarchitectures, which incur zero cost
> > for HW AF handling.
> >
> > It may be time for x86 and other architectures to revisit
> > whether HW AF is truly costly on their platforms, given that
> > the original x86 regression was reported 10 years ago.
> >
> > For those who want to try the model on x86, you will need the
> > following in arch/x86/include/asm/pgtable.h.
> >
> > #define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
> > static inline bool arch_wants_old_prefaulted_pte(void)
> > {
> > return true;
> > }
> >
> > Lance and Xueyuan made a huge contribution to this patch
> > through testing. They truly worked over weekends and after
> > work hours. If this patch deserves any credit, it belongs to
> > them.
> >
> > [1] https://lore.kernel.org/linux-mm/20250916072226.220426-1-liulei.rjpt@xxxxxxxx/
> > [2] https://lore.kernel.org/lkml/20160606022724.GA26227@yexl-desktop/
> > [3] https://lore.kernel.org/lkml/20210120173612.20913-1-will@xxxxxxxxxx/
> > Tested-by: Lance Yang <lance.yang@xxxxxxxxx>
> > Tested-by: Xueyuan Chen <xueyuan.chen21@xxxxxxxxx>
> > Cc: Kairui Song <kasong@xxxxxxxxxxx>
> > Cc: Qi Zheng <qi.zheng@xxxxxxxxx>
> > Cc: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> > Cc: wangzicheng <wangzicheng@xxxxxxxxx>
> > Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> > Cc: Lei Liu <liulei.rjpt@xxxxxxxx>
> > Cc: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> > Cc: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
> > Cc: Yuanchu Xie <yuanchu@xxxxxxxxxx>
> > Cc: Wei Xu <weixugc@xxxxxxxxxx>
> > Cc: Will Deacon <will@xxxxxxxxxx>
> > Signed-off-by: Barry Song (Xiaomi) <baohua@xxxxxxxxxx>
> > ---
> > -rfc was:
> > [PATCH RFC] mm/mglru: lazily activate folios while folios are really mapped
> > https://lore.kernel.org/linux-mm/20260225212642.15219-1-21cnbao@xxxxxxxxx/
> >
> > mm/swap.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 5cc44f0de987..e3cf703ccb89 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -512,7 +512,7 @@ void folio_add_lru(struct folio *folio)
> > /* see the comment in lru_gen_folio_seq() */
> > if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> > lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> > - folio_set_active(folio);
> > + folio_mark_accessed(folio);
>
> Hi Barry,
>
> Sorry I haven't checked everything yet, but just a naive idea: What if
> we just remove this whole lru_gen_* check chunk here? Only keep the

Do you mean the below?

index 5cc44f0de987..499ad49c1b51 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -509,11 +509,6 @@ void folio_add_lru(struct folio *folio)
 			folio_test_unevictable(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

-	/* see the comment in lru_gen_folio_seq() */
-	if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
-	    lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
-		folio_set_active(folio);
-
 	folio_batch_add_and_move(folio, lru_add);
 }
 EXPORT_SYMBOL(folio_add_lru);

If so, this essentially resembles the active/inactive LRU. But I
assume Yu Zhao’s earlier point about mmapped folio access still
has some merit? The problem, however, is that readahead and
prefaulting may have made this assumption less accurate, since
being mmapped doesn’t necessarily mean the user actually wants
to access the folio.

Without folio_mark_accessed() here, we would need two scans to
confirm that an mmapped folio is active. This seems reasonable to me
on platforms other than arm64, since they always set the access
flag in prefaulted PTEs. The first scan would clear the
prefaulted access flag (which is fake), and the second scan
would confirm that the folio was actually accessed.

But for arm64, where prefaulted PTEs are already old and there is no
fake access flag to clear, it seems we might slightly penalize
PTE-mapped folios?
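
To make the two-scan argument concrete, here is a minimal userspace
sketch. It is a toy model, not kernel code, and every toy_* name is
made up for illustration. It models a single PTE access flag, a
prefault that leaves the flag either set or clear, and a scan that
tests and clears it:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: one access flag standing in for a PTE young bit. */
struct toy_pte {
	bool young;
};

/*
 * Prefault maps the page without any userspace access. With old
 * prefaulted PTEs the flag starts clear; otherwise it starts set even
 * though nothing has touched the page yet.
 */
static struct toy_pte toy_prefault(bool arch_wants_old)
{
	struct toy_pte pte = { .young = !arch_wants_old };

	return pte;
}

/* A real userspace access: hardware sets the access flag. */
static void toy_touch(struct toy_pte *pte)
{
	pte->young = true;
}

/* One reclaim scan: report the flag, then clear it. */
static bool toy_scan(struct toy_pte *pte)
{
	bool was_young = pte->young;

	pte->young = false;
	return was_young;
}

/*
 * For a page that is prefaulted but never touched, count the scans
 * needed until a scan finally reports it as not accessed.
 */
static int toy_scans_until_cold(bool arch_wants_old)
{
	struct toy_pte pte = toy_prefault(arch_wants_old);
	int scans = 0;

	do {
		scans++;
		assert(scans <= 2);	/* one scan always clears the flag */
	} while (toy_scan(&pte));

	return scans;
}
```

In this model an untouched page needs two scans to look cold when
prefaulted PTEs start young, and one scan when they start old, which
is the asymmetry between arm64 and the other platforms described above.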

I mean, I’m at least convinced the following might be correct:

@@ -509,10 +511,14 @@ void folio_add_lru(struct folio *folio)
 			folio_test_unevictable(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

-	/* see the comment in lru_gen_folio_seq() */
-	if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
-	    lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
-		folio_set_active(folio);
+	/*
+	 * For architectures without old prefaulted PTEs, we need a first
+	 * PTE scan to clear the access flag set during prefault, and a second
+	 * scan to confirm the folio is active. For architectures with old
+	 * prefaulted PTEs, we can skip the scan that clears the access flag.
+	 */
+	if (arch_wants_old_prefaulted_pte())
+		folio_mark_accessed(folio);

 	folio_batch_add_and_move(folio, lru_add);
 }

It could also check whether fault-around is disabled, as below,
if that’s not too ugly :-)

+	if (arch_wants_old_prefaulted_pte() || fault_around_bytes == PAGE_SIZE)
+		folio_mark_accessed(folio);
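
The combined condition can be sketched as a userspace predicate. This
is again a toy model with hypothetical toy_* names, and the 4096-byte
TOY_PAGE_SIZE is an assumption. The idea is that a young PTE seen by
the first scan can only be trusted when no prefaulted PTE could have
been created young:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096UL	/* assumed page size for the model */

/*
 * Returns true when marking the folio accessed at fault time would be
 * justified under the proposal: either prefaulted PTEs are created
 * old, or fault-around maps nothing beyond the faulting page.
 */
static bool toy_should_mark_accessed(bool arch_wants_old,
				     unsigned long fault_around_bytes)
{
	/* in this model fault-around never shrinks below one page */
	assert(fault_around_bytes >= TOY_PAGE_SIZE);

	return arch_wants_old || fault_around_bytes == TOY_PAGE_SIZE;
}
```

With a 64KB fault-around window and young prefaulted PTEs, the
predicate is false and the folio has to earn its promotion through
the two scans.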

I suspect the above code also fixes, on x86, the fio workload problem
I described in the changelog. Let me queue it for testing.

BTW, it seems we can also simplify set_pte_range(). The prefault check
feels quite useless to me; we could just let folio_referenced() do
one extra scan.

diff --git a/mm/memory.c b/mm/memory.c
index ea6568571131..bee58a8fee0a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5593,13 +5593,12 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio,
 {
 	struct vm_area_struct *vma = vmf->vma;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
-	bool prefault = !in_range(vmf->address, addr, nr * PAGE_SIZE);
 	pte_t entry;

 	flush_icache_pages(vma, page, nr);
 	entry = mk_pte(page, vma->vm_page_prot);

-	if (prefault && arch_wants_old_prefaulted_pte())
+	if (arch_wants_old_prefaulted_pte())
 		entry = pte_mkold(entry);
 	else
 		entry = pte_sw_mkyoung(entry);
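
A toy userspace model of the young/old decision before and after this
change may help when reading the diff. The toy_* names are
hypothetical and this is not kernel code; toy_in_range() only mimics
the unsigned range check that in_range() performs, and the 4096-byte
page size is an assumption:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096UL	/* assumed page size for the model */

/* True if val lies in [start, start + len), using unsigned wraparound. */
static bool toy_in_range(unsigned long val, unsigned long start,
			 unsigned long len)
{
	return val - start < len;
}

/* Current logic: only PTEs outside the faulting range may be made old. */
static bool toy_young_before(bool arch_wants_old, unsigned long fault_addr,
			     unsigned long addr, unsigned int nr)
{
	bool prefault;

	assert(nr > 0);
	prefault = !toy_in_range(fault_addr, addr, nr * TOY_PAGE_SIZE);
	return !(prefault && arch_wants_old);
}

/* Proposed logic: the architecture hook alone decides. */
static bool toy_young_after(bool arch_wants_old)
{
	return !arch_wants_old;
}
```

In this model the only difference is the range containing the faulting
address on an architecture that wants old prefaulted PTEs: it is young
today and would start old with the change, leaving it to the hardware
access flag to mark the first real access.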


> one in workingset.c to do the folio_set_active so refaulted folios are
> promoted like classical LRU, I have a series to restore the refault
> distance based activation for MGLRU:
> https://lwn.net/Articles/945266/
>
> That series from me above is a bit buggy, but easy to fix, I can
> resend it. Some workloads benefit a lot from it, like the one in the
> cover letter. And the latest MGLRU is still not performing well with
> these workloads.
>
> Is there any evidence that folios that are allocated through fault are
> always frequently used folios? Because classical LRU has the exact
> opposite assumption on that. Refault distance based activation is more
> battle tested (I'm not saying that is absolutely right though).

I agree with this. I’m also queuing some code for testing; it
checks whether reclamation has occurred very recently and,
if so, sets the folios active:
https://lore.kernel.org/linux-mm/20260428013520.47417-1-baohua@xxxxxxxxxx/

So basically we’re on the same page, just taking slightly different
approaches to checking recency during refault?

>
> Will the performance be worse or better if we remove this activation
> here, and instead only do the activation through folio_mark_accessed
> (not right now, see below), page table walk, and refault distance
> checking?
>

As explained above, I think it is probably sensible to remove the
chunk for x86, but not for arm64.

> Oh and, right now MGLRU performs badly on some workloads because
> folio_mark_accessed never activates a folio, which can also be fixed
> with:
> https://github.com/ryncsn/linux/blob/b4/mglru-lfu/mm/swap.c#L393 (I
> hope I can send it out as RFC if I can finish the benchmark and
> tweaking before LSFMM but sorry for now I'll just share this link...)

Thanks, I’d be glad to read it once you post the RFC.

>
> This is the LSFMMBPF topic idea I proposed, there folio_mark_accessed
> calls folio_inc_lru_refs which will promote the folio by exactly one
> gen if the access count goes beyond LRU_REFS_MAX, making MGLRU
> frequency-aware and much more proactive on certain workloads. Testing
> with YCSB on the server and using that on my phone are both looking
> great.
>
> It also removes the forced protection on the eviction path (that "if (refs +
> workingset != BIT(LRU_REFS_WIDTH) + 1)" check, which was added about a
> year or two after the first MGLRU release). That forced
> protection is causing trouble too, because some cold folios with a high
> historical access count get stuck in the LRU for a bit longer.
>
> In general I think it might be a good idea to weaken or maybe just
> remove this activation here. Need some time to discuss and verify
> though.

Yep, many thanks for your points.

Thanks
Barry