Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
From: Kairui Song
Date: Tue Apr 28 2026 - 14:56:19 EST
On Sat, Apr 18, 2026 at 8:03 PM Barry Song (Xiaomi) <baohua@xxxxxxxxxx> wrote:
>
> MGLRU gives high priority to folios mapped in page tables.
> As a result, folio_set_active() is invoked for all folios
> read during page faults. In practice, however, readahead
> can bring in many folios that are never accessed via page
> tables.
>
> A previous attempt by Lei Liu proposed introducing a separate
> LRU for readahead[1] to make readahead pages easier to reclaim,
> but that approach is likely over-engineered.
>
> Before commit 4d5d14a01e2c ("mm/mglru: rework workingset
> protection"), folios with PG_active were always placed in
> the youngest generation, leading to over-protection and
> increased refaults. After that commit, PG_active folios
> are placed in the second youngest generation, which is
> still too optimistic given the presence of readahead. In
> contrast, the classic active/inactive scheme is more
> conservative.
>
> This patch switches to folio_mark_accessed(). If
> folio_check_references() later detects referenced PTEs,
> the folio will be promoted based on the reference flag
> set by folio_mark_accessed().
>
> The following uses a simple model to demonstrate why the current
> code is not ideal. It runs fio-3.42 in a memcg, reading a file in a
> strided pattern—4KB every 64KB—to simulate prefaulted pages that may
> not be accessed.
>
> #!/bin/bash
>
> CG_NAME="mglru_verify_test"
> CG_PATH="/sys/fs/cgroup/$CG_NAME"
> MEM_LIMIT="400M"
> HOT_SIZE="600M"
>
> # 1. Environment Setup
> sudo rmdir "$CG_PATH" 2>/dev/null
> sudo mkdir -p "$CG_PATH"
> sudo chown -R $USER:$USER "$CG_PATH"
> echo "$MEM_LIMIT" > "$CG_PATH/memory.max"
>
> # 2. Prepare Data Files
> dd if=/dev/urandom of=hot_data.bin bs=1M count=600 conv=notrunc 2>/dev/null
> sync
> echo 3 > /proc/sys/vm/drop_caches
>
> # 3. Start Workload (Working Set)
> (
> echo $BASHPID > "$CG_PATH/cgroup.procs"
> exec ./fio-3.42 --name=hot_ws --rw=read --bs=4K --size=$HOT_SIZE --runtime=600 \
> --zonemode=strided --zonesize=4K --zonerange=64K \
> --time_based --direct=0 --filename=hot_data.bin --ioengine=mmap \
> --fadvise_hint=0 --group_reporting --numjobs=1 > fio.stats
> ) &
> WORKLOAD_PID=$!
>
> # 4. Waiting for hot data to warm up
> sleep 30
> BASE_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
>
> # 5. Running workload for 60second
> sleep 60
>
> # 6. Report refault and IO bandwidth
> FINAL_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
> FINAL_D_FILE=$((FINAL_FILE - BASE_FILE))
> echo "File Refault Delta is $FINAL_D_FILE"
>
> kill $WORKLOAD_PID 2>/dev/null
> sleep 2
> grep -E "READ|WRITE" fio.stats \
> | awk '{for(i=1;i<=NF;i++){if($i ~ /^bw=/) bw=$i; if($i ~ /^io=/) io=$i} print $1, bw, io}'
> rm -f hot_data.bin fio.stats
>
> Without the patch, we observed 12883855 file refaults and a very low
> bandwidth of 58.5 MiB/s, because prefaulted but unused pages occupy
> hot positions, continuously pushing out the real working set and
> causing incorrect reclaim. With the patch, we observed 0 refaults
> and bandwidth increased to 5078 MiB/s.
>
> Note that this patch does not benefit any platform other than arm64,
> since commit 315d09bf30c2 ("Revert "mm: make faultaround produce old
> ptes"") reverted the change that made prefault PTEs “old”, after it
> was identified as the cause of a ~6% regression in UnixBench on x86.
> This was due to reports that x86 uses an internal microfault mechanism
> for HW AF. The hardware access flag mechanism is relatively expensive
> and can lead to a ~6% UnixBench regression when prefaulted PTEs are
> not marked young directly in the page fault path, especially when
> UnixBench runs without any memory pressure[2].
>
> Thanks to Will for raising this for arm64—“Create ‘old’ PTEs for
> faultaround mappings on arm64 with hardware access flag” [3].
> This is also thanks to arm64 microarchitectures, which incur zero cost
> for HW AF handling.
>
> It may be time for x86 and other architectures to revisit
> whether HW AF is truly costly on their platforms, given that
> the original x86 regression was reported 10 years ago.
>
> For those who want to try the model on x86, you will need the
> following in arch/x86/include/asm/pgtable.h.
>
> #define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
> static inline bool arch_wants_old_prefaulted_pte(void)
> {
> return true;
> }
>
> Lance and Xueyuan made a huge contribution to this patch
> through testing. They truly worked over weekends and after
> work hours. If this patch deserves any credit, it belongs to
> them.
>
> [1] https://lore.kernel.org/linux-mm/20250916072226.220426-1-liulei.rjpt@xxxxxxxx/
> [2] https://lore.kernel.org/lkml/20160606022724.GA26227@yexl-desktop/
> [3] https://lore.kernel.org/lkml/20210120173612.20913-1-will@xxxxxxxxxx/
> Tested-by: Lance Yang <lance.yang@xxxxxxxxx>
> Tested-by: Xueyuan Chen <xueyuan.chen21@xxxxxxxxx>
> Cc: Kairui Song <kasong@xxxxxxxxxxx>
> Cc: Qi Zheng <qi.zheng@xxxxxxxxx>
> Cc: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> Cc: wangzicheng <wangzicheng@xxxxxxxxx>
> Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> Cc: Lei Liu <liulei.rjpt@xxxxxxxx>
> Cc: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> Cc: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
> Cc: Yuanchu Xie <yuanchu@xxxxxxxxxx>
> Cc: Wei Xu <weixugc@xxxxxxxxxx>
> Cc: Will Deacon <will@xxxxxxxxxx>
> Signed-off-by: Barry Song (Xiaomi) <baohua@xxxxxxxxxx>
> ---
> -rfc was:
> [PATCH RFC] mm/mglru: lazily activate folios while folios are really mapped
> https://lore.kernel.org/linux-mm/20260225212642.15219-1-21cnbao@xxxxxxxxx/
>
> mm/swap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 5cc44f0de987..e3cf703ccb89 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -512,7 +512,7 @@ void folio_add_lru(struct folio *folio)
> /* see the comment in lru_gen_folio_seq() */
> if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> - folio_set_active(folio);
> + folio_mark_accessed(folio);
Hi Barry,
Sorry, I haven't checked everything yet, but here is a naive idea: what if
we just remove this whole lru_gen_* check chunk here and only keep the
one in workingset.c that does the folio_set_active(), so that refaulted
folios are promoted as in the classical LRU? I have a series that restores
refault distance based activation for MGLRU:
https://lwn.net/Articles/945266/
That series is a bit buggy but easy to fix, and I can resend it. Some
workloads benefit a lot from it, like the one in its cover letter, and
the latest MGLRU still does not perform well with these workloads.
Is there any evidence that folios allocated through page faults are
always frequently used? The classical LRU makes the exact opposite
assumption. Refault distance based activation is more battle tested
(I'm not saying it is absolutely right, though).
Will the performance be worse or better if we remove this activation
here, and instead only do the activation through folio_mark_accessed()
(not right now, see below), page table walks, and refault distance
checking?
Oh, and right now MGLRU performs badly on some workloads because
folio_mark_accessed() never activates a folio, which can also be fixed
with:
https://github.com/ryncsn/linux/blob/b4/mglru-lfu/mm/swap.c#L393 (I
hope I can send it out as an RFC if I can finish the benchmarking and
tweaking before LSFMM, but sorry, for now I'll just share this link...)
This is the LSFMMBPF topic idea I proposed. There, folio_mark_accessed()
calls folio_inc_lru_refs(), which promotes the folio by exactly one
gen if the access count goes beyond LRU_REFS_MAX, making MGLRU
frequency aware and much more proactive on certain workloads. Testing
with YCSB on a server and running it on my phone both look great.
It also removes the forced protection on the eviction path (the "if (refs +
workingset != BIT(LRU_REFS_WIDTH) + 1)" check, which was added about a
year or two after the first MGLRU release). That forced protection
causes trouble too, because some cold folios with a high historical
access count get stuck in the LRU for a bit longer.
In general I think it might be a good idea to weaken or maybe just
remove this activation here. Need some time to discuss and verify
though.