Re: [PATCH v3 2/2] ksm: Optimize rmap_walk_ksm by passing a suitable address range

From: Hugh Dickins

Date: Sun Apr 05 2026 - 00:44:36 EST


On Thu, 12 Feb 2026, xu.xin16@xxxxxxxxxx wrote:

> From: xu xin <xu.xin16@xxxxxxxxxx>
>
> Problem
> =======
> When available memory is extremely tight, causing KSM pages to be swapped
> out, or when there is significant memory fragmentation and THP triggers
> memory compaction, the system will invoke the rmap_walk_ksm function to
> perform reverse mapping. However, we observed that this function becomes
> particularly time-consuming when a large number of VMAs (e.g., 20,000)
> share the same anon_vma. Through debug trace analysis, we found that most
> of the latency occurs within anon_vma_interval_tree_foreach, leading to an
> excessively long hold time on the anon_vma lock (even reaching 500ms or
> more), which in turn causes upper-layer applications (waiting for the
> anon_vma lock) to be blocked for extended periods.
>
> Root Cause
> ==========
> Further investigation revealed that 99.9% of iterations inside the
> anon_vma_interval_tree_foreach loop are skipped due to the first check
> "if (addr < vma->vm_start || addr >= vma->vm_end)), indicating that a large
> number of loop iterations are ineffective. This inefficiency arises because
> the pgoff_start and pgoff_end parameters passed to
> anon_vma_interval_tree_foreach span the entire address space from 0 to
> ULONG_MAX, resulting in very poor loop efficiency.
>
> Solution
> ========
> In fact, we can significantly improve performance by passing a more precise
> range based on the given addr. Since the original pages merged by KSM
> correspond to anonymous VMAs, the page offset can be calculated as
> pgoff = address >> PAGE_SHIFT. Therefore, we can optimize the call by
> defining:
>
> pgoff = rmap_item->address >> PAGE_SHIFT;
>
> Performance
> ===========
> In our real embedded Linux environment, the measured metrics were as
> follows:
>
> 1) Time_ms: Max time the anon_vma lock is held in a single rmap_walk_ksm.
> 2) Nr_iteration_total: Max number of iterations of the
> anon_vma_interval_tree_foreach loop.
> 3) Skip_addr_out_of_range: Max number of iterations skipped by the first
> check (vma->vm_start and vma->vm_end) in the anon_vma_interval_tree_foreach loop.
> 4) Skip_mm_mismatch: Max number of iterations skipped by the second check
> (rmap_item->mm == vma->vm_mm) in the anon_vma_interval_tree_foreach loop.
>
> The result is as follows:
>
>          Time_ms  Nr_iteration_total  Skip_addr_out_of_range  Skip_mm_mismatch
> Before:   228.65               22169                   22168                 0
> After :    0.396                   3                       0                 2
>
> The referenced reproducer of rmap_walk_ksm can be found at:
> https://lore.kernel.org/all/20260206151424734QIyWL_pA-1QeJPbJlUxsO@xxxxxxxxxx/
>
> Co-developed-by: Wang Yaxin <wang.yaxin@xxxxxxxxxx>
> Signed-off-by: Wang Yaxin <wang.yaxin@xxxxxxxxxx>
> Signed-off-by: xu xin <xu.xin16@xxxxxxxxxx>

This is a very attractive speedup, but I believe it's flawed: in the
special case when a range has been mremap-moved, its anon folio
indexes and anon_vma pgoff correspond to the original user address,
not to the current user address.

In which case, rmap_walk_ksm() will be unable to find all the PTEs
for that KSM folio, which will consequently be pinned in memory -
unable to be reclaimed, unable to be migrated, unable to be hotremoved,
until it's finally unmapped or KSM disabled.

But it's years since I worked on KSM or on anon_vma, so I may be confused
and my belief wrong. I have tried to test it, and my testcase did appear
to show 7.0-rc6 successfully swapping out even mremap-moved KSM folios,
but mm.git failing to do so. However, I say "appear to show" because I
found swapping out any KSM pages harder than I'd been expecting, so I have
some doubts about my testing. Let me give more detail on that at the
bottom of this mail: it's a tangent which had better not distract from
your speedup.

If I'm right that your patch is flawed, what to do?

Perhaps there is, or could be, a cleverer way for KSM to walk the anon_vma
interval tree, which can handle the mremap-moved pgoffs appropriately.
Cc'ing Michel, whose bf181b9f9d8d ("mm anon rmap: replace same_anon_vma
linked list with an interval tree.") specifically chose the 0, ULONG_MAX
which you are replacing.

Cc'ing Lorenzo, who is currently considering replacing anon_vma by
something more like my anonmm, which preceded Andrea's anon_vma in 2.6.7,
but supplementing it with the mremap tracking which defeated me.
This rmap_walk_ksm() might well benefit from his approach. (I'm not
actually expecting any input from Lorenzo here, or Michel: more FYIs.)

But more realistic in the short term, might be for you to keep your
optimization, but fix the lookup, by keeping a count of PTEs found,
and when that falls short, take a second pass with 0, ULONG_MAX.
Somewhat ugly, certainly imperfect, but good enough for now.

More comment on KSM swapout below...

> ---
> mm/ksm.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 950e122bcbf4..7b974f333391 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -3170,6 +3170,7 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>  	hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
>  		/* Ignore the stable/unstable/sqnr flags */
>  		const unsigned long addr = rmap_item->address & PAGE_MASK;
> +		const pgoff_t pgoff = rmap_item->address >> PAGE_SHIFT;
>  		struct anon_vma *anon_vma = rmap_item->anon_vma;
>  		struct anon_vma_chain *vmac;
>  		struct vm_area_struct *vma;
> @@ -3183,8 +3184,12 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>  			anon_vma_lock_read(anon_vma);
>  		}
> 
> +		/*
> +		 * Currently KSM folios are order-0 normal pages, so pgoff_end
> +		 * should be the same as pgoff_start.
> +		 */
>  		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
> -				0, ULONG_MAX) {
> +				pgoff, pgoff) {
> 
>  			cond_resched();
>  			vma = vmac->vma;
> --
> 2.25.1

Unrelated to this patch, but when I tried to test KSM swapout (even
without mremap), it first appeared not to be working. Quite likely
my testcase was too simple and naive, not indicating any problem in
real world usage. But checking back on much older kernels, I did
find that 5.8 swapped KSM as I was expecting, 5.9 not.

Bisected to commit b518154e59aa ("mm/vmscan: protect the workingset
on anonymous LRU"), the one which changed all those
lru_cache_add_active_or_unevictable()s to
lru_cache_add_inactive_or_unevictable()s.

I rather think that mm/ksm.c should have been updated at that time.
Here's the patch I went on to use in testing the mremap question
(I still had to do more memhogging than 5.8 had needed, but that's
probably just reflective of what that commit was intended to fix).

I'm not saying the below is the right patch (it would probably be
better to replicate the existing flags); but throw it out there
for someone more immersed in KSM to pick up and improve upon.

Hugh

--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1422,7 +1422,7 @@ static int replace_page(struct vm_area_s
 	if (!is_zero_pfn(page_to_pfn(kpage))) {
 		folio_get(kfolio);
 		folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
-		newpte = mk_pte(kpage, vma->vm_page_prot);
+		newpte = pte_mkold(mk_pte(kpage, vma->vm_page_prot));
 	} else {
 		/*
 		 * Use pte_mkdirty to mark the zero page mapped by KSM, and then
@@ -1514,7 +1514,7 @@ static int try_to_merge_one_page(struct
 	 * stable_tree_insert() will update stable_node.
 	 */
 	folio_set_stable_node(folio, NULL);
-	folio_mark_accessed(folio);
+//	folio_mark_accessed(folio);
 	/*
 	 * Page reclaim just frees a clean folio with no dirty
 	 * ptes: make sure that the ksm page would be swapped.