[PATCH v4 4/5] ksm: Optimize rmap_walk_ksm by passing a suitable address range

Date: Sun May 03 2026 - 08:50:30 EST


From: xu xin <xu.xin16@xxxxxxxxxx>

Problem
=======
When available memory is extremely tight, causing KSM pages to be swapped
out, or when there is significant memory fragmentation and THP triggers
memory compaction, the system will invoke the rmap_walk_ksm function to
perform reverse mapping. However, we observed that this function becomes
particularly time-consuming when a large number of VMAs (e.g., 20,000)
share the same anon_vma. Through debug trace analysis, we found that most
of the latency occurs within anon_vma_interval_tree_foreach, leading to an
excessively long hold time on the anon_vma lock (500ms or more), which in
turn blocks upper-layer applications waiting for that lock for extended
periods.

Root Cause
==========
Further investigation revealed that 99.9% of the iterations inside the
anon_vma_interval_tree_foreach loop are skipped by the first check,
"if (addr < vma->vm_start || addr >= vma->vm_end)", indicating that the
vast majority of loop iterations are ineffective. This inefficiency arises
because the pgoff_start and pgoff_end parameters passed to
anon_vma_interval_tree_foreach span the entire pgoff range from 0 to
ULONG_MAX, so every VMA chained on the anon_vma is visited.
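
For context, here is a simplified sketch of the pre-patch walk (paraphrased
from the rmap_walk_ksm loop, not verbatim mm/ksm.c); with the unbounded
range, the interval tree degenerates into a full scan of every chained VMA:

	anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
				       0, ULONG_MAX) {
		cond_resched();
		vma = vmac->vma;

		/*
		 * With the [0, ULONG_MAX] pgoff range every VMA chained
		 * on this anon_vma is visited; for a heavily shared
		 * anon_vma almost all of them fail this address check,
		 * wasting the whole iteration.
		 */
		if (addr < vma->vm_start || addr >= vma->vm_end)
			continue;

		/* ... invoke rwc->rmap_one() on the matching VMA ... */
	}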

Solution
========
We cannot rely solely on the anon_vma to locate all PTEs mapping this
page; we also need the original page's pgoff. In fact, the original
vma->vm_pgoff alone is enough: anon_vma_interval_tree_foreach essentially
iterates to find the VMAs whose pgoff range contains the provided pgoff,
i.e. those satisfying

vm_pgoff <= pgoff_parameter <= (vm_pgoff + vma_pages(vma) - 1)

Fortunately, vm_pgoff was already added to ksm_rmap_item in the previous
patch of this series, so we can use it to narrow the search and accelerate
the walk.
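
To make the match condition concrete, here is a hedged sketch of the
per-node overlap test the interval tree applies (modeled on
anon_vma_interval_tree_foreach / INTERVAL_TREE_DEFINE; vma_matches is a
hypothetical helper for illustration, not a kernel function):

	/*
	 * A VMA is visited by anon_vma_interval_tree_foreach(vmac, root,
	 * start, last) iff its pgoff interval intersects [start, last].
	 */
	static inline bool vma_matches(struct vm_area_struct *vma,
				       unsigned long start, unsigned long last)
	{
		unsigned long vm_pgoff_last = vma->vm_pgoff + vma_pages(vma) - 1;

		return vma->vm_pgoff <= last && start <= vm_pgoff_last;
	}

With start == last == rmap_item->vm_pgoff, only the handful of VMAs whose
pgoff range contains that single page are visited, instead of all (e.g.,
20,000) VMAs chained on the anon_vma.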

Performance
===========
In our real embedded Linux environment, the measured metrics were as
follows:

1) Time_ms: Maximum time the anon_vma lock is held during a single
rmap_walk_ksm.
2) Nr_iteration_total: Maximum number of iterations of the
anon_vma_interval_tree_foreach loop.
3) Skip_addr_out_of_range: Maximum number of iterations skipped by the
first check (addr against vma->vm_start/vma->vm_end).
4) Skip_mm_mismatch: Maximum number of iterations skipped by the second
check (rmap_item->mm == vma->vm_mm).

The result is shown as follows:

        Time_ms   Nr_iteration_total   Skip_addr_out_of_range   Skip_mm_mismatch
Before: 228.65    22169                22168                    0
After:  0.396     3                    0                        2

We also provide an rmap testbench: tools/testing/rmap/rmap_benchmark.c
The test results in QEMU are as follows:

        Maximum duration             Average duration
Before: 705.12 ms (705119858 ns)     532.04 ms (532041586 ns)
After:  1.67 ms (1665917 ns)         1.44 ms (1443784 ns)

Co-developed-by: Wang Yaxin <wang.yaxin@xxxxxxxxxx>
Signed-off-by: Wang Yaxin <wang.yaxin@xxxxxxxxxx>
Signed-off-by: xu xin <xu.xin16@xxxxxxxxxx>
---
mm/ksm.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 0299a53ba7c9..a13184d00759 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3200,6 +3200,7 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
/* Ignore the stable/unstable/sqnr flags */
const unsigned long addr = rmap_item->address & PAGE_MASK;
+ const unsigned long vm_pgoff = rmap_item->vm_pgoff;
struct anon_vma *anon_vma = rmap_item->anon_vma;
struct anon_vma_chain *vmac;
struct vm_area_struct *vma;
@@ -3213,8 +3214,12 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
anon_vma_lock_read(anon_vma);
}

+ /*
+ * Currently KSM folios are order-0 normal pages, so pgoff_end
+ * should be the same as pgoff_start.
+ */
anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
- 0, ULONG_MAX) {
+ vm_pgoff, vm_pgoff) {

cond_resched();
vma = vmac->vma;
--
2.25.1