Re: [PATCH v8 2/3] ksm: Optimize rmap_walk_ksm by passing a suitable page index
From: David Hildenbrand (Arm)
Date: Tue Jun 09 2026 - 04:31:40 EST
On 6/9/26 06:47, xu.xin16@xxxxxxxxxx wrote:
> From: xu xin <xu.xin16@xxxxxxxxxx>
>
> User impact / Why this matters to Linux users
> =============================================
> When a system runs with KSM enabled and memory becomes tight, KSM pages
> may be swapped out or migrated. The kernel then performs a reverse map
> walk by rmap_walk_ksm to locate all page table entries that reference
> these pages. If A large number of unrelated VMAs can attach to a single
> anon_vma related with this KSM page, then rmap_walk might be severe
> performance bottleneck. In our embedded test environment, we observed
> ~20,000 VMAs sharing one anon_vma without any fork – purely from VMA
> splits, which cause 200~700ms duration of rmap_walk_ksm.
>
> When one of those VMAs mapped a KSM page, then this KSM page's rmapping
> will become bottleneck with hold its anon_vma lock for a long time. The
> anon_vma lock is not only used by KSM; it is a core lock protecting the
> VMA interval tree and is acquired by many critical memory operations:
>
> • Page faults: do_anonymous_page(), do_wp_page() (during COW)
> • Memory reclaim: try_to_unmap()
> • Page migration & compaction: migrate_pages(), compact_zone()
> • mlock / munlock: mlock_fixup()
> • Process exit: exit_mmap() (tearing down VMAs)
> • Cgroup memory accounting: mem_cgroup_move_charge()
>
> If one thread holds the anon_vma lock for hundreds of milliseconds
> because of an inefficient KSM rmap walk, any other thread that tries to
> acquire the same lock (e.g., an application taking a page fault, kswapd
> reclaiming pages, or a migration thread) will block. This leads to
> stalled application threads, increased latency spikes, and in extreme
> cases container timeouts or watchdog triggers.
>
> This patch reduces the worst-case anon_vma lock hold time during KSM
> rmap walk from >500 ms to <1 ms, thereby almost eliminating this
> source of lock contention and improving system responsiveness under
> memory pressure.
>
> Real-world examples:
> ====================
> - JVM / Go runtime: These use mmap for heap regions and later call
> mprotect(PROT_NONE) for garbage collection barriers or guard pages,
> splitting the original VMA into thousands of small pieces over time.
>
> - Database engines (MySQL, PostgreSQL): Large shared memory buffers
> or anonymous mappings are managed with madvise(MADV_DONTNEED) to
> release specific pages, which also splits VMAs.
>
> * Why the benchmark numbers are realistic: We observed ~20,000 VMAs
The "*" at the start looks odd (the list above used "-", and I assume this
should be separate from the list).
> sharing one anon_vma on a production system running a Java application
> with KSM enabled. The lock hold time before the patch was measured at
> 228 ms (max) during rmap walks triggered by memory compaction and page
> migration. The benchmark reproduces that VMA count and lock‑hold
> behavior in a controlled environment.
>
> Root Cause
> ==========
> Through local debugging trace analysis, we found that most of the latency
> of rmap_walk_ksm occurs within anon_vma_interval_tree_foreach(), leading
> to an excessively long hold time on the anon_vma lock (even reaching 500ms
> or more), which in turn causes upper-layer applications (waiting for the
> anon_vma lock) to be blocked for extended periods.
>
> Further investigation revealed that 99.9% of iterations inside the
> anon_vma_interval_tree_foreach loop are skipped due to the first check
> "if (addr < vma->vm_start || addr >= vma->vm_end)), indicating that a large
> number of loop iterations are ineffective. This inefficiency arises because
> the start page index and the end page index parameters passed to
> anon_vma_interval_tree_foreach span the entire address space from 0 to
> ULONG_MAX, resulting in very poor loop efficiency.
>
> Solution
> ========
> We cannot rely solely on anon_vma to locate all PTEs mapping this page
> but also need to have the original page's linear_page_index. Since the
> implementation of anon_vma_interval_tree_foreach — it essentially
> iterates to find a suitable VMA such that the provided page index falls
> within the candidate's vm_pgoff range.
>
> vm_pgoff <= original linear page offset <= (vm_pgoff + vma_pages(v) - 1)
>
> Fortunately, we have already linear_page_index. in ksm_rmap_item in the
Misplaced "."
> previos patch of series, so that we use it to get the index to accelerate
> the searching.
Avoid talking a about "previous patch" in a series. Something like:
"Fortunately, an earlier commit introduced the linear_page_index to struct
ksm_rmap_item, allowing for optimizing the RMAP walk."
>
> Test results
> ============
> A rmap testbench can be obtained with two Out-Of-Tree patches at [1][2].
> After applying the OOT patches and building rmap_benchmark from:
> tools/testing/rmap/rmap_benchmark.c, we can start the performance test.
>
> The testing result in QEMU is shown as follows:
>
> KSM rmapping Maximum duration Average duration
>
> Before: 705.12 ms (705119858 ns) 532.04 ms (532041586 ns)
> After: 1.67 ms (1665917 ns) 1.44 ms (1443784 ns)
>
> [1] https://lore.kernel.org/all/202605301703094695zmVgcSC27BNR0rH0N8_x@xxxxxxxxxx
> [2] https://lore.kernel.org/all/20260530170404509QpJmBtpSjn3uQHeVKA2iA@xxxxxxxxxx/
>
> Co-developed-by: Wang Yaxin <wang.yaxin@xxxxxxxxxx>
> Signed-off-by: Wang Yaxin <wang.yaxin@xxxxxxxxxx>
> Signed-off-by: xu xin <xu.xin16@xxxxxxxxxx>
> ---
> mm/ksm.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index e0ba29e3c0a4..9e1879d96751 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -3208,6 +3208,7 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
> hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
> /* Ignore the stable/unstable/sqnr flags */
> const unsigned long addr = rmap_item->address & PAGE_MASK;
> + const unsigned long index = rmap_item->linear_page_index;
remove_rmap_item_from_tree() does the hlist_del(&rmap_item->hlist); once we
clear STABLE_FLAG + rmap_item->linear_page_index.
So, this should always be valid (just like rmap_item->anon_vma). Good.
> struct anon_vma *anon_vma = rmap_item->anon_vma;
> struct anon_vma_chain *vmac;
> struct vm_area_struct *vma;
> @@ -3221,8 +3222,12 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
> anon_vma_lock_read(anon_vma);
> }
>
> + /*
> + * Currently KSM folios are order-0 normal pages, so the end
> + * page's index should be the same as the start page's index.
Maybe best to say "Currently, KSM folios are always small folios, so it's
sufficient to search for a single page."
I don't even want to imagine how large-folio support could look like, and how it
interacts with rmap_items :/ Let's hope we'll never have to go there, and if so,
this code here will be the least of our concerns.
Do we want to explain a bit how this works and on which properties this relies on?
"We can simply use the linear_page_index of the de-duplicated anonymous page
that we remembered in the rmap_item while de-duplicating. Note that
mremap() always de-duplicates KSM folios: so if there was mremap() in our parent
or our child, we wouldn't have the KSM folio mapped in these processes anymore."
> + */
> anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
> - 0, ULONG_MAX) {
> + index, index) {
>
> cond_resched();
> vma = vmac->vma;
I hope we're not missing some other weird corner cases. Sashiko seems to be
happy, but I don't trust that ;)
Hoping other people can also give this another look before we move this upstream.
Acked-by: David Hildenbrand (Arm) <david@xxxxxxxxxx>
--
Cheers,
David