[PATCH v7 0/6] KSM: performance optimizations for rmap_walk_ksm
From: xu.xin16
Date: Sat May 30 2026 - 04:59:28 EST
From: xu xin <xu.xin16@xxxxxxxxxx>
This series fixes a severe KSM reverse-mapping performance problem
that can freeze applications for hundreds of milliseconds under
memory pressure, especially when many unrelated VMAs share a
single anon_vma.
Two key highlights:
1. Lock hold time drops from >500ms to <2ms
- In our benchmark (20,000 VMAs sharing an anon_vma), worst-case
anon_vma lock hold time during KSM rmap walk went from 705ms
down to 1.67ms (max) and 1.44ms (avg).
2. Real user impact
- The anon_vma lock is also acquired by page faults, reclaim,
migration, compaction, mlock, exit_mmap, and cgroup accounting.
- A long hold due to inefficient rmap walks stalls application
threads, causing latency spikes, reduced throughput, or even
container timeouts.
- The problem occurs even without fork() – VMA splitting (e.g.,
via mprotect or madvise over time) can create tens of thousands
of VMAs all attached to the same anon_vma.
Real-world examples:
- JVM / Go runtime: These use mmap for heap regions and later call
mprotect(PROT_NONE) for garbage collection barriers or guard pages,
splitting the original VMA into thousands of small pieces over time.
- Database engines (MySQL, PostgreSQL): Large shared memory buffers
or anonymous mappings are managed with madvise(MADV_DONTNEED) to release
specific pages, which also splits VMAs.
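The VMA-splitting effect described above can be observed from user space.
The following is a minimal sketch, not code from this series: the helper
name count_vmas_after_split() is made up for illustration, and it only
counts /proc/self/maps entries (it cannot inspect the anon_vma itself). It
maps a region, faults it in, sets every other page to PROT_NONE, and
reports how many VMAs then overlap the region.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Illustrative helper (not part of the series): map npages anonymous pages,
 * touch them, then mprotect() every other page to PROT_NONE. Returns the
 * number of /proc/self/maps entries overlapping the region, or -1 on error.
 */
static long count_vmas_after_split(size_t npages)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t len = npages * (size_t)psz;
	unsigned long lo, hi, start, end;
	char line[4096];
	long count = 0;
	char *base;
	FILE *maps;
	size_t i;

	base = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	/* Fault the pages in so the mapping has anonymous memory behind it. */
	memset(base, 1, len);

	/* Changing protection on odd pages forces a VMA split at each boundary. */
	for (i = 1; i < npages; i += 2) {
		if (mprotect(base + i * psz, psz, PROT_NONE) != 0) {
			munmap(base, len);
			return -1;
		}
	}

	maps = fopen("/proc/self/maps", "r");
	if (!maps) {
		munmap(base, len);
		return -1;
	}

	/* Count every mapping whose [start, end) range overlaps our region. */
	lo = (unsigned long)base;
	hi = lo + len;
	while (fgets(line, sizeof(line), maps)) {
		if (sscanf(line, "%lx-%lx", &start, &end) == 2 &&
		    start < hi && end > lo)
			count++;
	}

	fclose(maps);
	munmap(base, len);
	return count;
}
```

Calling count_vmas_after_split(64) on a typical Linux system is expected to
report one VMA per page, since neighbouring pages no longer share the same
protection and cannot be merged. Scaling npages up shows how quickly a
single mapping can turn into thousands of VMAs.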
* Why the benchmark numbers are realistic: We observed ~20,000 VMAs sharing
one anon_vma on a production system running a Java application with KSM
enabled. The lock hold time before the patch was measured at 228 ms (max)
during rmap walks triggered by memory compaction and page migration.
The benchmark reproduces that VMA count and lock‑hold behavior in a
controlled environment.
For systems that do not have thousands of VMAs per anon_vma, the
patch adds negligible overhead (a single pgoff comparison). For systems
that do suffer from this issue, the improvement is dramatic:
1) Worst‑case anon_vma lock hold time drops from hundreds of milliseconds
to under 2 ms.
2) This directly reduces blocking of parallel operations that need the
same lock – page faults, reclaim, migration, compaction, mlock, and
exit_mmap.
End‑users will see lower tail latency (fewer application stalls),
higher throughput under memory pressure, and no more spurious
lockup warnings or container timeouts caused by excessive lock hold
times.
In short: workloads that do not hit this pathological pattern are
unaffected; those that do will see a 100x to 500x reduction in lock
hold times, which translates directly into a more responsive system.
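To illustrate why narrowing the walk range helps, here is a small
user-space model. It is a sketch under assumptions, not the kernel code:
it assumes, based on the patch titles and changelog, that the unpatched
walk queries the whole offset range while the patched walk queries only
the single pgoff stored in the rmap item. The names toy_range_vma,
make_split_vmas() and count_candidates() are hypothetical, and a linear
scan stands in for the kernel's interval tree purely to count matches.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for a VMA: the page-offset interval it covers. */
struct toy_range_vma {
	unsigned long pgoff_first;
	unsigned long pgoff_last;
};

/*
 * Count the VMAs whose interval overlaps [first, last]. Each match is a
 * VMA the walk would have to process while holding the anon_vma lock.
 */
static size_t count_candidates(const struct toy_range_vma *v, size_t n,
			       unsigned long first, unsigned long last)
{
	size_t hits = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (v[i].pgoff_first <= last && v[i].pgoff_last >= first)
			hits++;
	}
	return hits;
}

/* Build n adjacent one-page VMAs, as left behind by heavy VMA splitting. */
static struct toy_range_vma *make_split_vmas(size_t n, unsigned long base_pgoff)
{
	struct toy_range_vma *v = malloc(n * sizeof(*v));
	size_t i;

	if (!v)
		return NULL;
	for (i = 0; i < n; i++) {
		v[i].pgoff_first = base_pgoff + i;
		v[i].pgoff_last = base_pgoff + i;
	}
	return v;
}
```

With 20,000 one-page VMAs, a query over the full range (0 to ~0UL) matches
all 20,000 entries, while a query for a single page offset matches exactly
one. That difference in the number of VMAs processed under the lock is the
effect the series targets.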
Change Log
==========
Changes in v7:
Mainly fixes for the issues pointed out by the AI review at:
https://sashiko.dev/#/patchset/20260522105234715fKI7KSsjC5XpEVMwoV6rI@xxxxxxxxxx
All of the possible flaws it flagged have been corrected following its suggestions.
- Patch 2: Three main changes:
(1) Use COMM-PID filtering during trace parsing to precisely match the right
events.
(2) Graceful handling of single‑NUMA node. trigger_rmap_walk() no longer calls
exit(1) when no other NUMA node is available. It returns an error, allowing
the caller to clean up (disable tracepoints, restore KSM config) before exiting.
(3) Fair comparison of anonymous and file‑backed tests against KSM. Anonymous and
file‑backed tests now use fork() to create thousands of child processes, each
sharing the same physical page via copy‑on‑write (or MAP_SHARED). This ensures
that for all three page types the latency measurement is based on a single
physical page mapped by many VMAs (≈ NR_SHARERS).
- Patch 6: Three main changes:
(1) Fix mapping size tracking after mremap and protect the original pointer on failure.
(2) Use baseline delta comparison to eliminate interference from global KSM counters.
(3) Fix error-code confusion caused by pread/close interactions.
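The baseline delta comparison mentioned for Patch 6 can be sketched as
follows. This is an illustration only and not the selftest code: the
helper names read_counter() and counter_delta() are hypothetical, and which
KSM counter the selftest actually samples is not stated here (a file such
as /sys/kernel/mm/ksm/pages_sharing would be one candidate). The idea is
to record the counter before the test and judge the result by how much it
changed, so that pages merged by other processes do not skew the check.

```c
#include <stdio.h>

/* Read one integer from a sysfs-style file; returns -1 on failure. */
static long read_counter(const char *path)
{
	FILE *f = fopen(path, "r");
	long val;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/*
 * Store in *delta how far the counter moved since the baseline was taken.
 * Returns 0 on success, -1 if the baseline or current value is unusable.
 */
static int counter_delta(const char *path, long baseline, long *delta)
{
	long now = read_counter(path);

	if (baseline < 0 || now < 0)
		return -1;
	*delta = now - baseline;
	return 0;
}
```

A test would call read_counter() once before triggering merging, then
compare the value from counter_delta() with the number of pages it expects
to have merged itself, rather than with the absolute counter value.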
Changes in v6:
- Patch 1: Define a single event class once and instantiate the individual
tracepoints with DEFINE_EVENT, as suggested by the AI review:
https://sashiko.dev/#/patchset/20260519220536792dMIKRMurt3vZ5lXC5pwh8@xxxxxxxxxx
- Patch 2: Three changes suggested by the AI review:
(1) Safe event pairing – Now stores folio and rwc addresses for rmap_walk_start
and matches with the same addresses in rmap_walk_end, eliminating
cross‑thread interference.
(2) KSM configuration preservation – Saves original KSM settings and restores
them after the KSM test, avoiding persistent changes to system behaviour.
(3) Unlink in advance to prevent a potential file leak – unlink(filename) is called
immediately after mkstemp, so the temporary file is automatically removed
even if the program crashes early.
- Patch 3: A separate, standalone patch to update the MAINTAINERS file.
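The unlink-in-advance pattern from item (3) of Patch 2 looks roughly like
the sketch below. It is not taken from rmap_benchmark.c: the helper name
open_unlinked_tmpfile() and the path template are made up for
illustration. It relies only on standard mkstemp() and unlink() behaviour.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>

/*
 * Create a temporary file and remove its name immediately. The returned
 * descriptor remains usable; the file's storage is released when the last
 * descriptor is closed, including when the process exits abnormally.
 * Returns -1 on failure.
 */
static int open_unlinked_tmpfile(void)
{
	char path[] = "/tmp/rmap_demo_XXXXXX";	/* hypothetical template */
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;

	/* Drop the directory entry now; the open fd keeps the data alive. */
	unlink(path);
	return fd;
}
```

Because the name is removed before any test logic runs, no cleanup path is
needed for the file itself, and an early crash cannot leave it behind.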
Changes in v5:
- Patch 1: replaced local_clock() with tracepoints – no overhead
when tracepoints are disabled.
- Patch 3: switched from vm_pgoff (unstable after VMA split) to a
linear page offset.
- Patch 4: adapted to the linear page offset; added user-impact
description (real workloads, lock contention examples,
VMA splitting scenario).
- Patch 5: simplified to a single process with 32 pages (instead
of multi-process), as suggested by David.
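The reason a stored vm_pgoff goes stale after a VMA split, while a linear
page offset does not, can be shown with a small user-space model. This is
a sketch under assumptions: it takes "linear page offset" to mean the
usual vm_pgoff + ((addr - vm_start) >> PAGE_SHIFT) computation, assumes
4 KiB pages, and uses hypothetical names (toy_vma, linear_pgoff(),
split_vma_at()) that do not appear in the series.

```c
#define TOY_PAGE_SHIFT 12	/* assumes 4 KiB pages for the illustration */

/* Simplified stand-in for the VMA fields that matter here. */
struct toy_vma {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* one past the last byte */
	unsigned long vm_pgoff;	/* page offset corresponding to vm_start */
};

/* Page offset of addr, independent of where the VMA happens to start. */
static unsigned long linear_pgoff(const struct toy_vma *vma,
				  unsigned long addr)
{
	return vma->vm_pgoff + ((addr - vma->vm_start) >> TOY_PAGE_SHIFT);
}

/*
 * Split vma at addr. vma keeps the low part; the high part is returned
 * with vm_pgoff advanced by the number of pages left in the low part.
 */
static struct toy_vma split_vma_at(struct toy_vma *vma, unsigned long addr)
{
	struct toy_vma high = {
		.vm_start = addr,
		.vm_end = vma->vm_end,
		.vm_pgoff = vma->vm_pgoff +
			    ((addr - vma->vm_start) >> TOY_PAGE_SHIFT),
	};

	vma->vm_end = addr;
	return high;
}
```

In this model, the vm_pgoff of the VMA that contains a given address
changes once that VMA is split below the address, so a value cached from
before the split no longer matches. The linear offset of the address is
the same whether it is computed from the original VMA or from the new
upper half.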
Changes in v4:
- Add a tracepoint for rmap_walk
- Provide a testbench for rmap_walk
- Add vm_pgoff field in ksm_rmap_item
- Use vm_pgoff instead of address >> PAGE_SHIFT (Suggested by David and Lorenzo)
Changes in v3:
- Fix some typos in commit description
- Replace 'pgoff_start' and 'pgoff_end' with 'pgoff'.
Changes in v2:
- Use const variable to initialize 'addr', 'pgoff_start' and 'pgoff_end'
- Let pgoff_end = pgoff_start, since KSM folios are always order-0 (Suggested by David)
xu xin (6):
  mm/rmap: add tracepoint for rmap_walk
  tools/testing: add rmap walk latency benchmark
  MAINTAINERS: add myself as reviewer for rmap section
  ksm: add pgoff into ksm_rmap_item
  ksm: Optimize rmap_walk_ksm by passing a suitable address range
  ksm: add mremap selftests for ksm_rmap_walk

 MAINTAINERS                          |   3 +
 include/trace/events/rmap.h          |  67 +++
 mm/ksm.c                             |  48 +-
 mm/rmap.c                            |   9 +
 tools/testing/rmap/Makefile          |  11 +
 tools/testing/rmap/rmap_benchmark.c  | 674 +++++++++++++++++++++++++++
 tools/testing/selftests/mm/rmap.c    |  97 ++++
 tools/testing/selftests/mm/vm_util.c |  47 ++
 tools/testing/selftests/mm/vm_util.h |   2 +
 9 files changed, 950 insertions(+), 8 deletions(-)
 create mode 100644 include/trace/events/rmap.h
 create mode 100644 tools/testing/rmap/Makefile
 create mode 100644 tools/testing/rmap/rmap_benchmark.c
--
2.25.1