Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate

From: Long long Xia

Date: Wed Oct 29 2025 - 03:24:31 EST

Thanks for the reply.

On 2025/10/29 14:40, Miaohe Lin wrote:

On 2025/10/28 15:54, Long long Xia wrote:

Thanks for the reply.

On 2025/10/23 19:54, Miaohe Lin wrote:

On 2025/10/16 18:18, Longlong Xia wrote:

From: Longlong Xia <xialonglong@xxxxxxxxxx>

When a hardware memory error occurs on a KSM page, the current
behavior is to kill all processes mapping that page. This can
be overly aggressive when KSM has multiple duplicate pages in
a chain where other duplicates are still healthy.

This patch introduces a recovery mechanism that attempts to
migrate mappings from the failing KSM page to a newly
allocated KSM page or another healthy duplicate already
present in the same chain, before falling back to the
process-killing procedure.

The recovery process works as follows:
1. Identify if the failing KSM page belongs to a stable node chain.
2. Locate a healthy duplicate KSM page within the same chain.
3. For each process mapping the failing page:
    a. Attempt to allocate a new KSM page copy from healthy duplicate
       KSM page. If successful, migrate the mapping to this new KSM page.
    b. If allocation fails, migrate the mapping to the existing healthy
       duplicate KSM page.
4. If all migrations succeed, remove the failing KSM page from the chain.
5. Only if recovery fails (e.g., no healthy duplicate found or migration
    error) does the kernel fall back to killing the affected processes.

Signed-off-by: Longlong Xia <xialonglong@xxxxxxxxxx>

Thanks for your patch. Some comments below.
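
For readers following the thread, the five recovery steps in the commit message can be modeled in userspace roughly as below. This is only a sketch of the decision flow, not the patch itself: every name here (recover_model, struct model_mapping, the enums) is hypothetical and does not exist in mm/ksm.c, and allocation and migration outcomes are passed in as flags instead of being performed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in mm/ksm.c. */
enum model_result { MODEL_RECOVERED, MODEL_FALLBACK_KILL };
enum model_target { TARGET_NONE, TARGET_NEW_COPY, TARGET_EXISTING_DUP };

struct model_mapping {
    bool alloc_ok;              /* would allocating a new copy succeed? */
    bool migrate_ok;            /* would migrating this mapping succeed? */
    enum model_target target;   /* filled in by recover_model() */
};

static enum model_result recover_model(bool in_chain, bool has_healthy_dup,
                                       struct model_mapping *maps, size_t nr)
{
    assert(maps || nr == 0);

    /* Steps 1-2: a chain and a healthy duplicate in it are required. */
    if (!in_chain || !has_healthy_dup)
        return MODEL_FALLBACK_KILL;

    /* Step 3: pick a target per mapping (3a, else 3b), then migrate. */
    for (size_t i = 0; i < nr; i++) {
        maps[i].target = maps[i].alloc_ok ? TARGET_NEW_COPY
                                          : TARGET_EXISTING_DUP;
        if (!maps[i].migrate_ok)
            return MODEL_FALLBACK_KILL;   /* step 5: migration error */
    }

    /* Step 4: every mapping moved, so the failing page can leave the chain. */
    return MODEL_RECOVERED;
}
```

The point of the model is that killing is the last resort: it is reached only when there is no healthy duplicate or when a migration fails, while an allocation failure merely switches the target to the existing duplicate.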

---
mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 246 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index 160787bb121c..9099bad1ab35 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
}
#ifdef CONFIG_MEMORY_FAILURE
+static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
+{
+    struct ksm_stable_node *stable_node, *dup;
+    struct rb_node *node;
+    int nid;
+
+    if (!is_stable_node_dup(dup_node))
+        return NULL;
+
+    for (nid = 0; nid < ksm_nr_node_ids; nid++) {
+        node = rb_first(root_stable_tree + nid);
+        for (; node; node = rb_next(node)) {
+            stable_node = rb_entry(node,
+                    struct ksm_stable_node,
+                    node);
+
+            if (!is_stable_node_chain(stable_node))
+                continue;
+
+            hlist_for_each_entry(dup, &stable_node->hlist,
+                    hlist_dup) {
+                if (dup == dup_node)
+                    return stable_node;
+            }

May I add cond_resched(); here?

+        }
+    }

Would the nested loops above take a long time in some corner cases?

Thanks for the concern.

I ran some simple tests.

Test 1: 10 Virtual Machines (Real-world Scenario)
Environment: 10 VMs (256MB each) with KSM enabled

KSM State:
pages_sharing: 262,802 (≈1GB)
pages_shared: 17,374 (≈68MB)
pages_unshared: 124,057 (≈485MB)
total ≈1.5GB
chain_count = 9, not_chain_count = 17152
Red-black tree nodes to traverse:
17,161 (9 chains + 17,152 non-chains)

Performance:
find_chain: 898 μs (0.9 ms)
collect_procs_ksm: 4,409 μs (4.4 ms)
Total memory failure handling: 6,135 μs (6.1 ms)

Test 2: 10GB Single Process (Extreme Case)
Environment: Single process with 10GB memory,
1,310,720 page pairs (each pair identical, different from others)

KSM State:
pages_sharing: 1,311,740 (≈5GB)
pages_shared: 1,310,724 (≈5GB)
pages_unshared: 0
total ≈10GB
Red-black tree nodes to traverse:
1,310,721 (1 chain + 1,310,720 non-chains)

Performance:
find_chain: 28,822 μs (28.8 ms)
collect_procs_ksm: 45,944 μs (45.9 ms)
Total memory failure handling: 46,594 μs (46.6 ms)

Thanks for your test.

Summary:
The find_chain function shows approximately linear scaling with the number of red-black tree nodes.
With a 76x increase in nodes (17,161 → 1,310,721), latency increased by 32x (898 μs → 28,822 μs),
and in Test 2 find_chain accounts for 62% of the total memory failure handling time (46.6 ms).
However, since memory failures are rare events, this latency may be acceptable
as it does not impact normal system performance and only affects error recovery paths.
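
The ratios in the summary can be rechecked from the raw numbers of the two tests. The helper names below (ratio, ns_per_node, print_scaling_summary) are made up for this check; only the input figures come from the measurements above. By this arithmetic the average cost per tree node works out to about 52 ns in Test 1 and about 22 ns in Test 2.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of two measurements, e.g. node counts or latencies. */
static double ratio(double large, double small)
{
    assert(small > 0.0);
    return large / small;
}

/* Average traversal cost per red-black tree node, in nanoseconds. */
static double ns_per_node(double latency_us, double nodes)
{
    assert(nodes > 0.0);
    return latency_us * 1000.0 / nodes;
}

/* Print the figures derived from Test 1 and Test 2. */
static void print_scaling_summary(void)
{
    printf("node increase:     %.1fx\n", ratio(1310721.0, 17161.0));
    printf("latency increase:  %.1fx\n", ratio(28822.0, 898.0));
    printf("Test 1 cost/node:  %.1f ns\n", ns_per_node(898.0, 17161.0));
    printf("Test 2 cost/node:  %.1f ns\n", ns_per_node(28822.0, 1310721.0));
    printf("find_chain share:  %.0f%%\n", 100.0 * ratio(28822.0, 46594.0));
}
```

Calling print_scaling_summary() from a small main() prints the five derived values.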

IMHO, a kernel function must not run for too long without any scheduling points.
Otherwise it may affect the normal scheduling of the system and lead to something like performance
fluctuation. Or am I missing something?

Thanks.

I will add cond_resched() in the red-black tree loop to allow scheduling in find_chain(). Would that be enough?
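
A userspace sketch of where the scheduling point could sit, to make the placement concrete. All names are hypothetical stand-ins (struct model_node, model_cond_resched, model_find_chain_head), and sched_yield() only imitates cond_resched(). One detail the model highlights: the quoted loop uses `continue` for non-chain nodes, so a call placed after the hlist loop would be skipped for them, and in Test 2 nearly all nodes are non-chains. Whether rescheduling is allowed at this point also depends on the locks held by the caller, which this sketch does not model.

```c
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's data structures. */
struct model_dup {
    struct model_dup *next;
};

struct model_node {
    bool is_chain;            /* models is_stable_node_chain() */
    struct model_dup *dups;   /* models the hlist of duplicates */
};

static unsigned long yield_points;   /* counts scheduling points taken */

/* Stand-in for cond_resched(): offer the CPU to other runnable tasks. */
static void model_cond_resched(void)
{
    yield_points++;
    sched_yield();
}

/* Return the chain node that owns @target, or NULL if none does. */
static struct model_node *model_find_chain_head(struct model_node *nodes,
                                                size_t nr_nodes,
                                                const struct model_dup *target)
{
    for (size_t i = 0; i < nr_nodes; i++) {
        struct model_node *node = &nodes[i];

        /* Placed before the 'continue' so non-chain nodes hit it too. */
        model_cond_resched();

        if (!node->is_chain)
            continue;

        for (struct model_dup *dup = node->dups; dup; dup = dup->next) {
            if (dup == target)
                return node;
        }
    }
    return NULL;
}
```

With one scheduling point per tree node, the longest stretch without one is bounded by the length of a single chain rather than by the size of the whole tree.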

Best regards,
Longlong Xia