Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
From: Long long Xia
Date: Wed Oct 29 2025 - 03:24:31 EST
Thanks for the reply.
On 2025/10/29 14:40, Miaohe Lin wrote:
On 2025/10/28 15:54, Long long Xia wrote:
May I add cond_resched() here?
Thanks for the reply.
On 2025/10/23 19:54, Miaohe Lin wrote:
On 2025/10/16 18:18, Longlong Xia wrote:
From: Longlong Xia <xialonglong@xxxxxxxxxx>
Thanks for your patch. Some comments below.
When a hardware memory error occurs on a KSM page, the current
behavior is to kill all processes mapping that page. This can
be overly aggressive when KSM has multiple duplicate pages in
a chain where other duplicates are still healthy.
This patch introduces a recovery mechanism that attempts to
migrate mappings from the failing KSM page to a newly
allocated KSM page or another healthy duplicate already
present in the same chain, before falling back to the
process-killing procedure.
The recovery process works as follows:
1. Identify if the failing KSM page belongs to a stable node chain.
2. Locate a healthy duplicate KSM page within the same chain.
3. For each process mapping the failing page:
   a. Attempt to allocate a new KSM page copy from healthy duplicate
      KSM page. If successful, migrate the mapping to this new KSM page.
   b. If allocation fails, migrate the mapping to the existing healthy
      duplicate KSM page.
4. If all migrations succeed, remove the failing KSM page from the chain.
5. Only if recovery fails (e.g., no healthy duplicate found or migration
   error) does the kernel fall back to killing the affected processes.
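The decision flow in the steps above can be sketched as a small userspace simulation. This is only an illustration of the ordering of the fallbacks, not the patch's code: `recover_ksm_page_sim()`, `struct mapping_sim` and its fields are hypothetical names, and allocation or migration outcomes are supplied as plain flags rather than performed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum recover_result { RECOVERED, FALLBACK_KILL };

/* Hypothetical per-mapping state; none of these names exist in mm/ksm.c. */
struct mapping_sim {
	bool alloc_ok;   /* would allocating a fresh KSM copy succeed? */
	bool migrate_ok; /* would remapping to the chosen page succeed? */
	int target;      /* 0 = untouched, 1 = new copy, 2 = existing duplicate */
};

static enum recover_result recover_ksm_page_sim(bool in_chain,
						bool has_healthy_dup,
						struct mapping_sim *maps,
						size_t n)
{
	size_t i;

	assert(maps != NULL || n == 0);

	/* Steps 1-2: need a chain and a healthy duplicate in it. */
	if (!in_chain || !has_healthy_dup)
		return FALLBACK_KILL;

	/* Step 3: handle every mapping of the failing page. */
	for (i = 0; i < n; i++) {
		/* Step 5: any migration error ends recovery (no rollback modelled). */
		if (!maps[i].migrate_ok)
			return FALLBACK_KILL;
		/* Step 3a if allocation works, otherwise step 3b. */
		maps[i].target = maps[i].alloc_ok ? 1 : 2;
	}

	/* Step 4: the caller would now drop the failing page from the chain. */
	return RECOVERED;
}
```

A caller would treat `FALLBACK_KILL` as the signal to continue with the existing process-killing path.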
Signed-off-by: Longlong Xia <xialonglong@xxxxxxxxxx>
---
mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 246 insertions(+)
diff --git a/mm/ksm.c b/mm/ksm.c
index 160787bb121c..9099bad1ab35 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
}
#ifdef CONFIG_MEMORY_FAILURE
+static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
+{
+	struct ksm_stable_node *stable_node, *dup;
+	struct rb_node *node;
+	int nid;
+
+	if (!is_stable_node_dup(dup_node))
+		return NULL;
+
+	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
+		node = rb_first(root_stable_tree + nid);
+		for (; node; node = rb_next(node)) {
+			stable_node = rb_entry(node,
+					struct ksm_stable_node,
+					node);
+
+			if (!is_stable_node_chain(stable_node))
+				continue;
+
+			hlist_for_each_entry(dup, &stable_node->hlist,
+					hlist_dup) {
+				if (dup == dup_node)
+					return stable_node;
+			}
+		}
+	}
Would the multiple loops above take a long time in some corner cases?
Thanks for your test.
Thanks for the concern. I ran some simple tests.
Test 1: 10 Virtual Machines (Real-world Scenario)
  Environment: 10 VMs (256MB each) with KSM enabled
  KSM State:
    pages_sharing:  262,802 (≈1GB)
    pages_shared:   17,374 (≈68MB)
    pages_unshared: 124,057 (≈485MB)
    total:          ≈1.5GB
    chain_count: 9, not_chain_count: 17,152
  Red-black tree nodes to traverse:
    17,161 (9 chains + 17,152 non-chains)
  Performance:
    find_chain: 898 μs (0.9 ms)
    collect_procs_ksm: 4,409 μs (4.4 ms)
    Total memory failure handling: 6,135 μs (6.1 ms)
Test 2: 10GB Single Process (Extreme Case)
  Environment: Single process with 10GB memory,
    1,310,720 page pairs (each pair identical, different from others)
  KSM State:
    pages_sharing:  1,311,740 (≈5GB)
    pages_shared:   1,310,724 (≈5GB)
    pages_unshared: 0
    total:          ≈10GB
  Red-black tree nodes to traverse:
    1,310,721 (1 chain + 1,310,720 non-chains)
  Performance:
    find_chain: 28,822 μs (28.8 ms)
    collect_procs_ksm: 45,944 μs (45.9 ms)
    Total memory failure handling: 46,594 μs (46.6 ms)
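As a rough cross-check, the ratios and per-node costs implied by the two find_chain measurements can be worked out directly from the reported figures. The symbols N (nodes traversed) and t (find_chain latency), with subscripts 1 and 2 for the two tests, are notation introduced here only for this calculation.

```latex
\begin{align*}
\frac{N_2}{N_1} &= \frac{1{,}310{,}721}{17{,}161} \approx 76.4, &
\frac{t_2}{t_1} &= \frac{28{,}822\ \mu\text{s}}{898\ \mu\text{s}} \approx 32.1, \\
\frac{t_1}{N_1} &= \frac{898\ \mu\text{s}}{17{,}161} \approx 52\ \text{ns per node}, &
\frac{t_2}{N_2} &= \frac{28{,}822\ \mu\text{s}}{1{,}310{,}721} \approx 22\ \text{ns per node}.
\end{align*}
```

The per-node cost is lower in the larger test, so the measured growth is somewhat below strictly linear. With only two data points this is an observation rather than a fitted trend.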
Summary:
The find_chain function shows approximately linear scaling with the number of red-black tree nodes.
With a 76x increase in nodes (17,161 → 1,310,721), latency increased by 32x (898 μs → 28,822 μs).
In Test 2, find_chain represents 62% of the total memory failure handling time (28.8 ms of 46.6 ms).
However, since memory failures are rare events, this latency may be acceptable,
as it does not impact normal system performance and only affects error recovery paths.
IMHO, the execution time of a kernel function must not be too long without any scheduling points.
Otherwise it may affect the normal scheduling of the system and lead to something like performance
fluctuation. Or am I missing something?
Thanks.
I will add cond_resched() in the red-black tree loop so that find_chain() has a scheduling point. Would that be enough?
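A minimal userspace sketch of where such a scheduling point could sit, once per tree node rather than once per duplicate. Everything here is a stand-in: `cond_resched()` is a stub that only counts calls, and `struct stable_node_sim`, `struct dup_sim` and `find_chain_head_sim()` are hypothetical simplifications that use singly linked lists in place of the rbtree and hlist. In the kernel this placement assumes the walk is not done in atomic context (for example with a spinlock held), which the sketch does not model.

```c
#include <assert.h>
#include <stddef.h>

/* Stub: the real cond_resched() may yield the CPU; this one only counts. */
static unsigned long resched_points;
static void cond_resched(void)
{
	resched_points++;
}

/* Hypothetical stand-ins for the stable tree nodes and their duplicates. */
struct dup_sim {
	struct dup_sim *next;
};

struct stable_node_sim {
	int is_chain;                 /* non-zero if this node is a chain head */
	struct dup_sim *dups;         /* duplicates hanging off a chain head */
	struct stable_node_sim *next; /* next node in traversal order */
};

static struct stable_node_sim *
find_chain_head_sim(struct stable_node_sim *first, const struct dup_sim *target)
{
	struct stable_node_sim *sn;
	struct dup_sim *d;

	assert(target != NULL);

	for (sn = first; sn; sn = sn->next) {
		/* One scheduling point per tree node visited. */
		cond_resched();

		if (!sn->is_chain)
			continue;

		for (d = sn->dups; d; d = d->next) {
			if (d == target)
				return sn;
		}
	}
	return NULL;
}
```

Putting the call in the outer loop bounds the work between scheduling points to one chain's duplicate list, whose length is limited by max_page_sharing-style tuning rather than by the total tree size.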
Best regards,
Longlong Xia