Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate

From: Miaohe Lin

Date: Wed Oct 29 2025 - 02:40:18 EST

On 2025/10/28 15:54, Long long Xia wrote:
> Thanks for the reply.
>
> On 2025/10/23 19:54, Miaohe Lin wrote:
>> On 2025/10/16 18:18, Longlong Xia wrote:
>>> From: Longlong Xia <xialonglong@xxxxxxxxxx>
>>>
>>> When a hardware memory error occurs on a KSM page, the current
>>> behavior is to kill all processes mapping that page. This can
>>> be overly aggressive when KSM has multiple duplicate pages in
>>> a chain where other duplicates are still healthy.
>>>
>>> This patch introduces a recovery mechanism that attempts to
>>> migrate mappings from the failing KSM page to a newly
>>> allocated KSM page or another healthy duplicate already
>>> present in the same chain, before falling back to the
>>> process-killing procedure.
>>>
>>> The recovery process works as follows:
>>> 1. Identify if the failing KSM page belongs to a stable node chain.
>>> 2. Locate a healthy duplicate KSM page within the same chain.
>>> 3. For each process mapping the failing page:
>>>     a. Attempt to allocate a new KSM page copy from healthy duplicate
>>>        KSM page. If successful, migrate the mapping to this new KSM page.
>>>     b. If allocation fails, migrate the mapping to the existing healthy
>>>        duplicate KSM page.
>>> 4. If all migrations succeed, remove the failing KSM page from the chain.
>>> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>>>     error) does the kernel fall back to killing the affected processes.
>>>
>>> Signed-off-by: Longlong Xia <xialonglong@xxxxxxxxxx>
>> Thanks for your patch. Some comments below.
>>
>>> ---
>>> mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 246 insertions(+)
>>>
>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>> index 160787bb121c..9099bad1ab35 100644
>>> --- a/mm/ksm.c
>>> +++ b/mm/ksm.c
>>> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>>> }
>>> #ifdef CONFIG_MEMORY_FAILURE
>>> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
>>> +{
>>> +    struct ksm_stable_node *stable_node, *dup;
>>> +    struct rb_node *node;
>>> +    int nid;
>>> +
>>> +    if (!is_stable_node_dup(dup_node))
>>> +        return NULL;
>>> +
>>> +    for (nid = 0; nid < ksm_nr_node_ids; nid++) {
>>> +        node = rb_first(root_stable_tree + nid);
>>> +        for (; node; node = rb_next(node)) {
>>> +            stable_node = rb_entry(node,
>>> +                    struct ksm_stable_node,
>>> +                    node);
>>> +
>>> +            if (!is_stable_node_chain(stable_node))
>>> +                continue;
>>> +
>>> +            hlist_for_each_entry(dup, &stable_node->hlist,
>>> +                    hlist_dup) {
>>> +                if (dup == dup_node)
>>> +                    return stable_node;
>>> +            }
>>> +        }
>>> +    }
>> Would above multiple loops take a long time in some corner cases?
>
> Thanks for the concern.
>
> I ran some simple tests.
>
> Test 1: 10 Virtual Machines (Real-world Scenario)
> Environment: 10 VMs (256MB each) with KSM enabled
>
> KSM State:
> pages_sharing: 262,802 (≈1GB)
> pages_shared: 17,374 (≈68MB)
> pages_unshared: 124,057 (≈485MB)
> total ≈1.5GB
> chain_count = 9, not_chain_count = 17152
> Red-black tree nodes to traverse:
> 17,161 (9 chains + 17,152 non-chains)
>
> Performance:
> find_chain: 898 μs (0.9 ms)
> collect_procs_ksm: 4,409 μs (4.4 ms)
> Total memory failure handling: 6,135 μs (6.1 ms)
>
>
> Test 2: 10GB Single Process (Extreme Case)
> Environment: Single process with 10GB memory,
> 1,310,720 page pairs (each pair identical, different from others)
>
> KSM State:
> pages_sharing: 1,311,740 (≈5GB)
> pages_shared: 1,310,724 (≈5GB)
> pages_unshared: 0
> total ≈10GB
> Red-black tree nodes to traverse:
> 1,310,721 (1 chain + 1,310,720 non-chains)
>
> Performance:
> find_chain: 28,822 μs (28.8 ms)
> collect_procs_ksm: 45,944 μs (45.9 ms)
> Total memory failure handling: 46,594 μs (46.6 ms)

Thanks for your test.

>
> Summary:
> The find_chain function scales approximately linearly with the number of red-black tree nodes.
> With a 76x increase in nodes (17,161 → 1,310,721), latency increased by 32x (898 μs → 28,822 μs),
> which in Test 2 is 62% of the total memory failure handling time (46.6 ms).
> However, since memory failures are rare events, this latency may be acceptable:
> it does not impact normal system performance and only affects error recovery paths.
>
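For reference, the scaling figures quoted above can be rechecked with a few lines of C. This is only a sketch: the constants are copied from the two tests, while the helper names growth(), ns_per_node() and print_scaling() are made up for illustration.

```c
#include <stdio.h>

/* Figures copied from the two quoted tests. */
#define T1_NODES    17161.0     /* rb-tree nodes, Test 1 */
#define T2_NODES    1310721.0   /* rb-tree nodes, Test 2 */
#define T1_FIND_US  898.0       /* find_chain latency, Test 1 (us) */
#define T2_FIND_US  28822.0     /* find_chain latency, Test 2 (us) */

/* Hypothetical helper: growth factor between two measurements. */
double growth(double before, double after)
{
    return after / before;
}

/* Hypothetical helper: average cost per visited node in nanoseconds. */
double ns_per_node(double usec, double nodes)
{
    return usec * 1000.0 / nodes;
}

/* Print the node growth, latency growth and per-node cost of both tests. */
void print_scaling(void)
{
    printf("nodes:   %.1fx\n", growth(T1_NODES, T2_NODES));
    printf("latency: %.1fx\n", growth(T1_FIND_US, T2_FIND_US));
    printf("test 1:  %.1f ns/node\n", ns_per_node(T1_FIND_US, T1_NODES));
    printf("test 2:  %.1f ns/node\n", ns_per_node(T2_FIND_US, T2_NODES));
}
```

With these inputs the per-node cost comes out at roughly 52 ns in Test 1 and 22 ns in Test 2, so the two data points suggest growth that is somewhat better than linear rather than exactly linear.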

IMHO, a kernel function must not run for too long without any scheduling points.
Otherwise it may affect the normal scheduling of the system and lead to something like performance
fluctuation. Or am I missing something?
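
To make the concern concrete, here is a minimal userspace sketch of what a scheduling point in such a walk could look like. Everything in it is an assumption for illustration: the struct layouts, find_chain_head_model(), YIELD_INTERVAL and yield_count are hypothetical and not taken from the patch, and sched_yield() only stands in for the kernel's cond_resched(), which would additionally require that the walk runs in a context where sleeping is allowed.

```c
#include <sched.h>    /* sched_yield(): userspace stand-in for cond_resched() */
#include <stddef.h>

/* Hypothetical, simplified stand-in for a duplicate in a chain. */
struct dup_node {
    struct dup_node *next;      /* next duplicate in the same chain */
};

/* Hypothetical, simplified stand-in for a stable tree entry. */
struct tree_node {
    struct tree_node *next;     /* in-order successor, models rb_next() */
    int is_chain;               /* models is_stable_node_chain() */
    struct dup_node *dups;      /* models the hlist of duplicates */
};

#define YIELD_INTERVAL 1024     /* assumed batch size, not from the patch */

unsigned long yield_count;      /* number of yield points hit, for illustration */

/*
 * Walk every tree entry and, for chains, every duplicate, looking for
 * the chain that contains 'target'. Every YIELD_INTERVAL visited tree
 * entries the walk offers the CPU to other runnable tasks.
 */
struct tree_node *find_chain_head_model(struct tree_node *first,
                                        const struct dup_node *target)
{
    unsigned long visited = 0;
    struct tree_node *node;
    struct dup_node *dup;

    for (node = first; node; node = node->next) {
        if (++visited % YIELD_INTERVAL == 0) {
            sched_yield();      /* the kernel would use cond_resched() here */
            yield_count++;
        }

        if (!node->is_chain)
            continue;

        for (dup = node->dups; dup; dup = dup->next) {
            if (dup == target)
                return node;
        }
    }

    return NULL;
}
```

The batching is only there to keep the model's yield count easy to reason about; whether a scheduling point is legal at this place depends on the locks held by the caller, which the sketch does not model.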

Thanks.