Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures

From: Lance Yang

Date: Mon Oct 13 2025 - 07:00:45 EST

On 2025/10/13 17:25, David Hildenbrand wrote:

On 13.10.25 11:15, Lance Yang wrote:

@David

Cc: MM CORE folks

On 2025/10/13 12:42, Lance Yang wrote:
[...]

Cool. Hardware error injection with EINJ was the way to go!

I just ran some tests on the shared zero page (both regular and huge), and
found a tricky behavior:

1) When a hardware error is injected into the zeropage, the process that
attempts to read from a mapping backed by it is correctly killed with a
SIGBUS.

2) However, even after the error is detected, the kernel continues to install
the known-poisoned zeropage for new anonymous mappings ...
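
For reference, the read side of such a test can be sketched as below. This is only a minimal illustration, not the actual read_zeropage program; the helper name read_zero_backed() is made up here. It relies on the usual behavior that a read fault on a never-written private anonymous mapping is served by the shared zeropage. The huge zeropage variant would additionally need THP enabled and a PMD-aligned region.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map 'len' bytes of private anonymous memory and
 * read one byte per page without ever writing to it. Each read fault is
 * expected to map the shared zeropage, so the sum should be 0.
 * Returns -1 if mmap() fails. If the zeropage is hardware-poisoned, the
 * read is where SIGBUS would be delivered.
 */
long read_zero_backed(size_t len, size_t page_size)
{
	volatile unsigned char *p;
	long sum = 0;
	size_t off;

	p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Touch every page with a read only. */
	for (off = 0; off < len; off += page_size)
		sum += p[off];

	munmap((void *)p, len);
	return sum;
}
```

Calling read_zero_backed(4 * 4096, 4096) on a healthy system should simply return 0.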

For the shared zeropage:
```
[Mon Oct 13 16:29:02 2025] mce: Uncorrected hardware memory error in user-access at 29b8cf5000
[Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: Sending SIGBUS to read_zeropage:13767 due to hardware memory corruption
[Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: recovery action for already poisoned page: Failed
```
And for the shared huge zeropage:
```
[Mon Oct 13 16:35:34 2025] mce: Uncorrected hardware memory error in user-access at 1e1e00000
[Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: Sending SIGBUS to read_huge_zerop:13891 due to hardware memory corruption
[Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: recovery action for already poisoned page: Failed
```
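
As a sanity check on the logs, the pfn in the "Memory failure" lines is just the physical address from the mce line shifted right by the page shift. A small sketch of that arithmetic, assuming 4 KiB base pages and 2 MiB PMD-sized huge pages (the SKETCH_* constants and helper names are illustrative, not kernel definitions):

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12                     /* assumption: 4 KiB base pages */
#define SKETCH_PMD_SIZE   (UINT64_C(1) << 21)    /* assumption: 2 MiB huge pages */

/* Physical address -> page frame number. */
static inline uint64_t phys_to_pfn(uint64_t phys)
{
	return phys >> SKETCH_PAGE_SHIFT;
}

/* True if the physical address sits on a PMD (huge page) boundary. */
static inline bool is_pmd_aligned(uint64_t phys)
{
	return (phys & (SKETCH_PMD_SIZE - 1)) == 0;
}
```

With these, 0x29b8cf5000 maps to pfn 0x29b8cf5 and 0x1e1e00000 maps to pfn 0x1e1e00, matching the logs, and only the second address is 2 MiB aligned, as one would expect for the first page of the huge zeropage.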

Since we've identified an uncorrectable hardware error on such a critical,
singleton page, should we be doing something more?

I mean, regarding the shared zeropage, we could try walking all page tables of all processes and replace it with a fresh shared zeropage.

But then, the page might also be used for other things (I/O etc), the shared zeropage is allocated by the architecture, we'd have to make is_zero_pfn() succeed on the old+new page etc ...

So a lot of work for little benefit I guess? The question is how often we would see that in practice. I'd assume we'd see it happen on random kernel memory more frequently where we can really just bring down the whole machine.

Thanks for your thoughts!

I agree, fixing the regular zeropage is a real mess ...

But for the huge zeropage, what if we just stop installing it once it's
poisoned? We could just disable it globally. Something like this:

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f698df156bf8..8543f4385ffe 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2193,6 +2193,10 @@ int memory_failure(unsigned long pfn, int flags)
 	if (!(flags & MF_SW_SIMULATED))
 		hw_memory_failure = true;
 
+	if (is_huge_zero_pfn(pfn))
+		clear_bit(TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
+			  &transparent_hugepage_flags);
+
 	p = pfn_to_online_page(pfn);
 	if (!p) {
 		res = arch_memory_failure(pfn, flags);

Seems easy enough ...
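
To make the intended effect concrete, here is a tiny user-space model of the idea. It is not kernel code: every model_* name is hypothetical, and the huge zero pfn is simply hard-coded to the value from the log above. It only shows the behavior the diff is after, namely that a failure reported on the huge zeropage clears the "use zero page" bit so that later faults would stop installing it, while failures on other pfns leave the bit alone. Mappings that already point at the poisoned page are not touched by this.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG. */
#define MODEL_USE_ZERO_PAGE_BIT 0

/* Hypothetical stand-in for transparent_hugepage_flags. */
static atomic_ulong model_thp_flags = 1UL << MODEL_USE_ZERO_PAGE_BIT;

/* Assumption: the huge zeropage lives at the pfn seen in the log. */
static const unsigned long model_huge_zero_pfn = 0x1e1e00;

/* Models the hunk added to memory_failure(). */
void model_memory_failure(unsigned long pfn)
{
	if (pfn == model_huge_zero_pfn)
		atomic_fetch_and(&model_thp_flags,
				 ~(1UL << MODEL_USE_ZERO_PAGE_BIT));
}

/* Models the check a fault path would make before using the page. */
bool model_may_use_huge_zero_page(void)
{
	return atomic_load(&model_thp_flags) &
	       (1UL << MODEL_USE_ZERO_PAGE_BIT);
}
```

In this model, reporting pfn 0x29b8cf5 changes nothing, whereas reporting pfn 0x1e1e00 makes model_may_use_huge_zero_page() return false from then on.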