Re: [RFC 4/7] mm: add page consistency checker implementation

From: David Hildenbrand (Arm)

Date: Mon Apr 27 2026 - 15:37:15 EST

>>
>> Thanks, but I fundamentally don't understand how RAS capabilities interact here?
>> We have mm/memory-failure.c for a reason :)
>
> We do, but self-driving safety requires way more than the current hardware can
> provide.
>
> I'll point you to https://dl.acm.org/doi/10.1145/2775054.2694348 , which
> researched these issues in a datacenter environment (so no sun exposure,
> temperature controlled, designed to avoid electromagnetic interference).
>
> "We call a fault that generates an error larger than 2 bits in an ECC word an
> undetectable-by-SECDED fault. A fault is undetectable-by-SECDED if it affects
> more than two bits in any ECC word, and the data written to that location does
> not match the value produced by the fault."
>
> [...]
>
> "A Cielo node has 288 DRAM devices, so this translates to 6048, 518, and 57.6
> FIT per node for vendors A, B, and C, respectively. This translates to one
> undetected error every 0.8 days, every 9.5 days, and every 85 days on a machine
> the size of Cielo."
>
> [...]
>
> "Our main conclusion from this data is that SEC-DED ECC is poorly suited to
> modern DRAM subsystems. The rate of undetected errors is too high to justify
> its use in very large scale systems comprised of thousands of nodes where
> fidelity of results is critical."

Yes, I have read before that ECC is insufficient to detect certain bitflips.
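
For scale, a rough sketch of the arithmetic behind the quoted figures. FIT is
failures per 10^9 hours of operation. The node count of 8500 is an assumption
(it is not stated in the quoted text), chosen because it lands close to the
quoted 0.8, 9.5, and 85 day intervals; the function name is made up for
illustration.

```python
# FIT = failures per 10^9 hours of operation.
HOURS_PER_FIT_UNIT = 1e9

# Per-node FIT rates and device count taken from the quoted text.
FIT_PER_NODE = {"A": 6048.0, "B": 518.0, "C": 57.6}
DEVICES_PER_NODE = 288

# Assumption: machine size is NOT given in the quoted text.
ASSUMED_NODES = 8500


def days_between_errors(fit_per_node, nodes):
    """Mean days between undetected errors across `nodes` nodes."""
    machine_fit = fit_per_node * nodes
    return HOURS_PER_FIT_UNIT / machine_fit / 24.0


for vendor, fit in FIT_PER_NODE.items():
    per_device = fit / DEVICES_PER_NODE
    days = days_between_errors(fit, ASSUMED_NODES)
    print(f"vendor {vendor}: {per_device:.1f} FIT/device, "
          f"one undetected error every {days:.1f} days")
```

Calling days_between_errors() with nodes=1 gives the per-node interval instead
of the machine-wide one.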

But I don't understand how this patch set is going to move the needle in any
reasonable way.

You have your magical self-driving car algorithm.

Bitflips can corrupt your algorithm, your data, the kernel image, your user page
tables, your kernel page tables. Even a pointer to a bitmap :)

... and we worry about the state of allocated vs. free pages.

Please enlighten me!

>
> The passengers you've mentioned before would be excited if they knew how high
> the bar is around their safety :)

Heh :)

--
Cheers,

David