Re: [RFC 4/7] mm: add page consistency checker implementation

From: David Hildenbrand (Arm)

Date: Tue Apr 28 2026 - 03:36:58 EST


On 4/28/26 01:24, Sasha Levin wrote:
> On Mon, Apr 27, 2026 at 09:37:02PM +0200, David Hildenbrand (Arm) wrote:
>>
>>>
>>> We do, but self driving safety requires way more than the current hardware can
>>> provide.
>>>
>>> I'll point you to https://dl.acm.org/doi/10.1145/2775054.2694348 , which
>>> researched these issues in a datacenter environment (so no sun exposure,
>>> temperature controlled, designed to avoid electromagnetic interference).
>>>
>>> "We call a fault that generates an error larger than 2 bits in an ECC word an
>>> undetectable-by-SECDED fault. A fault is undetectable-by-SECDED if it affects
>>> more than two bits in any ECC word, and the data written to that location does
>>> not match the value produced by the fault."
>>>
>>> [...]
>>>
>>> "A Cielo node has 288 DRAM devices, so this translates to 6048, 518, and 57.6
>>> FIT per node for vendors A, B, and C, respectively. This translates to one
>>> undetected error every 0.8 days, every 9.5 days, and every 85 days on a machine
>>> the size of Cielo."
>>>
>>> [...]
>>>
>>> "Our main conclusion from this data is that SEC-DED ECC is poorly suited to
>>> modern DRAM subsystems. The rate of undetected errors is too high to justify
>>> its use in very large scale systems comprised of thousands of nodes where
>>> fidelity of results is critical."
>>
>> Yes, I read before that ECC is insufficient to detect certain bitflips.
>>
>> But I don't understand how this patch set here is going to move the needle in
>> any reasonable way?
>>
>> You have your magical self-driving car algorithm.
>>
>> Bitflips can corrupt your algorithm, your data, the kernel image, your user page
>> tables, your kernel page tables. Even a pointer to a bitmap :)
>>
>> ... and we worry about the state of allocated vs. free pages.
>
> Do we agree that this is one piece of a (much) larger puzzle that we would need
> to tackle?

Once you have solved the really hard problems (corruption of random page state,
page tables, all of that), we can think about whether adding complexity to the
page allocator to detect possible corruption is worthwhile.

As you state in your reply to Vlasta, the buddy keeps free pages in a list. So a
pointer corruption there would be rather fatal, and I don't follow how the
approach here makes things any better.
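
To illustrate the point with a toy userspace sketch (this is not kernel code,
and not what the patch does): every name below is made up, pages are linked by
index rather than by pointer so the example cannot crash, and the shadow bitmap
is only an assumption about the kind of allocated/free state a checker might
track.

```c
#include <assert.h>
#include <stdint.h>

#define NPAGES 8
#define NIL    0xFFu	/* hypothetical end-of-list marker */

/* Toy free list: pages are linked by index instead of by pointer. */
struct toy_zone {
	uint8_t head;		/* index of the first free page, or NIL */
	uint8_t next[NPAGES];	/* next[i]: free page that follows page i */
	uint8_t free_map;	/* shadow bitmap: bit i set => page i is free */
};

/* All pages start out free and chained 0 -> 1 -> ... -> 7 -> NIL. */
static void toy_init(struct toy_zone *z)
{
	z->head = 0;
	for (uint8_t i = 0; i < NPAGES; i++)
		z->next[i] = (i + 1 < NPAGES) ? (uint8_t)(i + 1) : NIL;
	z->free_map = 0xFF;
}

/*
 * Walk the free list. Returns the number of pages reached, or -1 if a
 * link points outside the array or the walk does not terminate.
 */
static int toy_walk(const struct toy_zone *z)
{
	int n = 0;

	for (uint8_t i = z->head; i != NIL; i = z->next[i]) {
		if (i >= NPAGES || n >= NPAGES)
			return -1;
		n++;
	}
	return n;
}

/* Count the pages the shadow bitmap believes are free. */
static int toy_bitmap_free(const struct toy_zone *z)
{
	int n = 0;

	for (int i = 0; i < NPAGES; i++)
		n += (z->free_map >> i) & 1;
	return n;
}

/* Simulate a single bitflip in one list link. */
static void toy_flip_link(struct toy_zone *z, int page, int bit)
{
	assert(page >= 0 && page < NPAGES && bit >= 0 && bit < 8);
	z->next[page] ^= (uint8_t)(1u << bit);
}
```

Flipping bit 2 of next[2] turns the link 3 into 7: toy_walk() then reaches only
4 pages while toy_bitmap_free() still reports 8. The per-page free/allocated
state looks perfectly consistent, yet half of the list is gone.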

So for the time being, I don't think this proposal moves the needle in any
reasonable way, and I don't think we want this any time soon.

--
Cheers,

David