Re: [RFC 4/7] mm: add page consistency checker implementation
From: Sasha Levin
Date: Mon Apr 27 2026 - 19:25:36 EST
On Mon, Apr 27, 2026 at 09:37:02PM +0200, David Hildenbrand (Arm) wrote:
Thanks, but I fundamentally don't understand how RAS capabilities interact here?
We have mm/memory-failure.c for a reason :)
We do, but self-driving safety requires far more than current hardware can
provide.
I'll point you to https://dl.acm.org/doi/10.1145/2775054.2694348 , which
researched these issues in a datacenter environment (so no sun exposure,
temperature controlled, designed to avoid electromagnetic interference).
"We call a fault that generates an error larger than 2 bits in an ECC word an
undetectable-by-SECDED fault. A fault is undetectable-by-SECDED if it affects
more than two bits in any ECC word, and the data written to that location does
not match the value produced by the fault."
[...]
"A Cielo node has 288 DRAM devices, so this translates to 6048, 518, and 57.6
FIT per node for vendors A, B, and C, respectively. This translates to one
undetected error every 0.8 days, every 9.5 days, and every 85 days on a machine
the size of Cielo."
[...]
"Our main conclusion from this data is that SEC-DED ECC is poorly suited to
modern DRAM subsystems. The rate of undetected errors is too high to justify
its use in very large scale systems comprised of thousands of nodes where
fidelity of results is critical."
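As a rough cross-check of the quoted figures: FIT is failures per 10^9 hours of operation, so a per-node FIT rate scales to a whole machine by multiplying by the node count. The sketch below reproduces the quoted intervals. The node count of 8500 is an assumption (it is not in the quoted excerpt and was chosen because it makes the quoted numbers line up), and the function names are purely illustrative.

```python
# FIT = failures in time, i.e. failures per 10^9 hours of operation.
HOURS_PER_FIT_WINDOW = 1e9

DEVICES_PER_NODE = 288   # from the quoted text
ASSUMED_NODES = 8500     # ASSUMPTION: not stated in the quoted excerpt


def fit_per_device(fit_node, devices=DEVICES_PER_NODE):
    """Per-DRAM-device FIT implied by a per-node FIT rate."""
    return fit_node / devices


def days_between_errors(fit_node, nodes=ASSUMED_NODES):
    """Mean days between undetected errors across the whole machine."""
    failures_per_hour = fit_node * nodes / HOURS_PER_FIT_WINDOW
    return 1.0 / failures_per_hour / 24.0


# Per-node FIT rates for vendors A, B, and C, from the quoted text.
for vendor, fit_node in (("A", 6048), ("B", 518), ("C", 57.6)):
    print(f"vendor {vendor}: {fit_per_device(fit_node):.1f} FIT/device, "
          f"one undetected error every {days_between_errors(fit_node):.1f} days")
```

Under that assumed node count the results round to the 0.8, 9.5, and 85 days quoted above.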
Yes, I read before that ECC is insufficient to detect certain bitflips.
But I don't understand how this patch set here is going to move the needle in
any reasonable way?
You have your magical self-driving car algorithm.
Bitflips can corrupt your algorithm, your data, the kernel image, your user page
tables, your kernel page tables. Even a pointer to a bitmap :)
... and we worry about the state of allocated vs. free pages.
Do we agree that this is one piece of a (much) larger puzzle that we would need
to tackle?
Please enlighten me!
Definitely! This is a pretty hefty body of work, so besides getting the code
out there, we're also working on documentation, talks, webinars, etc. in
the context of ELISA (https://elisa.tech/).
The concept itself was approved by an independent assessor as compliant with
the relevant safety standard, so the story is there; we're just working on
getting it out.
--
Thanks,
Sasha