On Tue, Feb 11, 2025 at 08:25:58AM -0800, Dave Hansen wrote:
> > arch_memory_failure() but stay on sgx_active_page_list.
> > page->poison is not checked in the reclaimer logic, meaning that a page could be
> > reclaimed and go through ETRACK, EBLOCK and EWB. This can lead to the
> > firmware receiving an MCE in one of those operations and going into
> > "unbreakable shutdown" and triggering a kernel panic on remaining cores.
>
> This requires low-level SGX implementation knowledge to fully
> understand. Both what "ETRACK, EBLOCK and EWB" are in the first place,
> how they are involved in reclaim and also why EREMOVE doesn't lead to
> the same fate.
Does it? [I'll dig up Intel SDM to check this]