Re: [PATCH v10 0/6] mm/memory-failure: add panic option for unrecoverable pages

From: Andrew Morton

Date: Fri Jun 26 2026 - 12:35:05 EST


On Fri, 26 Jun 2026 08:33:14 -0700 Breno Leitao <leitao@xxxxxxxxxx> wrote:

> A multi-bit ECC error on a kernel-owned page that the memory failure
> handler cannot recover is currently swallowed: PG_hwpoison is set, the
> event is logged, and the kernel keeps running. The corrupted memory
> remains accessible to the kernel and either drives silent data
> corruption or surfaces seconds-to-minutes later as an apparently
> unrelated crash. In a large fleet that delayed, unattributable crash
> turns into significant engineering effort to root-cause; in a kdump
> configuration, by the time the crash happens the original error
> context (faulting PFN, MCE/GHES record, page state) is long gone.
>
> This series adds an opt-in sysctl,
> vm.panic_on_unrecoverable_memory_failure, that converts an
> unrecoverable kernel-page hwpoison event into an immediate panic with
> a clean dmesg/vmcore that still contains the original failure
> context. The default is disabled so existing workloads see no
> change.

Cool, thanks. I added this to mm.git's mm-new branch. Next week I'll
move it into the mm-unstable branch, where it will receive linux-next
exposure.

Sashiko flagged a few possible issues, some of them pre-existing:

https://sashiko.dev/#/patchset/20260626-ecc_panic-v10-0-6dacb8ad024d@xxxxxxxxxx