Re: spurious (?) mce Hardware Error messages in v6.19

From: Bert Karwatzki

Date: Sun Apr 05 2026 - 04:47:44 EST


On Friday, 03.04.2026 at 16:05 +0200, Borislav Petkov wrote:
> On Mon, Feb 23, 2026 at 04:53:16PM -0500, Yazen Ghannam wrote:
> > Thanks Bert for confirming.
> >
> > I'll send a patch to filter this signature.
>
> Bert, pls try this:
>
> From: Yazen Ghannam <yazen.ghannam@xxxxxxx>
> Date: Sat, 28 Feb 2026 09:08:14 -0500
> Subject: [PATCH] x86/mce/amd: Filter bogus hardware errors on Zen3 clients
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> Users have been observing multiple L3 cache deferred errors after recent
> kernel rework of deferred error handling.¹ ⁴
>
> The errors are bogus due to inconsistent status values. Also, user verified
> that bogus MCA_DESTAT values are present on the system even with an older
> kernel.²
>
> The errors seem to be garbage values present in the MCA_DESTAT of some L3
> cache banks. These were implicitly ignored before the recent kernel rework
> because these do not generate a deferred error interrupt.
>
> A later revision of the rework patch was merged for v6.19. This naturally
> filtered out most of the bogus error logs. However, a few signatures still
> remain.³
>
> Minimize the scope of the filter to the reported CPU
> family/model/stepping and only for errors which don't have the Enabled
> bit in the MCi status MSR.
>
> ¹ https://lore.kernel.org/20250915010010.3547-1-spasswolf@xxxxxx
> ² https://lore.kernel.org/6e1eda7dd55f6fa30405edf7b0f75695cf55b237.camel@xxxxxx
> ³ https://lore.kernel.org/21ba47fa8893b33b94370c2a42e5084cf0d2e975.camel@xxxxxx
> ⁴ https://lore.kernel.org/r/CAKFB093B2k3sKsGJ_QNX1jVQsaXVFyy=wNwpzCGLOXa_vSDwXw@xxxxxxxxxxxxxx
>
> [ bp: Generalize the condition according to which errors are bogus. ]
>
> Fixes: 7cb735d7c0cb ("x86/mce: Unify AMD DFR handler with MCA Polling")
> Closes: https://lore.kernel.org/20250915010010.3547-1-spasswolf@xxxxxx
> Reported-by: Bert Karwatzki <spasswolf@xxxxxx>
> Signed-off-by: Yazen Ghannam <yazen.ghannam@xxxxxxx>
> Signed-off-by: Borislav Petkov (AMD) <bp@xxxxxxxxx>
> Reviewed-by: Mario Limonciello <mario.limonciello@xxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Link: https://lore.kernel.org/20250915010010.3547-1-spasswolf@xxxxxx
> ---
> arch/x86/kernel/cpu/mce/amd.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> index 146f4207a863..7fc78759cd4e 100644
> --- a/arch/x86/kernel/cpu/mce/amd.c
> +++ b/arch/x86/kernel/cpu/mce/amd.c
> @@ -606,6 +606,14 @@ bool amd_filter_mce(struct mce *m)
>  	enum smca_bank_types bank_type = smca_get_bank_type(m->extcpu, m->bank);
>  	struct cpuinfo_x86 *c = &boot_cpu_data;
>
> +	/* Bogus hw errors on Cezanne A0. */
> +	if (c->x86 == 0x19 &&
> +	    c->x86_model == 0x50 &&
> +	    c->x86_stepping == 0x0) {
> +		if (!(m->status & MCI_STATUS_EN))
> +			return true;
> +	}
> +
>  	/* See Family 17h Models 10h-2Fh Erratum #1114. */
>  	if (c->x86 == 0x17 &&
>  	    c->x86_model >= 0x10 && c->x86_model <= 0x2F &&
> --
> 2.51.0
>

I tested this patch on v6.19.11. Since these bogus messages are pretty rare,
I added a monitoring printk():

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 159f0becf8cc..54fa3863ea0b 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -608,8 +608,10 @@ bool amd_filter_mce(struct mce *m)
 	if (c->x86 == 0x19 &&
 	    c->x86_model == 0x50 &&
 	    c->x86_stepping == 0x0) {
-		if (!(m->status & MCI_STATUS_EN))
+		if (!(m->status & MCI_STATUS_EN)) {
+			printk(KERN_INFO "%s: filtering bogus hw error on Cezanne A0\n", __func__);
 			return true;
+		}
 	}

 	/* See Family 17h Models 10h-2Fh Erratum #1114. */

After ~12h of uptime I got the message that a bogus error was filtered:
[42603.594231] [ C0] amd_filter_mce: filtering bogus hw error on Cezanne A0
So the patch seems to work fine:

Tested-by: Bert Karwatzki <spasswolf@xxxxxx>

Bert Karwatzki
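
P.S. For anyone who wants to poke at the filter condition without booting a
test kernel, here is a minimal userspace sketch of the check the patch adds.
It is not the kernel code: struct sketch_cpu, sketch_filter_mce() and
SKETCH_MCI_STATUS_EN are hypothetical names made up for this illustration,
and the sketch assumes the Enabled bit is bit 60 of the MCA status value, as
MCI_STATUS_EN is defined in the kernel headers. The real amd_filter_mce()
goes on to other checks after this one; the sketch simply returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the Enabled bit is bit 60 of the MCA status value. */
#define SKETCH_MCI_STATUS_EN (1ULL << 60)

/* Hypothetical stand-in for the cpuinfo_x86 fields the filter reads. */
struct sketch_cpu {
	unsigned int family;
	unsigned int model;
	unsigned int stepping;
};

/*
 * Return true if the error should be dropped: only on family 0x19,
 * model 0x50, stepping 0x0, and only when the Enabled bit is clear.
 */
bool sketch_filter_mce(const struct sketch_cpu *c, uint64_t status)
{
	if (c->family == 0x19 && c->model == 0x50 && c->stepping == 0x0) {
		if (!(status & SKETCH_MCI_STATUS_EN))
			return true;
	}

	return false;
}

/* Small self-check covering the three interesting cases. */
void sketch_self_check(void)
{
	struct sketch_cpu cezanne_a0 = { 0x19, 0x50, 0x0 };
	struct sketch_cpu other = { 0x19, 0x50, 0x1 };

	/* Enabled bit clear on the affected part: filtered. */
	assert(sketch_filter_mce(&cezanne_a0, 0));
	/* Enabled bit set on the affected part: not filtered. */
	assert(!sketch_filter_mce(&cezanne_a0, SKETCH_MCI_STATUS_EN));
	/* Different stepping: not filtered, whatever the status says. */
	assert(!sketch_filter_mce(&other, 0));
}
```

Calling sketch_self_check() from a small main() is enough to run it. The
point of keying on the Enabled bit rather than on a specific status value is
that the filter stays narrow: errors logged with reporting enabled still get
through on the affected stepping.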