Re: AMD SMCA: L3 cache scrubber deferred errors flood logs on Zen 3 after 6.19 polling changes

From: Borislav Petkov

Date: Fri Apr 03 2026 - 10:10:36 EST


On Tue, Mar 31, 2026 at 01:12:21AM +0200, Borislav Petkov wrote:
> Just to let you know that we're looking into it - it simply takes a while
> until we figure out what we wanna do here exactly.
>
> So thanks for the patience.

This should fix it:

---

From: Yazen Ghannam <yazen.ghannam@xxxxxxx>
Date: Sat, 28 Feb 2026 09:08:14 -0500
Subject: [PATCH] x86/mce/amd: Filter bogus hardware errors on Zen3 clients
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Users have been observing multiple L3 cache deferred errors after the
recent kernel rework of deferred error handling.¹ ⁴

The errors are bogus due to inconsistent status values. Also, a user
verified that bogus MCA_DESTAT values are present on the system even with an
older kernel.²

The errors seem to be garbage values present in the MCA_DESTAT of some L3
cache banks. These were implicitly ignored before the recent kernel rework
because these do not generate a deferred error interrupt.

A later revision of the rework patch was merged for v6.19. This naturally
filtered out most of the bogus error logs. However, a few signatures still
remain.³

Minimize the scope of the filter to the reported CPU
family/model/stepping and only to errors which don't have the Enabled
bit set in the MCi status MSR.

¹ https://lore.kernel.org/20250915010010.3547-1-spasswolf@xxxxxx
² https://lore.kernel.org/6e1eda7dd55f6fa30405edf7b0f75695cf55b237.camel@xxxxxx
³ https://lore.kernel.org/21ba47fa8893b33b94370c2a42e5084cf0d2e975.camel@xxxxxx
⁴ https://lore.kernel.org/r/CAKFB093B2k3sKsGJ_QNX1jVQsaXVFyy=wNwpzCGLOXa_vSDwXw@xxxxxxxxxxxxxx

[ bp: Generalize the condition according to which errors are bogus. ]

Fixes: 7cb735d7c0cb ("x86/mce: Unify AMD DFR handler with MCA Polling")
Closes: https://lore.kernel.org/20250915010010.3547-1-spasswolf@xxxxxx
Reported-by: Bert Karwatzki <spasswolf@xxxxxx>
Signed-off-by: Yazen Ghannam <yazen.ghannam@xxxxxxx>
Signed-off-by: Borislav Petkov (AMD) <bp@xxxxxxxxx>
Reviewed-by: Mario Limonciello <mario.limonciello@xxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Link: https://lore.kernel.org/20250915010010.3547-1-spasswolf@xxxxxx
---
 arch/x86/kernel/cpu/mce/amd.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 146f4207a863..7fc78759cd4e 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -606,6 +606,14 @@ bool amd_filter_mce(struct mce *m)
 	enum smca_bank_types bank_type = smca_get_bank_type(m->extcpu, m->bank);
 	struct cpuinfo_x86 *c = &boot_cpu_data;
 
+	/* Bogus hw errors on Cezanne A0. */
+	if (c->x86 == 0x19 &&
+	    c->x86_model == 0x50 &&
+	    c->x86_stepping == 0x0) {
+		if (!(m->status & MCI_STATUS_EN))
+			return true;
+	}
+
 	/* See Family 17h Models 10h-2Fh Erratum #1114. */
 	if (c->x86 == 0x17 &&
 	    c->x86_model >= 0x10 && c->x86_model <= 0x2F &&
--
2.51.0
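
For anyone who wants to check offline whether a logged record would hit the
new filter, here is a rough userspace sketch of the same condition. It is not
kernel code: the sketch_* names and the struct are made up for illustration,
and the CPUID decoding assumes the usual AMD layout of CPUID Fn0000_0001 EAX
(extended family/model fields applied when the base family is 0xF). The only
thing taken from the kernel is the position of the Enabled bit, bit 60 of the
MCi status value, which is what MCI_STATUS_EN stands for.

```c
#include <stdbool.h>
#include <stdint.h>

/* MCi_STATUS.En is bit 60, the same bit the kernel names MCI_STATUS_EN. */
#define SKETCH_MCI_STATUS_EN (1ULL << 60)

/* Hypothetical container for the decoded CPUID Fn0000_0001 EAX fields. */
struct sketch_cpu_sig {
	unsigned int family;
	unsigned int model;
	unsigned int stepping;
};

/* Decode a raw EAX signature into family/model/stepping. */
static struct sketch_cpu_sig sketch_decode_sig(uint32_t eax)
{
	struct sketch_cpu_sig s;
	unsigned int base_family = (eax >> 8) & 0xf;
	unsigned int base_model = (eax >> 4) & 0xf;

	s.family = base_family;
	s.model = base_model;
	s.stepping = eax & 0xf;

	/* Assumption: extended fields only count when base family is 0xF. */
	if (base_family == 0xf) {
		s.family += (eax >> 20) & 0xff;
		s.model |= ((eax >> 16) & 0xf) << 4;
	}

	return s;
}

/*
 * Mirror of the condition added to amd_filter_mce(): on family 0x19,
 * model 0x50, stepping 0x0, a record without the Enabled bit is bogus.
 */
static bool sketch_is_bogus(struct sketch_cpu_sig s, uint64_t status)
{
	if (s.family != 0x19 || s.model != 0x50 || s.stepping != 0x0)
		return false;

	return !(status & SKETCH_MCI_STATUS_EN);
}
```

Feed sketch_decode_sig() the EAX value from CPUID leaf 1 and
sketch_is_bogus() the status field from a logged record. A raw signature of
0x00A50F00 decodes to family 0x19, model 0x50, stepping 0x0, so a status
value with bit 60 clear on such a part would be dropped by the filter.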

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette