Re: spurious (?) mce Hardware Error messages in v6.19
From: Yazen Ghannam
Date: Mon Feb 16 2026 - 15:26:14 EST
On Thu, Feb 12, 2026 at 01:50:05PM +0100, Bert Karwatzki wrote:
> I couldn't test this patch as I was busy figuring out this:
> 243b467dea17 Revert "drm/amd: Check if ASPM is enabled from PCIe subsystem"
> but with this done I could do some testing on v6.19. The periodic bogus mce
> errors are gone because smca_should_log_poll_error() usually returns false, but
> I still get some error messages for which I'm not sure if they are real errors.
>
> I monitored smca_should_log_poll_error() like this (in v6.19 (errors do not occur in v6.18)):
>
> static bool smca_should_log_poll_error(struct mce *m)
> {
> 	if (m->status & MCI_STATUS_VAL) {
> 		printk(KERN_INFO "%s: 0\n", __func__);
> 		return true;
> 	}
>
> 	m->status = mce_rdmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank));
> 	if ((m->status & MCI_STATUS_VAL) && (m->status & MCI_STATUS_DEFERRED)) {
> 		printk(KERN_INFO "%s: 1\n", __func__);
> 		m->kflags |= MCE_CHECK_DFR_REGS;
> 		return true;
> 	}
>
> 	printk(KERN_INFO "%s: 2\n", __func__);
> 	return false;
> }
>
> And get these error messages (usually just once or twice per boot)
>
> Examples from v6.19:
> $ grep -aE "Hardware Error|smca_should_log_poll_error: 1" /var/log/kern.log
>
> 2026-02-10T16:15:01.001203+01:00 lisa kernel: [ C0] smca_should_log_poll_error: 1
> 2026-02-10T16:15:01.001815+01:00 lisa kernel: [T45426] mce: [Hardware Error]: Machine check events logged
> 2026-02-10T16:15:01.001818+01:00 lisa kernel: [T45426] [Hardware Error]: Deferred error, no action required.
> 2026-02-10T16:15:01.001819+01:00 lisa kernel: [T45426] [Hardware Error]: CPU:0 (19:50:0) MC14_STATUS[-|-|-|AddrV|PCC|-|-|Deferred|-|-]: 0x8700900800000000
> 2026-02-10T16:15:01.001821+01:00 lisa kernel: [T45426] [Hardware Error]: Error Addr: 0x01b3877c00000020
> 2026-02-10T16:15:01.001822+01:00 lisa kernel: [T45426] [Hardware Error]: IPID: 0x000700b040000000
> 2026-02-10T16:15:01.001831+01:00 lisa kernel: [T45426] [Hardware Error]: L3 Cache Ext. Error Code: 0
> 2026-02-10T16:15:01.001832+01:00 lisa kernel: [T45426] [Hardware Error]: cache level: RESV, tx: INSN
>
> 2026-02-11T14:24:13.358353+01:00 lisa kernel: [ C0] smca_should_log_poll_error: 1
> 2026-02-11T14:24:13.358832+01:00 lisa kernel: [T310371] mce: [Hardware Error]: Machine check events logged
> 2026-02-11T14:24:13.361773+01:00 lisa kernel: [T310371] [Hardware Error]: Deferred error, no action required.
> 2026-02-11T14:24:13.361778+01:00 lisa kernel: [T310371] [Hardware Error]: CPU:0 (19:50:0) MC11_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0x8424b0c8009d011e
> 2026-02-11T14:24:13.361781+01:00 lisa kernel: [T310371] [Hardware Error]: Error Addr: 0x01f8a43400000020
> 2026-02-11T14:24:13.361782+01:00 lisa kernel: [T310371] [Hardware Error]: IPID: 0x000700b040000000, Syndrome: 0x0000000000000042
> 2026-02-11T14:24:13.361787+01:00 lisa kernel: [T310371] [Hardware Error]: L3 Cache Ext. Error Code: 29
> 2026-02-11T14:24:13.361788+01:00 lisa kernel: [T310371] [Hardware Error]: cache level: L2, tx: RESV, mem-tx: RD
>
> 2026-02-12T10:07:28.804529+01:00 lisa kernel: [ C0] smca_should_log_poll_error: 1
> 2026-02-12T10:07:28.805020+01:00 lisa kernel: [T393396] mce: [Hardware Error]: Machine check events logged
> 2026-02-12T10:07:28.805028+01:00 lisa kernel: [T393396] [Hardware Error]: Deferred error, no action required.
> 2026-02-12T10:07:28.805029+01:00 lisa kernel: [T393396] [Hardware Error]: CPU:0 (19:50:0) MC11_STATUS[-|-|-|AddrV|PCC|-|-|Deferred|-|-]: 0x8700900800000000
> 2026-02-12T10:07:28.805030+01:00 lisa kernel: [T393396] [Hardware Error]: Error Addr: 0x01300a9d00000020
> 2026-02-12T10:07:28.805031+01:00 lisa kernel: [T393396] [Hardware Error]: IPID: 0x000700b040000000
> 2026-02-12T10:07:28.805033+01:00 lisa kernel: [T393396] [Hardware Error]: L3 Cache Ext. Error Code: 0
> 2026-02-12T10:07:28.805034+01:00 lisa kernel: [T393396] [Hardware Error]: cache level: RESV, tx: INSN

The first one and the third one are definitely bogus.

This is evident because the "PCC" (Processor Context Corrupt) bit is
set. A real error with PCC set would result in a machine check
exception, and the kernel would panic.

The second one seems mostly valid. However, a deferred error would
normally cause a deferred error interrupt, and in this case it was
found through timer polling. The similarity with the others makes it
suspect too.

I think we should filter these out. You can ignore these for now, if
they aren't regularly occurring like before.
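
For illustration only, here is a minimal userspace sketch (not kernel
code, and not a proposed patch) that decodes a few MCA_STATUS bits from
the raw values in the logs above. The bit positions follow
arch/x86/include/asm/mce.h; decode_status() is a hypothetical helper
name, and only the bits relevant to this thread are listed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit positions as defined in arch/x86/include/asm/mce.h. */
#define MCI_STATUS_VAL      (1ULL << 63)
#define MCI_STATUS_ADDRV    (1ULL << 58)
#define MCI_STATUS_PCC      (1ULL << 57)
#define MCI_STATUS_SYNDV    (1ULL << 53)
#define MCI_STATUS_UECC     (1ULL << 45)
#define MCI_STATUS_DEFERRED (1ULL << 44)

/*
 * Hypothetical helper: write the names of the set flags into buf,
 * separated by '|', and return buf. Only a subset of bits is decoded.
 */
const char *decode_status(uint64_t status, char *buf, size_t len)
{
	static const struct {
		uint64_t bit;
		const char *name;
	} flags[] = {
		{ MCI_STATUS_VAL,      "Val" },
		{ MCI_STATUS_ADDRV,    "AddrV" },
		{ MCI_STATUS_PCC,      "PCC" },
		{ MCI_STATUS_SYNDV,    "SyndV" },
		{ MCI_STATUS_UECC,     "UECC" },
		{ MCI_STATUS_DEFERRED, "Deferred" },
	};
	size_t i;

	if (len == 0)
		return buf;

	buf[0] = '\0';
	for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
		if (!(status & flags[i].bit))
			continue;
		/* Add a separator before every flag except the first. */
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, flags[i].name, len - strlen(buf) - 1);
	}
	return buf;
}
```

Passing 0x8700900800000000 shows PCC and Deferred set together, which
is the combination called out above. Passing 0x8424b0c8009d011e shows
Deferred without PCC.
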
>
> Are the "Error Addr" reported here supposed to be physical addresses of memory?
> If they are they don't seem to make sense to me given the following output of
> "cat /proc/iomem":
>

The "Error Addr" is the value of the MCA_ADDR register. This register is
formatted based on what the bank represents and the error code. In this
case, you have an "L3 cache" error, so the address is in some
implementation-specific format with set, way, index, etc. But I wouldn't
give much attention to this, since the errors are bogus.

Thanks for following up on this topic. I'll see about a filtering
mechanism. My first thought is to sanity check the status bits, etc.,
and filter anything that isn't consistent with the architecture. And we
can have an option to remove this filtering for those who want all the
data for doing hardware checkout.
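
A rough userspace sketch of the shape such a check could take, purely
as an illustration. The single rule used here (a deferred error that
also has PCC set is treated as inconsistent) is an assumption taken
from the reasoning earlier in this mail. The names status_is_plausible()
and should_log_polled_error(), and the log_all opt-out parameter, are
hypothetical. A real filter would likely check more than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as defined in arch/x86/include/asm/mce.h. */
#define MCI_STATUS_VAL      (1ULL << 63)
#define MCI_STATUS_PCC      (1ULL << 57)
#define MCI_STATUS_DEFERRED (1ULL << 44)

/*
 * Hypothetical sanity check. Assumed rule: a deferred error should
 * not also report PCC, since a PCC error would be delivered as a
 * machine check exception rather than found by polling.
 */
bool status_is_plausible(uint64_t status)
{
	if (!(status & MCI_STATUS_VAL))
		return false;

	if ((status & MCI_STATUS_DEFERRED) && (status & MCI_STATUS_PCC))
		return false;

	return true;
}

/*
 * Hypothetical wrapper: log_all stands in for an option that turns
 * the filtering off, e.g. for hardware checkout.
 */
bool should_log_polled_error(uint64_t status, bool log_all)
{
	if (log_all)
		return true;

	return status_is_plausible(status);
}
```

With this rule, 0x8700900800000000 is dropped unless log_all is set,
while 0x8424b0c8009d011e is still logged.
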
Thanks,
Yazen