Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling

From: William Roche

Date: Thu Mar 12 2026 - 11:19:16 EST


Thank you for your reply,


On 3/12/26 15:42, Borislav Petkov wrote:
On Wed, Feb 18, 2026 at 04:30:25PM +0000, William Roche wrote:
From: William Roche <william.roche@xxxxxxxxxx>

A non Scalable MCA system may prevent access to SMCA specific registers

"may prevent"?

Please explain in the commit message the whole scenario how you're triggering
this in detail.


From the kernel's point of view (regardless of whether it is running on bare metal or in a VM), access to these registers is provided by the platform: either the hardware or the emulation framework.

Yazen indicated on Feb 12 that "AMD systems generally have a Read-as-Zero/Writes-Ignored behavior when accessing unimplemented MCA registers", but you rightly indicated on Feb 9 that "KVM works as advertized" and so prevents access to unimplemented SMCA specific registers. That's the reason why I had to say "may".

This access crashes on AMD VMs and "may" work on AMD hardware according to Yazen.
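
To illustrate the difference between the two platform behaviors, here is a minimal user-space sketch. It is not kernel code: the struct, the enum and all function names are hypothetical, the bit value used for the deferred status flag is only illustrative, and the "fault" is just a counter standing in for the failed MSR access. It only models the idea that an unguarded MCA_DESTAT write is harmless on a Read-as-Zero/Writes-Ignored platform but faults on a platform that rejects unimplemented registers, while a check of the SMCA feature avoids the access entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of how a platform treats unimplemented MCA registers. */
enum platform {
	PLAT_RAZ_WI,	/* reads as zero, writes ignored */
	PLAT_FAULTING	/* access to unimplemented registers is rejected */
};

#define NUM_BANKS	4
/* Illustrative stand-in for the kernel's MCI_STATUS_DEFERRED flag. */
#define STATUS_DEFERRED	(1ULL << 44)

struct model {
	enum platform plat;
	bool smca;			/* stand-in for mce_flags.smca */
	uint64_t destat[NUM_BANKS];	/* only meaningful when smca is set */
	unsigned int faults;		/* number of rejected accesses */
};

/* Model a write to a bank's MCA_DESTAT; returns false if it faulted. */
static bool write_destat(struct model *m, unsigned int bank, uint64_t val)
{
	assert(bank < NUM_BANKS);

	if (m->smca) {
		m->destat[bank] = val;
		return true;
	}
	if (m->plat == PLAT_RAZ_WI)
		return true;	/* write silently ignored */

	m->faults++;		/* the platform rejects the access */
	return false;
}

/* Clears MCA_DESTAT for deferred errors without checking for SMCA. */
static void clear_bank_unguarded(struct model *m, unsigned int bank,
				 uint64_t status)
{
	if (status & STATUS_DEFERRED)
		write_destat(m, bank, 0);
}

/* Same, but only touches MCA_DESTAT when the SMCA feature is present. */
static void clear_bank_guarded(struct model *m, unsigned int bank,
			       uint64_t status)
{
	if (m->smca && (status & STATUS_DEFERRED))
		write_destat(m, bank, 0);
}
```

In this model, the unguarded variant records a fault on a non-SMCA platform that rejects the access, and the guarded variant never reaches the register on such a platform.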


like MCA_DESTAT. This is the case of QEMU/KVM VMs, where the kernel
has to check for the SMCA feature before accessing MCA_DESTAT.

Fixes: 7cb735d7c0cb ("x86/mce: Unify AMD DFR handler with MCA Polling")
Signed-off-by: William Roche <william.roche@xxxxxxxxxx>
Reviewed-by: Yazen Ghannam <yazen.ghannam@xxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx

AFAIR, you're injecting errors. This is not really a critical fix that
warrants this going to stable.

Errors are injected into VMs by the hypervisor when real memory hardware errors occur on the system and impact the VM's address space.
This is not only a test scenario; it is a real-life mechanism. Since commit 7cb735d7c0cb was integrated, a VM kernel running on AMD crashes on deferred errors, whereas it was able to deal with them before this commit.
That's the reason why we need this additional fix.


---
arch/x86/kernel/cpu/mce/amd.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 3f1dda355307..7b9932f13bca 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -875,13 +875,18 @@ void amd_clear_bank(struct mce *m)
{
amd_reset_thr_limit(m->bank);
- /* Clear MCA_DESTAT for all deferred errors even those logged in MCA_STATUS. */
- if (m->status & MCI_STATUS_DEFERRED)
- mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
+ if (mce_flags.smca) {

All this code should not run in a VM. So why does it?

Why do you say that this code should not run in a VM?
The error injection mechanism has been in use with QEMU/KVM for several years.
I must be missing something here. Please let me know.


What is the use case we're supposed to support here?


Dealing with real-life deferred memory errors impacting a VM's address space.

I hope this clarifies the need for this new kernel fix.

Thanks again,
William.