Re: [RFC] AMD VM crashing on deferred memory error injection
From: William Roche
Date: Tue Feb 10 2026 - 20:43:53 EST
On 2/9/26 22:18, Yazen Ghannam wrote:
On Mon, Feb 09, 2026 at 04:08:19PM -0500, Yazen Ghannam wrote:
On Mon, Feb 09, 2026 at 05:36:32PM +0100, William Roche wrote:
[...]
In my view, this small kernel fix relies too much on a QEMU AMD-specific
implementation detail.
Would you have a more appropriate fix to suggest?
Thanks in advance for your feedback.
William.
Thanks William for the report and details.
Clearing "STATUS" registers is a normal part of MCA handling.
We seem to allow clearing the regular "MCi_STATUS" register. I assume
this gets trapped/ignored by the hypervisor.
I expect we need to do the same behavior for the "MCA_DESTAT" register.
I'll do some research here, but please do share any pointers you may
have.
Yazen, I'm simply trying to find an answer in the AMD64 Architecture Programmer's Manual, Volume 2: System Programming (publication 24593).
This document indicates (in section 9.3.3.4, MCA Deferred Error Status Register) that:
"When the deferred error has been processed by the deferred error handler, MCA_DESTAT should be
cleared. If MCA_STATUS also contains a deferred error, MCA_STATUS should be cleared."
So I would expect the platform to accept (or ignore) a reset of MCA_DESTAT the same way it does for MCA_STATUS.
Sorry for the rapid reply, but I think this is where we need an update.
Linux:
arch/x86/kvm/x86.c : set_msr_mce()
Please note the comment:
"All CPUs allow writing 0 to MCi_STATUS MSRs to clear the MSR."
We should include the MCA_DESTAT register range here.
What do you think?
But before we try to update the set_msr_mce() function: I don't think
that KVM keeps track of an MSR_AMD64_SMCA_MCx_DESTAT set of registers.
I can see mce_banks (for ctl, status, addr and misc) and mci_ctl2_banks
locations in struct kvm_vcpu_arch, but I don't see a location for SMCA
banks like MCA_DESTAT MSRs.
So if we make KVM ignore this write instead of raising a #GP error,
would that be a valid solution?
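
To make the question concrete, here is a rough user-space sketch of the
policy I have in mind. It is not KVM code and not a patch: the helper
names and the enum are hypothetical, the MSR values are what I believe
msr-index.h defines (0xc0002008 for MC0 DESTAT, stride 0x10 per bank) and
should be double-checked, and rejecting non-zero writes is only my
assumption, mirroring the "writing 0 to MCi_STATUS" comment quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match arch/x86/include/asm/msr-index.h; verify before use. */
#define MSR_AMD64_SMCA_MC0_DESTAT	0xc0002008u
#define SMCA_BANK_STRIDE		0x10u	/* MSR distance between banks */

/* Hypothetical outcome of a guest WRMSR, for illustration only. */
enum destat_wr_result {
	DESTAT_WR_NOT_DESTAT,	/* not an MCA_DESTAT MSR, handle elsewhere */
	DESTAT_WR_IGNORED,	/* write of 0 accepted and dropped */
	DESTAT_WR_GP,		/* non-zero write: would raise #GP */
};

/* True if msr is the MCA_DESTAT register of a bank below nr_banks. */
static inline bool is_smca_destat_msr(uint32_t msr, unsigned int nr_banks)
{
	uint32_t off;

	if (msr < MSR_AMD64_SMCA_MC0_DESTAT)
		return false;

	off = msr - MSR_AMD64_SMCA_MC0_DESTAT;
	return (off % SMCA_BANK_STRIDE) == 0 &&
	       (off / SMCA_BANK_STRIDE) < nr_banks;
}

/*
 * No per-vCPU storage exists for DESTAT in this sketch, so a clearing
 * write (data == 0) is simply dropped rather than recorded.
 */
static inline enum destat_wr_result
destat_write_policy(uint32_t msr, uint64_t data, unsigned int nr_banks)
{
	if (!is_smca_destat_msr(msr, nr_banks))
		return DESTAT_WR_NOT_DESTAT;

	return data == 0 ? DESTAT_WR_IGNORED : DESTAT_WR_GP;
}
```

With this, a guest clearing MC0 DESTAT (write of 0 to 0xc0002008) would be
silently accepted, while a write to the neighbouring DEADDR MSR
(0xc0002009) would not be matched at all and would follow whatever path it
takes today.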
Thanks,
William.