Re: [RFC] AMD VM crashing on deferred memory error injection

From: William Roche

Date: Tue Feb 10 2026 - 20:43:53 EST


On 2/9/26 22:18, Yazen Ghannam wrote:
On Mon, Feb 09, 2026 at 04:08:19PM -0500, Yazen Ghannam wrote:
On Mon, Feb 09, 2026 at 05:36:32PM +0100, William Roche wrote:

[...]

In my opinion, this small kernel fix relies too much on a QEMU
AMD-specific implementation detail.

Would you have a more appropriate fix to suggest, please?

Thanks in advance for your feedback.
William.

Thanks William for the report and details.

Clearing "STATUS" registers is a normal part of MCA handling.

We seem to allow clearing the regular "MCi_STATUS" register. I assume
this gets trapped/ignored by the hypervisor.

I expect we need the same behavior for the "MCA_DESTAT" register.

I'll do some research here, but please do share any pointers you may
have.

Yazen, I'm simply trying to find an answer in the AMD64 Architecture Programmer's Manual, Volume 2: System Programming (publication 24593).

This document indicates (in section 9.3.3.4, MCA Deferred Error Status Register) that:
"When the deferred error has been processed by the deferred error handler, MCA_DESTAT should be
cleared. If MCA_STATUS also contains a deferred error, MCA_STATUS should be cleared."

So I would imagine that the platform has to allow (or ignore) a reset of MCA_DESTAT, the same way it does for MCA_STATUS.
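
To make the quoted rule concrete, here is a minimal user-space sketch of the clearing sequence as I read it. It is not the kernel's handler: struct smca_bank_model and finish_deferred_error() are hypothetical names, and the two registers are modeled as plain variables rather than MSRs. The bit positions (Val = bit 63, Deferred = bit 44) are the ones Linux uses for MCI_STATUS_VAL and MCI_STATUS_DEFERRED.

```c
#include <assert.h>   /* only needed for the usage checks mentioned below */
#include <stdint.h>

#define MCI_STATUS_VAL      (1ULL << 63)  /* register contents are valid */
#define MCI_STATUS_DEFERRED (1ULL << 44)  /* logged error is a deferred error */

/* Hypothetical model of one SMCA bank: registers as plain variables. */
struct smca_bank_model {
	uint64_t status;  /* stands in for MCA_STATUS */
	uint64_t destat;  /* stands in for MCA_DESTAT */
};

/*
 * Called once the deferred error has been processed:
 * MCA_DESTAT is always cleared, MCA_STATUS only if it also
 * holds a valid deferred error.
 */
static void finish_deferred_error(struct smca_bank_model *bank)
{
	bank->destat = 0;

	if ((bank->status & MCI_STATUS_VAL) &&
	    (bank->status & MCI_STATUS_DEFERRED))
		bank->status = 0;
}
```

With both registers holding a valid deferred error, both end up zero after the call. If MCA_STATUS holds some other valid error, only MCA_DESTAT is cleared. In a guest, each of those two clears is a WRMSR of 0, which is exactly the write the hypervisor has to tolerate.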


Sorry for the rapid reply, but I think this is where we need an update.

Linux:
arch/x86/kvm/x86.c : set_msr_mce()

Please note the comment:
"All CPUs allow writing 0 to MCi_STATUS MSRs to clear the MSR."

We should include the MCA_DESTAT register range here.

What do you think?

But before trying to update the set_msr_mce() function, I don't think
that KVM keeps track of an MSR_AMD64_SMCA_MCx_DESTAT set of registers.
I can see mce_banks (for ctl, status, addr and misc) and mci_ctl2_banks
locations in struct kvm_vcpu_arch, but I don't see a location for SMCA
banks like MCA_DESTAT MSRs.

So would it be a valid solution to make KVM ignore this write instead
of raising a #GP?
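
To illustrate what I mean, here is a rough user-space model of that policy, not a patch against set_msr_mce(). The names is_smca_destat(), model_wrmsr_destat() and enum wr_result are hypothetical. The MSR numbers are the ones from arch/x86/include/asm/msr-index.h (MSR_AMD64_SMCA_MC0_DESTAT = 0xc0002008, one bank every 0x10). The model only covers MCA_DESTAT: a guest write of 0 to an in-range bank is accepted and dropped because there is no backing storage, and everything else is treated as a #GP.

```c
#include <assert.h>   /* only needed for the usage checks mentioned below */
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in arch/x86/include/asm/msr-index.h */
#define MSR_AMD64_SMCA_MC0_DESTAT 0xc0002008u
#define SMCA_BANK_STRIDE          0x10u  /* SMCA MSR block size per bank */

/* Hypothetical result codes: 0 = accepted, 1 = inject #GP (KVM convention). */
enum wr_result { WR_OK = 0, WR_GP = 1 };

/* True if msr is MCA_DESTAT of a bank below bank_num. */
static bool is_smca_destat(uint32_t msr, unsigned int bank_num)
{
	uint32_t off;

	if (msr < MSR_AMD64_SMCA_MC0_DESTAT)
		return false;

	off = msr - MSR_AMD64_SMCA_MC0_DESTAT;
	return (off % SMCA_BANK_STRIDE) == 0 &&
	       (off / SMCA_BANK_STRIDE) < bank_num;
}

/*
 * Model of the proposed policy for a guest WRMSR:
 * writing 0 to MCA_DESTAT is silently ignored (nothing is stored),
 * any other value, bank or register in this model raises #GP.
 */
static enum wr_result model_wrmsr_destat(uint32_t msr, uint64_t data,
					 unsigned int bank_num)
{
	if (!is_smca_destat(msr, bank_num))
		return WR_GP;

	if (data != 0)
		return WR_GP;

	return WR_OK;  /* write dropped: no per-vCPU DESTAT state exists */
}
```

For a 4-bank guest, a write of 0 to 0xc0002008 (bank 0) or 0xc0002018 (bank 1) returns WR_OK, while a non-zero write, a write to bank 4 (0xc0002048), or a write to the neighbouring 0xc0002009 returns WR_GP. Whether non-zero writes should really fault, and whether reads should return 0, are open questions this sketch does not settle.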

Thanks,
William.