Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling

From: William Roche

Date: Mon Mar 16 2026 - 11:34:29 EST

On 3/13/26 21:26, Yazen Ghannam wrote:

On Thu, Mar 12, 2026 at 11:44:04PM +0100, William Roche wrote:

[...]

Yazen may help us on this aspect: Could you please let us know if there is
an AMD specification for accessing SMCA registers on non-SMCA machines?

Now if we had a valid case of existing non-SMCA AMD hardware that could
crash on updating an SMCA register, the fix would be needed not only for the
VM case.

Yazen, could you also please tell us if any existing non-SMCA AMD hardware
could crash on updating an SMCA register?

All the systems I have access to are Zen systems, and all Zen systems
are SMCA systems. I'll try to find an older system to test (Bulldozer,
etc.).

I don't think that is needed anymore if bare metal never handles AO errors
this way (as discussed below).
It looks to me like the QEMU/KVM VM case is a specific one, exposed by your
new change.

[...]

I have a procedure to verify the behavior: It consists of running the
upstream kernel in a VM (on an AMD platform) and injecting a memory error
from the hardware platform to this VM to mimic a real hardware error being
reported to the platform kernel.

To do so:
Run QEMU as root (to help with the address translation).
The VM runs the upstream kernel.
Run the small attached program in the VM as root, so that it gives the guest
physical address of one of its mapped memory pages.

[root@VM]# ./mce_process_react_x86
Setting Early kill... Ok

Data pages at 0xXXXXXXX physically 0xYYYYY000

-> DON'T Press enter ! (just leave the process wait here)
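
For readers without the attachment, here is a minimal sketch of what such a
helper might do. It is not the attached program: the function names are
hypothetical, and it assumes the usual /proc/self/pagemap entry layout
(PFN in bits 0-54, "present" in bit 63) and the PR_MCE_KILL prctl to request
early kill. It must run as root, otherwise the kernel reports the PFN as zero.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed /proc/<pid>/pagemap entry layout: bit 63 = present, bits 0-54 = PFN. */
#define PAGEMAP_PRESENT  (1ULL << 63)
#define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)

/* Decode one pagemap entry; returns 0 if the page is not present. */
uint64_t pagemap_entry_to_phys(uint64_t entry, uint64_t vaddr,
                               uint64_t page_size)
{
    if (!(entry & PAGEMAP_PRESENT))
        return 0;
    return (entry & PAGEMAP_PFN_MASK) * page_size + (vaddr % page_size);
}

/* Hypothetical helper: physical address backing vaddr, or 0 on failure. */
uint64_t virt_to_phys(const void *vaddr)
{
    uint64_t entry = 0;
    uint64_t page_size = (uint64_t)sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0)
        return 0;

    /* One 64-bit entry per virtual page, indexed by virtual page number. */
    off_t off = (off_t)(((uintptr_t)vaddr / page_size) * sizeof(entry));
    ssize_t n = pread(fd, &entry, sizeof(entry), off);
    close(fd);

    if (n != (ssize_t)sizeof(entry))
        return 0;
    return pagemap_entry_to_phys(entry, (uintptr_t)vaddr, page_size);
}

/* Request SIGBUS as soon as the kernel learns the page is poisoned. */
int set_early_kill(void)
{
    return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
}
```

Inside the VM, the address returned by virt_to_phys() is a guest physical
address, which is the value to pass to gpa2hpa in the next step.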

Ask the emulator (QEMU in this case) to give the host physical address of
the guest physical page:
(qemu) gpa2hpa 0xYYYYY000
Host physical address for 0xYYYYY000 (pc.ram) is 0xPFN000

From the host physical address get the pfn value to poison (drop the last
3 hex zeros of the address, i.e. shift it right by 12 bits).
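
As a small illustration of that conversion, assuming 4 KiB base pages
(the macro and function names here are only for the example):

```c
#include <stdint.h>

#define PAGE_SHIFT_4K 12 /* assumption: 4 KiB base pages, as on x86 */

/* Page frame number for a host physical address. */
uint64_t hpa_to_pfn(uint64_t hpa)
{
    return hpa >> PAGE_SHIFT_4K;
}
```

For a page-aligned address this is the same as removing the three trailing
hex zeros by hand.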

On the host, use hwpoison kernel module:
[root@host]# modprobe hwpoison_inject

and inject an error to the targeted pfn:
[root@host]# echo 0xPFN > /sys/kernel/debug/hwpoison/corrupt-pfn

Then wait until the generated asynchronous error reaches the VM (it can take
up to 5 minutes on AMD virtualization) to see the VM kernel deal with it.

...hint for below question.

Without this suggested fix, the VM kernel panics, with the stack trace I
gave:

mce: MSR access error: WRMSR to 0xc0002098 (tried to write
0x0000000000000000)
at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)

amd_clear_bank+0x6e/0x70
machine_check_poll+0x228/0x2e0
? __pfx_mce_timer_fn+0x10/0x10
mce_timer_fn+0xb1/0x130
? __pfx_mce_timer_fn+0x10/0x10
call_timer_fn+0x26/0x120
__run_timers+0x202/0x290
run_timer_softirq+0x49/0x100
handle_softirqs+0xeb/0x2c0
__irq_exit_rcu+0xda/0x100
sysvec_apic_timer_interrupt+0x71/0x90
[...]
Kernel panic - not syncing: MCA architectural violation!
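
As a side note, the MSR number in the trace can be decoded by hand. The sketch
below assumes the SMCA layout from the kernel's msr-index.h (bank 0 starting
at 0xc0002000, 0x10 registers per bank, MCA_DESTAT at offset 0x8); the helper
and stride names are made up for the example.

```c
#include <stdint.h>

#define MSR_AMD64_SMCA_MC0_CTL 0xc0002000u
#define SMCA_BANK_STRIDE       0x10u /* hypothetical name: MSRs per bank */
#define SMCA_DESTAT_OFFSET     0x8u  /* MC0_DESTAT - MC0_CTL */

/* MCA bank that an SMCA MSR number belongs to. */
unsigned int smca_msr_bank(uint32_t msr)
{
    return (msr - MSR_AMD64_SMCA_MC0_CTL) / SMCA_BANK_STRIDE;
}

/* Register offset within that bank. */
unsigned int smca_msr_offset(uint32_t msr)
{
    return (msr - MSR_AMD64_SMCA_MC0_CTL) % SMCA_BANK_STRIDE;
}
```

Under these assumptions 0xc0002098 decodes to bank 9, offset 0x8, i.e. the
MCA_DESTAT register of bank 9, which would be consistent with
amd_clear_bank() clearing a deferred error status.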

The code flow indicates that a Deferred error was found by MCA polling.

This is right.

I thought QEMU injects a #MC into the guest?

The way AO error handling has been integrated into QEMU/KVM for the AMD VM
case relies on machine_check_poll().

William, do you encounter the issue if you disable MCA polling in the
guest?

If I disable machine check polling (with the mce=ignore_ce kernel option, for
example), the AO error is no longer seen in the VM, and of course we don't
crash because of it.

To my knowledge, Deferred errors are reported starting with Zen/SMCA
systems, even though the concept is found in older documentation. This
is another reason for the implicit handling.

I see in QEMU we set the DEFERRED status bit for BUS_MCEERR_AO errors. I
don't recall why we did that. I'll need to review the old threads.

I feel like the intent was to select bits to produce the desired outcome
rather than faithfully replicate hardware behavior. Specifically, the
DEFERRED status bit would prevent CE filtering condition in
do_machine_check(). And it would trigger the AO flow in the guest rather
than the AR flow if we set the UC status bit.

Another example is we use the POISON status bit so the address is marked
as "usable". A real DEFERRED error would never have the POISON status
bit; they are mutually exclusive by definition.

That's the QEMU/KVM choice that was made about 2 years ago, as explained in
the following comment from the *QEMU* fix:
4b77512b2782 i386: Fix MCE support for AMD hosts
target/i386/kvm/kvm.c function kvm_mce_inject():

/* Setting the POISON bit for deferred errors indicates to the
* guest kernel that the address provided by the MCE is valid
* and usable which will ensure that the guest kernel will send
* a SIGBUS_AO signal to the guest process. This allows for
* more desirable behavior in the case that the guest process
* with poisoned memory has set the MCE_KILL_EARLY prctl flag
* which indicates that the process would prefer to handle or
* shutdown due to the poisoned memory condition before the
* memory has been accessed.
*
* While the POISON bit would not be set in a deferred error
* sent from hardware, the bit is not meaningful for deferred
* errors and can be reused in this scenario.
*/
status |= MCI_STATUS_DEFERRED | MCI_STATUS_POISON;
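
To make the bit combination concrete, here is a small sketch of a check for
it. The bit positions are the ones I believe both QEMU and Linux use for
MCA_STATUS (DEFERRED = bit 44, POISON = bit 43); the helper name is
hypothetical and is not part of either code base.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCI_STATUS_DEFERRED (1ULL << 44) /* deferred error */
#define MCI_STATUS_POISON   (1ULL << 43) /* access to poisoned data */

/*
 * True when both bits are set. Per the discussion above, real hardware
 * would not report this combination, so it hints at an injected AO error.
 */
bool has_deferred_and_poison(uint64_t status)
{
    uint64_t both = MCI_STATUS_DEFERRED | MCI_STATUS_POISON;

    return (status & both) == both;
}
```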

But there may be another hidden issue: handling the error through
polling rather than #MC. I'm thinking this isn't intentional, and the
recent Linux changes exposed this behavior.

You are right about "recent Linux changes exposed this behavior", but handling AO this way was intentional.

With the suggested fix, we should cover this newly exposed failure case.

Now if we have a better way to deal with AO error handling on AMD VMs, it
could be the subject of a separate thread (probably a QEMU thread).
Our current suggested kernel fix would still be valid, even if the code may
not be exercised in the bare-metal case.

Thanks,
Yazen

Thank you very much, Yazen, for your help!

Cheers,
William.