Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling

From: Yazen Ghannam

Date: Fri Mar 13 2026 - 16:28:23 EST

On Thu, Mar 12, 2026 at 11:44:04PM +0100, William Roche wrote:

[...]

>
> Yazen may help us on this aspect: Could you please let us know if there is
> an AMD specification for accessing SMCA registers on non SMCA machines ?
>
>
> Now if we had a valid case of an existing non-SMCA AMD hardware that could
> crash on updating an SMCA register, the fix would be needed not only for the
> VM case.
>
> Yazen, could you also please tell us if an existing non-SMCA AMD hardware
> could crash on updating an SMCA register ?
>

All the systems I have access to are Zen systems, and all Zen systems
are SMCA systems. I'll try to find an older system to test (Bulldozer,
etc.).

[...]

>
> I have a procedure to verify the behavior: It consists of running the
> upstream kernel in a VM (on an AMD platform) and injecting a memory error
> from the hardware platform to this VM to mimic a real hardware error being
> reported to the platform Kernel.
>
> To do so:
> Run Qemu as root (to help with the address translation).
> The VM runs the upstream kernel.
> Run the small attached program in the VM as root, so that it gives a guest
> physical address of one of its mapped memory page.
>
> [root@VM]# ./mce_process_react_x86
> Setting Early kill... Ok
>
> Data pages at 0xXXXXXXX physically 0xYYYYY000
>
> -> DON'T Press enter ! (just leave the process wait here)
>
> Ask the emulator (QEMU in this case) to give the host physical address of
> the guest physical page:
> (qemu) gpa2hpa 0xYYYYY000
> Host physical address for 0xYYYYY000 (pc.ram) is 0xPFN000
>
> From the host physical address get the pfn value (removing the last 3 zeros
> of the address) to poison.
>
> On the host, use hwpoison kernel module:
> [root@host]# modprobe hwpoison_inject
>
> and inject an error to the targeted pfn:
> [root@host]# echo 0xPFN > /sys/kernel/debug/hwpoison/corrupt-pfn
>
> Then wait until the Asynchronous error generated reaches the VM (it can take
> up to 5 minutes on AMD virtualization) to see the VM kernel deal with it.

...hint for the question below.

>
> Without this suggested fix, the VM kernel panics, with the stack trace I
> gave:
>
> mce: MSR access error: WRMSR to 0xc0002098 (tried to write
> 0x0000000000000000)
> at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)
>
> amd_clear_bank+0x6e/0x70
> machine_check_poll+0x228/0x2e0
> ? __pfx_mce_timer_fn+0x10/0x10
> mce_timer_fn+0xb1/0x130
> ? __pfx_mce_timer_fn+0x10/0x10
> call_timer_fn+0x26/0x120
> __run_timers+0x202/0x290
> run_timer_softirq+0x49/0x100
> handle_softirqs+0xeb/0x2c0
> __irq_exit_rcu+0xda/0x100
> sysvec_apic_timer_interrupt+0x71/0x90
> [...]
> Kernel panic - not syncing: MCA architectural violation!

The code flow indicates that a Deferred error was found by MCA polling.
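
For reference, here is a rough sketch of how I read the MSR number in the
trace. It assumes the SMCA layout used by the Linux msr-index.h definitions
(bank 0 starting at 0xc0002000, 0x10 registers per bank, DESTAT at offset
0x8). The helper names smca_decode() and smca_reg_name() are made up for
illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout, mirroring the MSR_AMD64_SMCA_MCx_* definitions in Linux. */
#define SMCA_MSR_BASE    0xc0002000u  /* MCA_CTL of bank 0 */
#define SMCA_BANK_STRIDE 0x10u        /* MSRs reserved per bank */

struct smca_msr {
	unsigned int bank;  /* MCA bank number */
	unsigned int reg;   /* register offset within the bank */
};

/* Hypothetical helper: map a register offset to its SMCA name. */
const char *smca_reg_name(unsigned int reg)
{
	switch (reg) {
	case 0x0: return "CTL";
	case 0x1: return "STATUS";
	case 0x2: return "ADDR";
	case 0x3: return "MISC0";
	case 0x4: return "CONFIG";
	case 0x5: return "IPID";
	case 0x6: return "SYND";
	case 0x8: return "DESTAT";
	case 0x9: return "DEADDR";
	default:  return "unknown";
	}
}

/* Hypothetical helper: split an SMCA MSR number into bank and offset. */
struct smca_msr smca_decode(uint32_t msr)
{
	struct smca_msr d;
	uint32_t off;

	assert(msr >= SMCA_MSR_BASE);  /* only meaningful in the SMCA range */
	off = msr - SMCA_MSR_BASE;
	d.bank = off / SMCA_BANK_STRIDE;
	d.reg = off % SMCA_BANK_STRIDE;
	return d;
}

/* Usage: decode the MSR from the reported WRMSR failure. */
void smca_decode_self_check(void)
{
	struct smca_msr d = smca_decode(0xc0002098u);

	assert(d.bank == 9);
	assert(strcmp(smca_reg_name(d.reg), "DESTAT") == 0);
}
```

Under those assumptions the failing write targets MCA_DESTAT of bank 9,
which matches amd_clear_bank() clearing a Deferred error status.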

I thought QEMU injects a #MC into the guest?

William, do you encounter the issue if you disable MCA polling in the
guest?

To my knowledge, Deferred errors are reported starting with Zen/SMCA
systems, even though the concept is found in older documentation. This
is another reason for the implicit handling.

I see in QEMU we set the DEFERRED status bit for BUS_MCEERR_AO errors. I
don't recall why we did that. I'll need to review the old threads.

I feel like the intent was to select bits to produce the desired outcome
rather than faithfully replicate hardware behavior. Specifically, the
DEFERRED status bit would bypass the CE filtering condition in
do_machine_check(). And it would trigger the AO flow in the guest rather
than the AR flow if we set the UC status bit.

Another example is we use the POISON status bit so the address is marked
as "usable". A real DEFERRED error would never have the POISON status
bit; they are mutually exclusive by definition.
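
To make the mismatch concrete, a minimal sketch of a check for that
combination is below. The bit positions follow the MCI_STATUS_* definitions
in the Linux mce.h header as I remember them (VAL 63, UC 61, DEFERRED 44,
POISON 43). The helper status_is_synthetic_deferred() is hypothetical and
only illustrates the idea; nothing like it exists in the kernel or QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, mirroring MCI_STATUS_* in Linux. */
#define MCI_STATUS_VAL      (1ULL << 63)  /* status register is valid */
#define MCI_STATUS_UC       (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_DEFERRED (1ULL << 44)  /* deferred error */
#define MCI_STATUS_POISON   (1ULL << 43)  /* poisoned data consumed */

/*
 * Hypothetical helper: true when a valid status value carries both
 * DEFERRED and POISON, a combination that real hardware should not
 * produce if the two bits are mutually exclusive.
 */
bool status_is_synthetic_deferred(uint64_t status)
{
	const uint64_t both = MCI_STATUS_DEFERRED | MCI_STATUS_POISON;

	if (!(status & MCI_STATUS_VAL))
		return false;

	return (status & both) == both;
}

/* Usage: illustrative values only, not taken from a real log. */
void status_self_check(void)
{
	uint64_t hw_like = MCI_STATUS_VAL | MCI_STATUS_DEFERRED;
	uint64_t injected = hw_like | MCI_STATUS_POISON;

	assert(!status_is_synthetic_deferred(hw_like));
	assert(status_is_synthetic_deferred(injected));
}
```

A guest could not rely on such a check in general, but it shows why the
injected status does not look like anything the hardware would report.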

But there may be another hidden issue: handling the error through
polling rather than #MC. I'm thinking this isn't intentional, and the
recent Linux changes exposed this behavior.

Thanks,
Yazen