Re: [PATCH v2 1/1] x86/mce/amd: Fix VM crash during deferred error handling

From: William Roche

Date: Thu Mar 12 2026 - 18:46:24 EST


Thank you for taking the time to explain your concerns about the context of this fix's integration, and I do hope my feedback can help convince you.


On 3/12/26 17:04, Borislav Petkov wrote:
> On Thu, Mar 12, 2026 at 04:11:10PM +0100, William Roche wrote:
>> From the kernel point of view (regardless if it is running on bare metal or
>> in a VM), access to these registers is provided by the platform:
>> either the Hardware or the emulation framework.
>
> Except the emulation doesn't emulate the platform properly. We test on real
> hw. If your hypervisor doesn't do that properly then that's not really
> upstream kernel's problem.

There are several aspects that are worth considering here:
First, I totally agree that the emulation has to emulate properly! :)

The problem we are facing is the non-SMCA platform's reaction to updating an SMCA-specific register.
And is the QEMU/KVM VM's reaction, as a non-SMCA machine, a valid case?

In this VM case, the MSR handling emulation is done by KVM, which doesn't implement "permissive" access to unimplemented registers. I also agreed with you when you said that it is working as advertised.
Now if emulating an AMD platform requires providing "permissive" access to a specific set of registers, the fix would not be absolutely necessary. But I may have missed a specification about that. And if such a thing exists, it would also be the responsibility of all kernels (including upstream) to take that into account.

Yazen may help us on this aspect: could you please let us know if there is an AMD specification for accessing SMCA registers on non-SMCA machines?


Now if we had a valid case of existing non-SMCA AMD hardware that could crash on updating an SMCA register, the fix would be needed not only for the VM case.

Yazen, could you also please tell us if existing non-SMCA AMD hardware could crash on updating an SMCA register?


The commit 7cb735d7c0cb ("x86/mce: Unify AMD DFR handler with MCA Polling"), written by Yazen,
introduced an upstream kernel problem on non-SMCA platforms that has been revealed by the emulation framework on AMD. That's the reason why I think it should be fixed upstream too. And Yazen himself agrees with that.



>> Errors are injected into VMs by the hypervisor when real memory hardware
>> errors occur on the system that impact the VM address space.
>
> And?

>> The injected error is received by the VM kernel to deal with it.
>
> Why?

The VM kernel executes the same mechanisms used on bare metal in that case.
As Tony said on Feb 9: "The guest may be able to just kill a process and keep running."


> What's the recovery action scenario for having errors injected into guests?

Just the same as running on real HW.

> Where is that documented? Why does the upstream kernel need to care?

Sorry, I don't have a kernel documentation pointer for that, but the MCE relay mechanism is certainly a hypervisor functionality.


> Basically I'm asking you for the use case in order to determine whether that
> use case is valid for the *upstream* kernel to support.

Yes, of course, see below.


>> This is not only a test, this is a real life mechanism. With the fix
>> 7cb735d7c0cb that has been integrated, VM kernels running on AMD now crash
>> on Deferred errors, where they used to be able to deal with them before this
>> commit.
>
> Because we don't know of your use case. So when we do upstream development how
> can we test your case?


I have a procedure to verify the behavior: it consists of running the upstream kernel in a VM (on an AMD platform) and injecting a memory error from the hardware platform into this VM, to mimic a real hardware error being reported to the platform kernel.

To do so:
Run QEMU as root (to help with the address translation).
The VM runs the upstream kernel.
Run the small attached program in the VM as root, so that it gives the guest physical address of one of its mapped memory pages.

[root@VM]# ./mce_process_react_x86
Setting Early kill... Ok

Data pages at 0xXXXXXXX physically 0xYYYYY000

-> DON'T press enter! (just leave the process waiting here)

Ask the emulator (QEMU in this case) to give the host physical address of the guest physical page:
(qemu) gpa2hpa 0xYYYYY000
Host physical address for 0xYYYYY000 (pc.ram) is 0xPFN000

From the host physical address, get the pfn value to poison (removing the last 3 zeros of the hex address).

On the host, use hwpoison kernel module:
[root@host]# modprobe hwpoison_inject

and inject an error to the targeted pfn:
[root@host]# echo 0xPFN > /sys/kernel/debug/hwpoison/corrupt-pfn

Then wait until the generated asynchronous error reaches the VM (it can take up to 5 minutes on AMD virtualization) to see the VM kernel deal with it.

Without this suggested fix, the VM kernel panics, with the stack trace I gave:

mce: MSR access error: WRMSR to 0xc0002098 (tried to write 0x0000000000000000)
at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)

amd_clear_bank+0x6e/0x70
machine_check_poll+0x228/0x2e0
? __pfx_mce_timer_fn+0x10/0x10
mce_timer_fn+0xb1/0x130
? __pfx_mce_timer_fn+0x10/0x10
call_timer_fn+0x26/0x120
__run_timers+0x202/0x290
run_timer_softirq+0x49/0x100
handle_softirqs+0xeb/0x2c0
__irq_exit_rcu+0xda/0x100
sysvec_apic_timer_interrupt+0x71/0x90
[...]
Kernel panic - not syncing: MCA architectural violation!


With the fix, the VM kernel deals with the error:

[root@VM]# ./mce_process_react_x86
Setting Early kill... Ok
Data pages at 0x7fa0f9b25000 physically 0x172929000

(qemu) gpa2hpa 0x172929000
Host physical address for 0x172929000 (pc.ram) is 0x237129000

-> Injecting the error with:
[root@host]# echo 0x237129 > /sys/kernel/debug/hwpoison/corrupt-pfn

-> The VM monitor indicates:
qemu-kvm: warning: Guest MCE Memory Error at QEMU addr 0x7f3ae2729000 and GUEST addr 0x172929000 of type BUS_MCEERR_AO injected

-> A few minutes later, the VM console shows:
localhost login: [ 332.973864] mce: [Hardware Error]: Machine check events logged
[ 332.976795] Memory failure: 0x172929: Sending SIGBUS to mce_process_rea:5607 due to hardware memory corruption
[ 332.977832] Memory failure: 0x172929: recovery action for dirty LRU page: Recovered
[ 355.056785] MCE: Killing mce_process_rea:5607 due to hardware memory corruption fault at 0x7fa0f9b25000

-> The process shows:
Signal 7 received: BUS_MCEERR_AO on vaddr: 0x7fa0f9b25000
Signal 7 received: BUS_MCEERR_AR on vaddr: 0x7fa0f9b25000
Exit from the signal handler on BUS_MCEERR_AR

-> Works as expected: AO error is relayed by the VM kernel to the application running.



> Before that, is that case even worth testing?

If we accept that relayed MCEs are supported by the upstream kernel running in the VM, then yes.


> I hope I'm making sense here. The MCA and other low-level hw code works on
> baremetal as that's its main target. If it is supposed to work in VMs, then
> there better be a proper use case which we are willing to support and we can
> *actually* *test*.

The above detailed procedure can maybe help with this aspect, even if it is virtualization oriented, as I do hope that the upstream kernel supports memory error handling in a VM.

But Yazen's answers about non-SMCA hardware can also help to decide what to do with this fix.


> If not, you can keep this "fix" in your guest kernels and everyone's happy.

> Thx.

I hope my explanations helped to better understand the context.

Thanks,
William.



#include <sys/types.h>
#include <sys/prctl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <signal.h>
#include <string.h>

#define PAGEMAP_ENTRY 8
#define GET_BIT(X,Y) (((X) >> (Y)) & (uint64_t)1)
#define GET_PFN(X) ((X) & 0x7FFFFFFFFFFFFFULL)

const int __endian_bit = 1;
#define is_bigendian() ( (*(char*)&__endian_bit) == 0 )
static long pgsz;

/*
* Set the early kill mode reaction state to MCE error.
*/
static void early_reaction(void) {
	printf("Setting Early kill... ");
	if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) == 0)
		printf("Ok\n");
	else
		printf("Failure !\n");
}

/*
* Return the physical address associated to a given local virtual address,
* or -1 in case of an error.
*/
static uint64_t physical_address(uint64_t virt_addr) {
	char path_buf[0x100];
	FILE *f;
	uint64_t read_val, file_offset, pfn = 0;
	unsigned char c_buf[PAGEMAP_ENTRY];
	pid_t my_pid = getpid();
	int status, i;

	sprintf(path_buf, "/proc/%u/pagemap", my_pid);

	f = fopen(path_buf, "rb");
	if (!f) {
		printf("Error! Cannot open %s\n", path_buf);
		return (uint64_t)-1;
	}

	file_offset = virt_addr / (uint64_t)pgsz * PAGEMAP_ENTRY;
	status = fseek(f, (long)file_offset, SEEK_SET);
	if (status) {
		perror("Failed to do fseek!");
		fclose(f);
		return (uint64_t)-1;
	}

	for (i = 0; i < PAGEMAP_ENTRY; i++) {
		int c = getc(f);
		if (c == EOF) {
			fclose(f);
			return (uint64_t)-1;
		}
		if (is_bigendian())
			c_buf[i] = (unsigned char)c;
		else
			c_buf[PAGEMAP_ENTRY - i - 1] = (unsigned char)c;
	}
	fclose(f);

	read_val = 0;
	for (i = 0; i < PAGEMAP_ENTRY; i++) {
		read_val = (read_val << 8) + c_buf[i];
	}

	if (GET_BIT(read_val, 63)) {
		pfn = GET_PFN(read_val);
	} else {
		printf("Page not present !\n");
	}
	if (GET_BIT(read_val, 62))
		printf("Page swapped\n");

	if (pfn == 0)
		return (uint64_t)-1;

	return pfn * (uint64_t)pgsz;
}

/*
* SIGBUS handler to display the given information.
*/
static void sigbus_action(int signum, siginfo_t *siginfo, void *ctx) {
	printf("Signal %d received: ", signum);
	printf("%s on vaddr: %p\n",
	       (siginfo->si_code == 4 ? "BUS_MCEERR_AR" : "BUS_MCEERR_AO"),
	       siginfo->si_addr);

	if (siginfo->si_code == 4) { /* BUS_MCEERR_AR */
		fprintf(stderr, "Exit from the signal handler on BUS_MCEERR_AR\n");
		_exit(1);
	}
}

int main(int argc, char **argv) {
	struct sigaction my_sigaction;
	uint64_t virt_addr = 0, phys_addr;
	void *local_pnt;

	// Need the CAP_SYS_ADMIN capability to get PFN values from pagemap.
	if (getuid() != 0) {
		fprintf(stderr, "Usage: %s needs to run as root\n", argv[0]);
		exit(EXIT_FAILURE);
	}

	// Attach our SIGBUS handler.
	memset(&my_sigaction, 0, sizeof(my_sigaction));
	my_sigaction.sa_sigaction = sigbus_action;
	my_sigaction.sa_flags = SA_SIGINFO | SA_NODEFER;
	sigemptyset(&my_sigaction.sa_mask);
	if (sigaction(SIGBUS, &my_sigaction, NULL) == -1) {
		perror("Signal handler attach failed");
		exit(EXIT_FAILURE);
	}

	pgsz = sysconf(_SC_PAGESIZE);
	if (pgsz == -1) {
		perror("sysconf(_SC_PAGESIZE)");
		exit(EXIT_FAILURE);
	}
	early_reaction();

	// Allocate a private page.
	local_pnt = mmap(NULL, pgsz, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, -1, 0);
	if (local_pnt == MAP_FAILED) {
		fprintf(stderr, "Memory Allocation failed !\n");
		exit(EXIT_FAILURE);
	}
	virt_addr = (uint64_t)local_pnt;

	// Dirty / map the page.
	sprintf((char *)local_pnt, "My page\n");

	phys_addr = physical_address(virt_addr);
	if (phys_addr == (uint64_t)-1) {
		fprintf(stderr, "Virtual address translation 0x%llx failed\n",
			(unsigned long long)virt_addr);
		exit(EXIT_FAILURE);
	}
	printf("\nData pages at 0x%llx physically 0x%llx\n",
	       (unsigned long long)virt_addr, (unsigned long long)phys_addr);
	fflush(stdout);

	printf("\nPress ENTER to continue\n");
	fgetc(stdin);

	// Read the string at the beginning of the page.
	printf("%s", (char *)local_pnt);

	phys_addr = physical_address(virt_addr);
	if (phys_addr == (uint64_t)-1) {
		fprintf(stderr, "Virtual address translation 0x%llx failed\n",
			(unsigned long long)virt_addr);
	} else {
		printf("\nData pages at 0x%llx physically 0x%llx\n",
		       (unsigned long long)virt_addr, (unsigned long long)phys_addr);
	}

	return 0;
}