Re: [RFC PATCH] mce: don't not enable IRQ in wait_for_panic()

From: Wu Bo
Date: Mon Sep 21 2020 - 07:36:02 EST


On 2020/9/17 18:37, Wu Bo wrote:
In my virtual machine (4 CPUs), I use mce-inject to inject errors
into the system. After mce-inject injects an uncorrectable error,
the virtual machine sometimes does not reset immediately but hangs
for more than 3000 seconds, and an "unable to handle kernel paging
request" error appears.

My analysis is as follows:
1) The MCE is raised on all CPUs, so all CPUs are currently in NMI
context. cpu0 is the first to enter the panic routine, and the panic
path should stop the other processors before calling the ipmi
flush_messages(). But cpu1, cpu2 and cpu3 are already in NMI context,
so the second NMI (the stop IPI) is not processed by cpu1, cpu2 and
cpu3. At this point cpu1, cpu2 and cpu3 have not been stopped.

2) cpu1, cpu2 and cpu3 are waiting for cpu0 to finish the panic process.
If the wait for the other CPUs times out, they call wait_for_panic(),
and wait_for_panic() enables IRQs.

3) An ipmi IRQ occurs on cpu3 while cpu0 is running the panic path,
so the two CPUs can call smi_event_handler() concurrently. The ipmi
IRQ therefore interferes with the panic process on cpu0.

CPU0                                   CPU3

|-nmi_handle do mce_panic              |-nmi_handle do_machine_check
|                                      |
|-panic()                              |-wait_for_panic()
|                                      |
|-stop other cpus ---- NMI ------>     (Ignore, already in nmi interrupt)
|                                      |
|-notifier call(ipmi panic_event)      |<-ipmi IRQ occurs
|                                      |
\|/                                    \|/
do smi_event_handler()                 do smi_event_handler()

My understanding is that panic() expects to run with only one
operational CPU in the system. The other CPUs should be stopped, and
if they cannot be stopped, they should at least keep IRQs disabled.
On x86, disabling IRQs does not affect the stop IPI during an MCE
panic, because that IPI is delivered as an NMI. If my analysis
is not right, please correct me, thanks.
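
To make the reasoning above easier to check, here is a minimal user-space
sketch of the scenario. It is not kernel code: struct toy_cpu and all
toy_* functions are hypothetical names invented for illustration, and the
model simply encodes the assumptions of the analysis (a stop NMI is not
handled by a CPU that is already in NMI context, and a maskable IRQ is
handled only when IRQs are enabled). It counts how many CPUs would end up
in the ipmi event handler at the same time, with and without IRQs enabled
in the wait loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy per-CPU state; a user-space model, not kernel data structures. */
struct toy_cpu {
	bool in_nmi;         /* currently inside an NMI handler */
	bool irqs_enabled;   /* maskable interrupts enabled */
	bool stopped;        /* parked by the panicking CPU's stop request */
	bool in_smi_handler; /* would be running the ipmi event handler */
};

/* Stop request sent as an NMI. Assumption taken from the analysis:
 * a CPU that is already in NMI context does not act on it. */
static void toy_send_stop_nmi(struct toy_cpu *cpu)
{
	if (cpu->in_nmi)
		return;
	cpu->stopped = true;
}

/* Maskable device IRQ (the ipmi interrupt in the report). */
static void toy_raise_ipmi_irq(struct toy_cpu *cpu)
{
	if (cpu->stopped || !cpu->irqs_enabled)
		return;
	cpu->in_smi_handler = true;
}

/* Replay steps 1) to 3); returns how many CPUs are in the handler. */
int toy_replay(bool enable_irq_in_wait)
{
	struct toy_cpu cpus[4];
	int concurrent = 0;

	/* Step 1: the MCE puts every CPU in NMI context, IRQs off. */
	for (size_t i = 0; i < 4; i++)
		cpus[i] = (struct toy_cpu){ .in_nmi = true };

	/* cpu0 wins the panic race and tries to stop the others. */
	for (size_t i = 1; i < 4; i++)
		toy_send_stop_nmi(&cpus[i]);

	/* Step 2: the others time out and enter the wait loop. */
	for (size_t i = 1; i < 4; i++)
		cpus[i].irqs_enabled = enable_irq_in_wait;

	/* Step 3: cpu0 runs the panic notifier, the ipmi IRQ hits cpu3. */
	cpus[0].in_smi_handler = true;
	toy_raise_ipmi_irq(&cpus[3]);

	for (size_t i = 0; i < 4; i++)
		concurrent += cpus[i].in_smi_handler ? 1 : 0;
	return concurrent;
}
```

In this model, toy_replay(true) reports two CPUs in the handler (cpu0 and
cpu3), while toy_replay(false) reports only cpu0, which is the behaviour
the patch is aiming for. Whether real hardware and the real panic path
behave this way is exactly the question raised above.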

Steps to reproduce (the hang does not occur on every run):
1. # vim /tmp/uncorrected
   CPU 1 BANK 4
   STATUS uncorrected 0xc0
   MCGSTATUS EIPV MCIP
   ADDR 0x1234
   RIP 0xdeadbabe
   RAISINGCPU 0
   MCGCAP SER CMCI TES 0x6
2. # modprobe mce_inject
3. # cd /tmp
4. # mce-inject uncorrected

Hi,

I tested 5.9-rc5 and found that the problem still exists. Is there a good solution?

Best regards,
Wu Bo