Re: [PATCH v2 2/5] x86/mce: dump error msg from severities

From: Shuai Xue
Date: Tue Mar 04 2025 - 20:50:37 EST




On 2025/3/4 00:49, Luck, Tony wrote:
The error context is in the behavior of the hw. If the error is fatal, you
won't see it - the machine will panic or do something else to prevent error
propagation. It definitely won't run any software anymore.

If you see the error getting logged, it means it is not fatal enough to kill
the machine.

One place in the fatal case where I would like to see more information is the

"Action required: data load in error *UN*recoverable area of kernel"

[emphasis on the "UN" added].

Do you mean this one?

MCESEV(
PANIC, "Data load in unrecoverable area of kernel",
SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
KERNEL
),



case. We have a few places where the kernel does recover. And most places
we crash. Our code for the recoverable cases is fragile. Most of this series is
about repairing regressions where we used to recover from places where kernel
is doing get_user() or copy_from_user() which can be recovered if those places
get an error return and the kernel kills the process instead of crashing.

I couldn't agree more.
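
To make the recover-or-crash decision concrete, here is a toy userspace model of an exception-table lookup. It is only a sketch: the struct, the function names and the addresses are all made up for illustration, and the real kernel table stores relative offsets and fixup types rather than absolute addresses.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified entry: absolute addresses instead of the
 * relative offsets used by the real kernel exception table. */
struct toy_extable_entry {
    uintptr_t insn;   /* address of an instruction that may fault */
    uintptr_t fixup;  /* where to resume so the caller gets an error return */
};

/* Made-up addresses, kept sorted by insn so bsearch() can be used. */
static const struct toy_extable_entry toy_extable[] = {
    { 0x1000, 0x9000 },
    { 0x1040, 0x9010 },
    { 0x2000, 0x9020 },
};

/* bsearch() comparator: key is the faulting IP, elem is a table entry. */
static int toy_cmp(const void *key, const void *elem)
{
    uintptr_t ip = *(const uintptr_t *)key;
    const struct toy_extable_entry *e = elem;

    if (ip < e->insn)
        return -1;
    if (ip > e->insn)
        return 1;
    return 0;
}

enum toy_action { TOY_PANIC, TOY_RECOVER };

/* If the faulting IP has a table entry, report where to resume;
 * otherwise nothing can handle the error and the only option is to panic. */
static enum toy_action toy_handle_kernel_mce(uintptr_t ip, uintptr_t *resume_at)
{
    const struct toy_extable_entry *e =
        bsearch(&ip, toy_extable,
                sizeof(toy_extable) / sizeof(toy_extable[0]),
                sizeof(toy_extable[0]), toy_cmp);

    if (!e)
        return TOY_PANIC;

    *resume_at = e->fixup;
    return TOY_RECOVER;
}
```

In this model, an access such as get_user() recovers only if its faulting instruction has an entry; any other kernel address ends in a panic.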


A long time ago I posted some patches to include a stack trace for this type
of crash. It didn't make it into the kernel, and I got distracted by other things.

If we had that, it would have been easier to diagnose this regression (Shuai
Xue would have seen crashes with a stack trace pointing to code that used
to recover in older kernels). Folks with big clusters would also be able to
point out other places where the kernel crashes often enough that additional
EXTABLE recovery paths would be worth investigating.

Agreed, a stack trace will be helpful for debugging unrecoverable cases.
The current panic message is below:

[ 1879.726794] mce: [Hardware Error]: CPU 178: Machine Check Exception: f Bank 1: bd80000000100134
[ 1879.726798] mce: [Hardware Error]: RIP 10:<ffffffff981d7af3> {futex_wait_setup+0x83/0xf0}
[ 1879.726807] mce: [Hardware Error]: TSC 49a1e6001c1 ADDR 80f7ada400 MISC 86 PPIN fc6b80e0ba9d616
[ 1879.726809] mce: [Hardware Error]: PROCESSOR 0:806f4 TIME 1741091252 SOCKET 1 APIC c5 microcode 2b000571
[ 1879.726811] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1879.726813] mce: [Hardware Error]: Machine check events logged
[ 1879.727166] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[ 1879.727168] Kernel panic - not syncing: Fatal local machine check
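
For reference, the Bank 1 status word in this log can be decoded by hand. Below is a minimal userspace sketch, not kernel code. The bit names and values follow my reading of arch/x86/include/asm/mce.h. The two helpers, decode_flags() and data_load_rule_matches(), are hypothetical names for illustration. The real severity table also checks the context, SER support and other fields, which are omitted here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* IA32_MCi_STATUS flag bits, named as in arch/x86/include/asm/mce.h */
#define MCI_STATUS_VAL   (1ULL << 63)  /* valid error */
#define MCI_STATUS_OVER  (1ULL << 62)  /* previous errors lost */
#define MCI_STATUS_UC    (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_EN    (1ULL << 60)  /* error enabled */
#define MCI_STATUS_MISCV (1ULL << 59)  /* MISC register valid */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* ADDR register valid */
#define MCI_STATUS_PCC   (1ULL << 57)  /* processor context corrupt */
#define MCI_STATUS_S     (1ULL << 56)  /* signaled machine check */
#define MCI_STATUS_AR    (1ULL << 55)  /* action required */

#define MCI_UC_SAR  (MCI_STATUS_UC | MCI_STATUS_S | MCI_STATUS_AR)
#define MCI_ADDR    (MCI_STATUS_ADDRV | MCI_STATUS_MISCV)
#define MCACOD      0xefffULL          /* MCA error code mask */
#define MCACOD_DATA 0x0134ULL          /* data load */

static const struct { uint64_t bit; const char *name; } mci_flags[] = {
    { MCI_STATUS_VAL, "VAL" },     { MCI_STATUS_OVER, "OVER" },
    { MCI_STATUS_UC, "UC" },       { MCI_STATUS_EN, "EN" },
    { MCI_STATUS_MISCV, "MISCV" }, { MCI_STATUS_ADDRV, "ADDRV" },
    { MCI_STATUS_PCC, "PCC" },     { MCI_STATUS_S, "S" },
    { MCI_STATUS_AR, "AR" },
};

/* Hypothetical helper: write the names of the set flags into buf (len > 0). */
static void decode_flags(uint64_t status, char *buf, size_t len)
{
    size_t used = 0;

    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(mci_flags) / sizeof(mci_flags[0]); i++) {
        if (!(status & mci_flags[i].bit))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? " " : "", mci_flags[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;  /* output truncated, stop here */
        used += (size_t)n;
    }
}

/* Hypothetical helper: the mask/result comparison for the
 * "Data load in unrecoverable area of kernel" rule only. */
static int data_load_rule_matches(uint64_t status)
{
    const uint64_t mask   = MCI_STATUS_OVER | MCI_UC_SAR | MCI_ADDR | MCACOD;
    const uint64_t result = MCI_UC_SAR | MCI_ADDR | MCACOD_DATA;

    return (status & mask) == result;
}
```

For the logged status bd80000000100134 this gives the flags VAL UC EN MISCV ADDRV S AR with MCACOD 0x0134, and the mask/result comparison matches, which is consistent with the "Data load in unrecoverable area of kernel" message.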


It only provides a RIP, and I spent a lot of time figuring out the root cause of
why get_user() and copy_from_user() fail in the upstream kernel.


So:

1) We need to fix the regressions. That just needs new commit messages
for these patches that explain the issue better.

I will polish the commit messages.


2) I'd like to see a patch for a stack trace for the unrecoverable case.

Could you provide any reference link to your previous patch?


3) I don't see much value in a message that reports the recoverable case.


Got it.

Thanks
Shuai