Re: [PATCH v2 2/5] x86/mce: dump error msg from severities

From: Yazen Ghannam
Date: Mon Mar 03 2025 - 13:09:01 EST


On Mon, Mar 03, 2025 at 04:49:25PM +0000, Luck, Tony wrote:
> > The error context is in the behavior of the hw. If the error is fatal, you
> > won't see it - the machine will panic or do something else to prevent error
> > propagation. It definitely won't run any software anymore.
> >
> > If you see the error getting logged, it means it is not fatal enough to kill
> > the machine.
>
> One place in the fatal case where I would like to see more information is the
>
> "Action required: data load in error *UN*recoverable area of kernel"
>
> [emphasis on the "UN" added].
>
> case. The kernel recovers in a few places and crashes in most others. Our
> code for the recoverable cases is fragile. Most of this series is about
> repairing regressions: we used to recover when the kernel hit an error in
> get_user() or copy_from_user(). Those cases are recoverable if the call
> returns an error and the kernel kills the process instead of crashing.
>
> A long time ago I posted some patches to include a stack trace for this type
> of crash. They didn't make it into the kernel, and I got distracted by other things.
>
> If we had that, it would have been easier to diagnose this regression (Shuai
> Xue would have seen crashes with a stack trace pointing to code that used
> to recover in older kernels). Folks with big clusters would also be able to
> point out other places where the kernel crashes often enough that additional
> EXTABLE recovery paths would be worth investigating.
>
> So:
>
> 1) We need to fix the regressions. That just needs new commit messages
> for these patches that explain the issue better.
>
> 2) I'd like to see a patch for a stack trace for the unrecoverable case.
>
> 3) I don't see much value in a message that reports the recoverable case.
>
> Yazen: At one point I think you said you were looking at adding additional
> decorations to the return value from mce_severity() to indicate actions
> needed for recoverable errors (kill the process, offline the page) rather
> than have do_machine_check() figure it out by looking at various fields
> in the "struct mce". Did that go anywhere? Those extra details might be
> interesting in the tracepoint.
>

Hi Tony,

Yes, I have a patch here:
https://github.com/AMDESE/linux/commit/cf0b8a97240abf0fbd98a91cd8deb262f827721b

Branch:
https://github.com/AMDESE/linux/commits/wip-mca/

This work is at the tail-end of a lot of other refactoring. But it can
be prioritized if there's interest. Most of the dependencies have
already been merged.
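
As a rough illustration of the idea discussed above (this is not the code
in the linked commit, and every name here is hypothetical), a grading
function can pack the required actions into the upper bits of its return
value, so the caller does not have to re-derive them from fields in the
error record. A minimal user-space sketch, assuming a simplified stand-in
for "struct mce":

```c
#include <stdbool.h>

/* Hypothetical severity levels, kept in the low byte of the return value. */
enum sketch_sev {
	SKETCH_SEV_KEEP  = 1,	/* log only */
	SKETCH_SEV_AR    = 2,	/* action required, recoverable */
	SKETCH_SEV_PANIC = 3,	/* unrecoverable */
};

#define SKETCH_SEV_MASK		0xffu

/* Hypothetical action "decorations", kept above the severity byte. */
#define SKETCH_ACT_KILL_TASK	(1u << 8)
#define SKETCH_ACT_OFFLINE_PAGE	(1u << 9)
#define SKETCH_ACT_EXTABLE	(1u << 10)

/* Simplified stand-in for the error record; not the real struct mce. */
struct sketch_mce {
	bool uncorrected;	/* error was not corrected by hardware */
	bool action_required;	/* consumer must act before continuing */
	bool in_kernel;		/* error was consumed in kernel context */
	bool has_fixup;		/* faulting instruction has an extable fixup */
	bool addr_valid;	/* a physical address was reported */
};

/* Grade the error and attach the actions the caller should take. */
static unsigned int sketch_severity(const struct sketch_mce *m)
{
	unsigned int page = m->addr_valid ? SKETCH_ACT_OFFLINE_PAGE : 0;

	if (!m->uncorrected)
		return SKETCH_SEV_KEEP;
	if (!m->action_required)
		return SKETCH_SEV_KEEP | page;
	if (!m->in_kernel)
		return SKETCH_SEV_AR | SKETCH_ACT_KILL_TASK | page;
	if (m->has_fixup)
		return SKETCH_SEV_AR | SKETCH_ACT_EXTABLE |
		       SKETCH_ACT_KILL_TASK | page;

	/* Kernel context with no fixup: nothing can be recovered. */
	return SKETCH_SEV_PANIC;
}

/* Helpers for the caller (or a tracepoint) to decode the result. */
static unsigned int sketch_level(unsigned int ret)
{
	return ret & SKETCH_SEV_MASK;
}

static bool sketch_has_action(unsigned int ret, unsigned int act)
{
	return (ret & act) != 0;
}
```

With something along these lines, the handler would switch on
sketch_level() and test individual action bits, and the same value could
be exported in the tracepoint as-is.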

Thanks,
Yazen