Re: [PATCH 2/2] x86/mce: Dump the stack for recoverable machine checks in kernel context

From: Borislav Petkov
Date: Mon Oct 31 2022 - 14:36:18 EST


On Mon, Oct 31, 2022 at 05:13:10PM +0000, Luck, Tony wrote:
> > 1. If the error has raised a MCE, then we will dump stack anyway.
>
> I don't see stack dumps for machine check panics. I don't have any non-standard
> settings (I think). Nor do I see them in the panic messages that other folks send
> to me.
>
> Are you setting some CONFIG or command line option to get a stack dump?

Well, if one were sane, one would assume that one would expect to see a
stack dump when the machine panics, right? I mean, it is only fair...

And there's an attempt:

#ifdef CONFIG_DEBUG_BUGVERBOSE
	/*
	 * Avoid nested stack-dumping if a panic occurs during oops processing
	 */
	if (!test_taint(TAINT_DIE) && oops_in_progress <= 1)
		dump_stack();
#endif

but that oops_in_progress thing is stopping us:

[ 13.706764] mce: [Hardware Error]: CPU 2: Machine Check Exception: 6 Bank 4: fe000010000b0c0f
[ 13.706781] mce: [Hardware Error]: RIP 10:<ffffffff8103bbcb> {trigger_mce+0xb/0x10}
[ 13.706791] mce: [Hardware Error]: TSC c83826d14 ADDR e1101add1e550012 MISC cafebeef
[ 13.706795] mce: [Hardware Error]: PROCESSOR 2:a00f11 TIME 1667244167 SOCKET 0 APIC 2 microcode 1000065
[ 13.706809] mce: [Hardware Error]: Machine check: Processor Context Corrupt
[ 13.706810] panic: on entry: oops_in_progress: 1
[ 13.706812] panic: before bust_spinlocks oops_in_progress: 1
[ 13.706813] Kernel panic - not syncing: Fatal local machine check
[ 13.706814] panic: taint: 0, oops_in_progress: 2
[ 13.707133] Kernel Offset: disabled

as panic() is being entered with oops_in_progress already set to 1. That
oops_in_progress thing looks like it is being used for console unblanking.
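
To spell out the arithmetic in the log above, here is a minimal userspace
sketch of that check. It is a model under stated assumptions, not the
kernel's code: the model_* names are made up for illustration, the global
merely stands in for the kernel's oops_in_progress, and the only behaviour
borrowed from the kernel is that bust_spinlocks(1) increments the counter
before the check runs, which matches the 1 -> 2 transition in the debug
output.

```c
#include <stdbool.h>

/* Stand-in for the kernel's global counter (illustration only). */
int oops_in_progress;

/* Models bust_spinlocks(1): it bumps oops_in_progress by one. */
void model_bust_spinlocks_on(void)
{
	++oops_in_progress;
}

/*
 * Models the CONFIG_DEBUG_BUGVERBOSE condition in panic().
 * oops_on_entry: value of oops_in_progress when panic() is entered.
 * tainted_die:   hypothetical stand-in for test_taint(TAINT_DIE).
 * Returns true if dump_stack() would be called.
 */
bool model_panic_would_dump_stack(int oops_on_entry, bool tainted_die)
{
	oops_in_progress = oops_on_entry;
	model_bust_spinlocks_on();
	return !tainted_die && oops_in_progress <= 1;
}
```

With oops_on_entry == 0 and no TAINT_DIE, the model says a stack dump
happens. With oops_on_entry == 1, as in the MCE log, the counter is
already 2 by the time the check runs, so dump_stack() is skipped.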

Looking at

026ee1f66aaa ("panic: fix stack dump print on direct call to panic()")

it hints that panic() might've been called twice for oops_in_progress to
be already 1 on entry.

I guess we need to figure out why that is...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette