Re: [PATCH v1] panic: push panic() messages to the console even from the MCE nmi handler

From: William Roche
Date: Mon Jan 11 2021 - 21:20:51 EST


Thank you for your clarification, and let me explain what I was
talking about before providing my last code version.

>>>> kmsg_dump(KMSG_DUMP_PANIC);
>>>> + panic_flush_to_console();
>
> This is wrong. kmsg_dump() flushes the messages into the registered
> dumpers, e.g. pstore. It handles only messages in the main
> log buffer.
>
> printk_safe_flush_on_panic() must be called before. It moves any
> pending messages from the temporary per-CPU buffers into the main
> log buffer so that they are visible for the dumpers.

Ok, this is very clear.

>>> Why?
>>
>> Here, we are supposed to push the pending messages, as I could verify
>> that they don't reach the console until the console_unblank(). So I
>> wanted to push them with panic_flush_to_console() before the possible
>> upcoming __crash_kexec() call.
>
> I do not see how the ordering of printk_safe_flush_on_panic()
> and kmsg_dump() would change anything for the upcoming
> __crash_kexec() call.

I was wrong to remove the call to printk_safe_flush_on_panic() before
kmsg_dump() as it was in panic_flush_to_console().

The call panic_flush_to_console() idea here was to push any messages
that could have been submitted (even by any dumper) to the console
before the possible upcoming __crash_kexec() call. Otherwise they could
be discarded (if the crash messages can't be retrieved).

And that's the reason why I've changed this code in my previous version
to force a call to panic_flush_to_console() before the __crash_kexec().
Which seems to be too insecure as far as I understand it now.


In your previous email you also indicate:

>> . With the use of "crashkernel", a Fatal MCE injection shows only the
>> injection message
>>
>> [ 80.811708] mce: Machine check injector initialized
>> [ 92.298755] mce: Triggering MCE exception on CPU 0
>> [ 92.299362] Disabling lock debugging due to kernel taint
>>
>> No other messages is displayed and the system reboots immediately.
>
> But you could find the messages in the crashdump. Aren't you?

Yes, but _only_ with the latest kexec-tools-2.0.21.git version !
[My installed kexec-tools.x86_64 2.0.20-34.0.4.el8_3 doesn't seem to
work, and I guess there is a separate problem that I will investigate
later.]


>
> It works this way by "design". The idea is the following:
>

That's a rather disarming statement :)

I'm sure we can agree that getting the information directly (and of
course safely) from the running kernel console would be a better option,
instead of having a 2 steps process to get the crash reason (even when
it works) as it adds a risk not to get this crash reason at all. And I
know for a fact that Cloud people do want to know why a platform goes
away ;)


> Taking any locks from NMI context might lead to a deadlock.
> Re-initializing the locks might lead to deadlock as well
> because of possible double unlock. Ignoring the locks might
> lead to problems either.
>
> A compromise is needed:
>
> 1. crashdump disabled
>
> console_flush_on_panic() is called. It tries hard to get the
> messages on the console because it is the only chance.
>
> It does console_trylock(). It is called after
> bust_spinlocks(1) so that even the console-specific locks
> are taken only with trylock, see oops_in_progress.
>
> BTW: There are people that do not like this because there
> is still a risk of a deadlock. Some code paths
> take locks without checking oops_in_progress.
> For these people, more reliable reboot is more
> important because they want to have the system
> back ASAP (cloud people).
>
>
> 2. crashdump enabled:
>
> Only printk_safe_flush_on_panic() is called. It does the best effort
> to flush messages from the per-CPU buffers into the main log buffer
> so that they can be found easily in the core.
>
> It is pretty reliable. It should not be needed at all once the new
> lockless ringbuffer gets fully integrated,
>
> It does not try to flush the messages to the console. Getting
> the crash dump is more important than risking a deadlock with
> consoles.

Thanks again for the good clarification. I did not realize that calling
console_flush_on_panic(CONSOLE_FLUSH_PENDING) from panic() before our
__crash_kexec() was not safe enough.

So here is a new version of a (minimal) fix trying to follow your
suggestions:
(And you'll probably remove the added calls to
printk_safe_flush_on_panic() when integrating the new lockless
ringbuffer)

What do you think of this simplified possibility ?:
[If this fix is correct, I'll submit it as a v2 fix, with an updated comment and Notes]

--- kernel/panic.c 2021-01-11 20:33:25.268047057 -0500
***************
*** 166,171 ****
--- 166,180 ----
ftrace_dump(DUMP_ALL);
}

+ /*
+ * Force flush messages to the console.
+ */
+ static void panic_flush_to_console(void)
+ {
+ printk_safe_flush_on_panic();
+ console_flush_on_panic(CONSOLE_FLUSH_PENDING);
+ }
+
/**
* panic - halt the system
* @fmt: The text string to print
***************
*** 284,291 ****
*
* Bypass the panic_cpu check and call __crash_kexec directly.
*/
! if (_crash_kexec_post_notifiers)
__crash_kexec(NULL);

#ifdef CONFIG_VT
unblank_screen();
--- 293,302 ----
*
* Bypass the panic_cpu check and call __crash_kexec directly.
*/
! if (_crash_kexec_post_notifiers) {
! printk_safe_flush_on_panic();
__crash_kexec(NULL);
+ }

#ifdef CONFIG_VT
unblank_screen();
***************
*** 298,304 ****
* buffer. Try to acquire the lock then release it regardless of the
* result. The release will also print the buffers out. Locks debug
* should be disabled to avoid reporting bad unlock balance when
! * panic() is not being callled from OOPS.
*/
debug_locks_off();
console_flush_on_panic(CONSOLE_FLUSH_PENDING);
--- 309,315 ----
* buffer. Try to acquire the lock then release it regardless of the
* result. The release will also print the buffers out. Locks debug
* should be disabled to avoid reporting bad unlock balance when
! * panic() is not being called from OOPS.
*/
debug_locks_off();
console_flush_on_panic(CONSOLE_FLUSH_PENDING);
***************
*** 314,319 ****
--- 325,331 ----
* We can't use the "normal" timers since we just panicked.
*/
pr_emerg("Rebooting in %d seconds..\n", panic_timeout);
+ panic_flush_to_console();

for (i = 0; i < panic_timeout * 1000; i += PANIC_TIMER_STEP) {
touch_nmi_watchdog();
***************
*** 347,352 ****
--- 359,365 ----
disabled_wait();
#endif
pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf);
+ panic_flush_to_console();

/* Do not scroll important messages printed above */
suppress_printk = 1;