Re: NMI watchdog dump does not print on hard lockup

From: Petr Mladek
Date: Fri Oct 13 2017 - 07:14:52 EST


On Thu 2017-10-12 12:16:58, Steven Rostedt wrote:
> static void lock_up_cpu(void *data)
> {
> 	unsigned long flags;
> 	raw_spin_lock_irqsave(&global_trace.start_lock, flags);
> 	raw_spin_lock(&global_trace.start_lock);
> 	raw_spin_unlock(&global_trace.start_lock);
> 	raw_spin_unlock_irqrestore(&global_trace.start_lock, flags);
> }
>
> [..]
>
> on_each_cpu(lock_up_cpu, NULL, 1);
>
> This too triggered the warning. But I noticed that the calling function
> didn't hard lockup. (Not all CPUs were hard locked).
>
> Finally I did:
>
> on_each_cpu(lock_up_cpu, NULL, 0);
> lock_up_cpu(tr);
>
> And boom! It locked up (lockdep was enabled, so I could see it showing
> the deadlock), but then it stopped there. No output. The NMI watchdog
> will only detect hard lockups if there is at least one CPU that is
> still active. This could be an issue on non SMP boxes.
>
> We need a way to have NMI flush to consoles when a lockup is detected,
> and not depend on an irq_work to do so.


I thought that enabling CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE
could help: panic() flushes the printk_safe buffers, see
printk_safe_flush_on_panic(). But somehow it does not help.
I need to dig more into it.

In general, we could either improve the detection of situations where
the entire system is locked up. That would be a reason to risk calling
consoles even in NMI.

Or we could accept that the "default" printk is not good for all
situations and allow more special "debugging" modes:

+ Peter's force_early_printk stuff

+ Allow disabling printk_safe and printk_safe_nmi.
  There would be a risk of a deadlock caused by printk,
  but there would also be a chance to see the messages.
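
To make the trade-off concrete, here is a userspace toy model of the
failure mode discussed above. It is only a sketch, not kernel code:
every name in it (toy_cpu, toy_nmi_printk, toy_deferred_flush,
toy_flush_all, toy_reset) is hypothetical, and it assumes a heavily
simplified picture in which messages written in NMI context sit in a
per-CPU buffer until deferred work runs on a CPU that can still take
interrupts.

```c
/* Userspace toy model, NOT kernel code. All names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NR_CPUS 2
#define BUF_LEN 64

struct toy_cpu {
	bool irqs_enabled;	/* false models a hard-locked CPU */
	char nmi_buf[BUF_LEN];	/* per-CPU buffer filled from "NMI" context */
};

static struct toy_cpu cpus[NR_CPUS];
static char console[NR_CPUS * BUF_LEN];

/* Mark the first nr_alive CPUs as able to take interrupts; clear all state. */
static void toy_reset(int nr_alive)
{
	memset(cpus, 0, sizeof(cpus));
	memset(console, 0, sizeof(console));
	for (int i = 0; i < NR_CPUS && i < nr_alive; i++)
		cpus[i].irqs_enabled = true;
}

/* Store a message in the per-CPU buffer; nothing reaches the console yet. */
static void toy_nmi_printk(int cpu, const char *msg)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	snprintf(cpus[cpu].nmi_buf, BUF_LEN, "%s", msg);
}

/* Direct flush: append every buffered message to the console. */
static void toy_flush_all(void)
{
	for (int i = 0; i < NR_CPUS; i++) {
		strncat(console, cpus[i].nmi_buf,
			sizeof(console) - strlen(console) - 1);
		cpus[i].nmi_buf[0] = '\0';
	}
}

/*
 * Deferred flush: only happens if at least one CPU can still take
 * interrupts. Returns true if the buffers were flushed.
 */
static bool toy_deferred_flush(void)
{
	for (int i = 0; i < NR_CPUS; i++) {
		if (cpus[i].irqs_enabled) {
			toy_flush_all();
			return true;
		}
	}
	return false;
}
```

With toy_reset(1), a message stored via toy_nmi_printk() reaches the
console through toy_deferred_flush(). With toy_reset(0), which models
all CPUs being locked up, toy_deferred_flush() returns false and the
console stays empty until something calls toy_flush_all() directly.
In this model, that direct call stands in for the risky path of
calling consoles from NMI.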


Best Regards,
Petr