Re: [PATCH 0/4] watchdog: Better handling of concurrent lockups

From: Doug Anderson
Date: Tue Feb 06 2024 - 14:31:53 EST


Hi,

On Tue, Feb 6, 2024 at 2:46 AM John Ogness <john.ogness@xxxxxxxxxxxxx> wrote:
>
> On 2024-02-06, Petr Mladek <pmladek@xxxxxxxx> wrote:
> > I have just got an idea how to make printk_cpu_sync_get_irqsave()
> > less error prone for deadlock on the panic() CPU. The idea is
> > to ignore the lock or give up locking after a timeout on
> > the panic CPU.
>
> This idea is out of scope for this series. But it is something we should
> think about. The issue has always been a possible problem in panic().

One thing to be at least a little cognizant of is how this interacts
with the 10 second timeout in nmi_trigger_cpumask_backtrace(). We can
hit that timeout twice in some of the lockup reports, since we first
trace the locked CPU and then the rest. Ideally we wouldn't hit it
often, but on arm64 without pseudo-NMI turned on it's actually pretty
easy to hit when you've got a hard-locked CPU. Probably that 10 second
timeout should be shortened...
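
To put a rough number on the worst case, here is a small userspace
model of the two back-to-back waits. This is only a sketch and not the
kernel code: the function names and the "respond after N ms" parameter
are made up for illustration, and the only figure taken from the thread
is the 10 second timeout per backtrace request.

```c
/* Assumption from the thread: each backtrace request waits at most 10 s. */
#define BACKTRACE_TIMEOUT_MS (10 * 1000)

/*
 * Hypothetical model of one backtrace request: poll once per simulated
 * millisecond until the target CPU(s) respond or the timeout expires.
 * A negative respond_after_ms models a hard-locked CPU that never
 * responds. Returns the number of milliseconds spent waiting.
 */
static int wait_for_backtrace_ms(int respond_after_ms)
{
	int waited;

	for (waited = 0; waited < BACKTRACE_TIMEOUT_MS; waited++) {
		if (respond_after_ms >= 0 && waited >= respond_after_ms)
			break;
	}
	return waited;
}

/*
 * Hypothetical model of a lockup report that first traces the locked
 * CPU and then the rest, so the two waits add up.
 */
static int lockup_report_wait_ms(int locked_cpu_respond_ms,
				 int other_cpus_respond_ms)
{
	return wait_for_backtrace_ms(locked_cpu_respond_ms) +
	       wait_for_backtrace_ms(other_cpus_respond_ms);
}
```

In this model, lockup_report_wait_ms(-1, -1) comes out to 20000, i.e.
up to 20 seconds of waiting in a single report if neither pass gets a
response, which is the interaction to keep in mind for any new timeout
on the panic CPU.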

-Doug