Re: [PATCH v2 1/4] nmi_backtrace: add more trigger_*_cpu_backtrace() methods
From: Paul E. McKenney
Date: Fri Mar 18 2016 - 19:55:23 EST
On Fri, Mar 18, 2016 at 09:40:25AM +0000, Daniel Thompson wrote:
> On 18/03/16 00:33, Paul E. McKenney wrote:
> >On Thu, Mar 17, 2016 at 08:17:59PM -0400, Chris Metcalf wrote:
> >>On 3/17/2016 6:55 PM, Paul E. McKenney wrote:
> >>>The RCU stall-warn stack traces can be ugly, agreed.
> >>>
> >>>That said, RCU used to use NMI-based stack traces, but switched to the
> >>>current scheme due to the NMIs having the unfortunate habit of locking
> >>>things up, which IIRC often meant no stack traces at all. If I recall
> >>>correctly, one of the problems was self-deadlock in printk().
> >>
> >>Steven Rostedt enabled the per_cpu printk func support in June 2014, and
> >>the nmi_backtrace code uses it to just capture printk output to percpu
> >>buffers, so I think it's going to be a lot more robust than earlier attempts.
> >
> >That would be a very good thing, give or take the "I think" qualifier.
> >And assuming that the target CPU is healthy enough to find its way back
> >to some place that can dump the per-CPU printk buffer. I might well
> >be overly paranoid, but I have to suspect that the probability of that
> >buffer getting dumped is reduced greatly on a CPU that isn't healthy
> >enough to respond to RCU, though.
>
> The target CPU doesn't dump the buffer. It "just" fields the NMI,
> stores the backtrace and sets a flag.
>
> The buffer is dumped to console by the requesting CPU, either when
> all backtraces have come back or when a timeout is reached.
That does sound a bit more robust, good!
> >But it seems like enabling the experiment might be useful.
> >
> >"Try enabling the NMI version. If that doesn't get you your RCU CPU
> >stall warning stack trace, try the remote-print variant."
> >
> >Or I suppose we could just do both in succession, just in case their
> >console was a serial port. ;-)
>
> I guess both might be needed but only when the target CPU is dead
> enough to fail to respond to NMI. In principle, we could exploit the
> timeout in the NMI backtrace logic and only issue the missing
> backtraces.
It would be really nice if I could call one function that used the
best strategy for getting information (including stack trace) about a
specified CPU. Ditto for getting information about a specified task,
which might be running or might be preempted at the time.
Thanx, Paul