Re: [PATCH 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter
From: Andi Kleen
Date: Tue May 30 2017 - 13:51:56 EST
On Tue, May 30, 2017 at 07:40:14PM +0200, Peter Zijlstra wrote:
> On Tue, May 30, 2017 at 10:22:08AM -0700, Andi Kleen wrote:
> > > > You would only need a single one per system however, not one per CPU.
> > > > RCU already tracks all the CPUs, all we need is a single NMI watchdog
> > > > that makes sure RCU itself does not get stuck.
> > > >
> > > > So we just have to find a single watchdog somewhere that can trigger
> > > > NMI.
> > >
> > > But then you have to IPI broadcast the NMI, which is less than ideal.
> >
> > Only when the watchdog times out to print the backtraces.
>
> The current NMI watchdog has a per-cpu state. So that means either doing
> for_all_cpu() loops or IPI broadcasts from the NMI tickle. Neither is
> something you really want.

The normal case is that the RCU stall detector only prints the backtrace
for the CPU that stalled.

The extra NMI watchdog should only kick in when RCU is broken too,
or when the CPU that owns the stall detection has stalled as well, which
should be rare. In that case it's reasonable to print backtraces for all
CPUs, like sysrq would do.

In theory we could try to figure out which CPU currently owns stall
detection, but it's probably safer to do it for all.
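
To make the fallback concrete, here is a toy userspace model, not kernel
code. Everything in it (SystemWatchdog, tickle(), nmi(), the timeout value)
is a made-up name for illustration only. It assumes the RCU stall detector
can poke a single system-wide timestamp whenever it runs, and that one
periodic NMI somewhere checks that timestamp.

```python
# Toy userspace model of a single system-wide watchdog. Not kernel code.
# All names here are hypothetical and exist only for this sketch.

class SystemWatchdog:
    """One watchdog for the whole system, fed by RCU's stall detection."""

    def __init__(self, ncpus, timeout):
        self.ncpus = ncpus
        self.timeout = timeout
        self.last_tickle = 0  # last time the stall detection made progress

    def tickle(self, now):
        # Assumed hook: called whenever RCU's stall detection itself runs.
        self.last_tickle = now

    def nmi(self, now):
        """Periodic NMI check. Returns the list of CPUs to backtrace."""
        if now - self.last_tickle <= self.timeout:
            # Common case: RCU is alive, so no per-CPU loop and no IPIs.
            return []
        # Stall detection is stuck and we don't know which CPU owns it,
        # so dump every CPU, like sysrq would.
        return list(range(self.ncpus))
```

In this model the all-CPU broadcast only happens on the timeout path; the
regular NMI tick touches one global timestamp and nothing per-CPU.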

BTW there's an alternative solution in cycling the NMI watchdog over
all available CPUs. Then it would eventually cover all. But that's
less real-time friendly than relying on RCU.

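
The cycling alternative could look roughly like the following toy model.
Again this is a userspace sketch with hypothetical names (RotatingWatchdog,
a per-CPU progress counter), not how the kernel watchdog is implemented.
Each NMI tick checks exactly one CPU and then moves on to the next.

```python
# Toy model of one watchdog rotating over CPUs. Not kernel code.
# RotatingWatchdog and the progress counters are hypothetical.

class RotatingWatchdog:
    def __init__(self, ncpus):
        self.ncpus = ncpus
        self.current = 0
        # Counter value seen on the previous visit; None means "not visited".
        self.seen = [None] * ncpus

    def nmi(self, progress):
        """Check one CPU per tick.

        progress[cpu] is a counter that CPU bumps while it makes progress.
        Returns the CPU number if it looks stalled, otherwise None.
        """
        cpu = self.current
        stalled = self.seen[cpu] is not None and progress[cpu] == self.seen[cpu]
        self.seen[cpu] = progress[cpu]
        self.current = (cpu + 1) % self.ncpus  # move on to the next CPU
        return cpu if stalled else None
```

In this model a hung CPU is only noticed when the rotation comes back
around to it, so the detection delay grows with the number of CPUs.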
> > > RCU doesn't have that problem because the quiescent state is a global
> > > thing. CPU progress, which is what the NMI watchdog tests, is very much
> > > per logical CPU though.
> >
> > RCU already has a CPU stall detector. It should work (and usually
> > triggers before the NMI watchdog in my experience unless the
> > whole system is dead)
>
> It only goes and looks at CPU state once it detects the global QS is
> stalled, I think. But I've not had much luck with the RCU one -- although
> I think it's been improved since I last had a hard problem.
I've seen it trigger.
-Andi