Re: [RFC] perf/x86: PMU IRQ handler issues

From: Dave Hansen
Date: Wed May 28 2014 - 16:20:46 EST


On 05/28/2014 12:48 PM, Stephane Eranian wrote:
> Some days ago, I was alerted that under heavy network load, something
> is going wrong with perf_event sampling in frequency mode (such as perf top).
> The number of samples was way too low given the cycle count (via perf stat).
> Looking at the syslog, I noticed that the perf irq latency throttler
> had kicked in several times. There may have been several reasons for this.
>
> Maybe the workload had changing phases and the frequency adjustment
> was not working properly, dropping to a very small period and then
> generating a flood of interrupts.

The problem description here is pretty fuzzy. Could you give some
actual numbers describing the issues that you're seeing, including the
ftrace that Andi was asking for? There are also some handy tracepoints
for NMI lengths that I stuck in.

The reason that the throttling code is there is that the CPU can get
into a state where it is doing *NOTHING* other than processing NMIs (the
biggest of which are the perf-driven ones). It doesn't start throttling
until 128 samples end up averaging more than the limit.
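
To make the averaging concrete, here is a minimal user-space sketch of a
running average over roughly 128 samples compared against a limit. It is
an illustration only, not the kernel's actual code: the struct, the
function name and the nanosecond units are assumptions made for the
example, and only the window of 128 comes from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Window size taken from the text above: 128 samples. */
#define NR_ACCUMULATED_SAMPLES 128

/* Hypothetical per-CPU state; the name is illustrative only. */
struct nmi_len_tracker {
	uint64_t running_len_ns; /* decayed sum of recent sample lengths */
};

/*
 * Record how long one sample took and report whether the running
 * average now exceeds allowed_ns (the assumed limit, in nanoseconds).
 */
static bool sample_took(struct nmi_len_tracker *t, uint64_t sample_len_ns,
			uint64_t allowed_ns)
{
	/* Decay the sum by one average sample, then add the new sample. */
	t->running_len_ns -= t->running_len_ns / NR_ACCUMULATED_SAMPLES;
	t->running_len_ns += sample_len_ns;

	/* Compare the average, not the single sample, against the limit. */
	return t->running_len_ns / NR_ACCUMULATED_SAMPLES > allowed_ns;
}
```

Starting from a zeroed tracker, a single sample ten times over the limit
does not trip the check in this sketch, because it only moves the average
by 1/128 of its length. A sustained run of slow samples does.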

How large of a system is this, btw? I had the worst issues on a
160-logical-cpu system. It was much harder to get it into trouble on
smaller systems.