Re: [PATCH v2 0/4] perf: Fix perf_event_attr::exclusive rotation
From: Peter Zijlstra
Date: Mon Nov 09 2020 - 06:48:26 EST

On Mon, Nov 02, 2020 at 06:41:43PM -0800, Andi Kleen wrote:
> On Mon, Nov 02, 2020 at 03:16:25PM +0100, Peter Zijlstra wrote:
> > On Sun, Nov 01, 2020 at 07:52:38PM -0800, Andi Kleen wrote:
> > > The main motivation is actually that the "multiple groups" algorithm
> > > in perf doesn't work all that great: it has quite a few cases where it
> > > starves groups or makes the wrong decisions. That is because it is a very
> > > difficult (likely NP-complete) problem and the kernel takes a lot
> > > of shortcuts to avoid spending too much time on it.
> >
> > The event scheduling should be starvation free, except in the presence
> > of pinned events.
> >
> > If you can show starvation without pinned events, it's a bug.
> >
> > It will also always do equal or better than exclusive mode wrt PMU
> > utilization. Again, if it doesn't it's a bug.
>
> Simple example (I think we've shown that one before):
>
> (on skylake)
> $ cat /proc/sys/kernel/nmi_watchdog
> 0
> $ perf stat -e instructions,cycles,frontend_retired.latency_ge_2,frontend_retired.latency_ge_16 -a sleep 2
>
> Performance counter stats for 'system wide':
>
> 654,514,990 instructions # 0.34 insn per cycle (50.67%)
> 1,924,297,028 cycles (74.28%)
> 21,708,935 frontend_retired.latency_ge_2 (75.01%)
> 1,769,952 frontend_retired.latency_ge_16 (24.99%)
>
> 2.002426541 seconds time elapsed
>
> The two frontend_retired events should both be getting 50% and the fixed events should be getting
> 100%. So several events are starved.

*should* how? Also, nothing is 0% so nothing is getting starved.

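To make the "nothing is 0%" point concrete: the figure perf stat prints in parentheses is the share of enabled time the event actually spent on the PMU (time_running / time_enabled), and the printed count is the raw count scaled up by the inverse of that share. A minimal sketch of that check, using the percentages from the Skylake run quoted above; the function and variable names are hypothetical and exist only for this illustration:

```python
# Percentages printed by perf stat in the Skylake run quoted above.
running_pct = {
    "instructions": 50.67,
    "cycles": 74.28,
    "frontend_retired.latency_ge_2": 75.01,
    "frontend_retired.latency_ge_16": 24.99,
}


def starved(pcts):
    """Return the events that never got any PMU time (0% running)."""
    return [name for name, pct in pcts.items() if pct == 0.0]


def scale_count(raw, time_enabled, time_running):
    """Extrapolate a multiplexed raw count to the full enabled time."""
    if time_running == 0:
        raise ValueError("event was never scheduled, nothing to scale")
    return raw * time_enabled / time_running


def insn_per_cycle(instructions, cycles):
    """Ratio perf stat prints next to the instructions line."""
    return instructions / cycles


print(starved(running_pct))
print(round(insn_per_cycle(654_514_990, 1_924_297_028), 2))
```

By this definition the list of starved events is empty: the shares are uneven, but every event got some time on the PMU.
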
> Another similar example is trying to schedule the topdown events on Icelake in parallel to other
> groups. It works with one extra group, but breaks with two.
>
> (on icelake)
> $ cat /proc/sys/kernel/nmi_watchdog
> 0
> $ perf stat -e '{slots,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring},{branches,branches,branches,branches,branches,branches,branches,branches},{branches,branches,branches,branches,branches,branches,branches,branches}' -a sleep 1
>
> Performance counter stats for 'system wide':
>
> 71,229,087 slots (60.65%)
> 5,066,320 topdown-bad-spec # 7.1% bad speculation (60.65%)
> 35,080,387 topdown-be-bound # 49.2% backend bound (60.65%)
> 22,769,750 topdown-fe-bound # 32.0% frontend bound (60.65%)
> 8,336,760 topdown-retiring # 11.7% retiring (60.65%)
> 424,584 branches (70.00%)
> 424,584 branches (70.00%)
> 424,584 branches (70.00%)
> 424,584 branches (70.00%)
> 424,584 branches (70.00%)
> 424,584 branches (70.00%)
> 424,584 branches (70.00%)
> 424,584 branches (70.00%)
> 3,634,075 branches (30.00%)
> 3,634,075 branches (30.00%)
> 3,634,075 branches (30.00%)
> 3,634,075 branches (30.00%)
> 3,634,075 branches (30.00%)
> 3,634,075 branches (30.00%)
> 3,634,075 branches (30.00%)
> 3,634,075 branches (30.00%)
>
> 1.001312511 seconds time elapsed
>
> A tool using exclusive hopefully will be able to do better than this.

I don't see how, exclusive will always result in equal or worse PMU
utilization, never better.
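
One way to read the Icelake numbers against that claim, as a rough sketch only: assume, purely for illustration, that all three groups were marked exclusive. Then at most one group could be on the PMU at any instant, so the running shares of the groups could sum to at most 100%. The quoted run, without exclusive, sums to more than that. The group labels and the helper name below are hypothetical:

```python
# Running share of each group in the Icelake run quoted above.
group_running_pct = {
    "topdown": 60.65,      # slots + topdown-* group
    "branches_a": 70.00,   # first group of 8 branches events
    "branches_b": 30.00,   # second group of 8 branches events
}

# Assumption for this sketch: with every group exclusive, only one group
# is on the PMU at a time, so the shares cannot sum past 100%.
EXCLUSIVE_CEILING_PCT = 100.0


def beats_exclusive_ceiling(pcts):
    """True if the groups overlapped more than exclusive would allow."""
    return sum(pcts.values()) > EXCLUSIVE_CEILING_PCT


total = sum(group_running_pct.values())
print(f"sum of group shares: {total:.2f}%")
print(beats_exclusive_ceiling(group_running_pct))
```

The two branches groups add up to 100% between them, which is what alternating on the same counters would give, and the topdown group's share comes on top of that.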