Optimize perf stat for large number of events/cpus v2
From: Andi Kleen
Date: Sun Oct 20 2019 - 13:52:45 EST
[The earlier v1 version had a lot of conflicts against some
recent libperf changes in tip/perf/core. Resolve that and
also fix some minor issues.]
This patch kit optimizes perf stat for a large number of events
on systems with many CPUs and PMUs.
Profiling shows that most of the overhead comes from the IPIs sent
to all the target CPUs. We can avoid this by using sched_setaffinity
to set the affinity to a target CPU once and then doing
the perf operations for all events on that CPU. This requires
some restructuring, but cuts the setup time quite a bit.
In theory we could go further by parallelizing these setups
too, but that would be much more complicated and for now just batching it
per CPU seems to be sufficient. At some point with many more cores
parallelization or a better bulk perf setup API might be needed though.
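To illustrate the per-CPU batching described above, here is a minimal
userspace sketch. It is not the code from the patch kit: the names
for_each_cpu_batched, event_op_fn, demo_op and demo_stats are made up
for this example, and the callback stands in for whatever perf
operation (open, enable, read, close) is done per event. Treating a
failed sched_setaffinity as non-fatal is also an assumption of this
sketch; the idea is that pinning is only an optimization.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical callback: one perf operation for event `ev` on `cpu`. */
typedef int (*event_op_fn)(int cpu, int ev, void *ctx);

/*
 * Walk CPUs in the outer loop and events in the inner loop, so the
 * affinity is changed once per CPU instead of once per (event, cpu) pair.
 * Returns 0 on success or the first non-zero value returned by `op`.
 */
static int for_each_cpu_batched(const int *cpus, int ncpus, int nevents,
                                event_op_fn op, void *ctx)
{
    cpu_set_t saved, target;
    int have_saved, ret = 0;

    assert(cpus != NULL && op != NULL);

    /* Remember the original affinity so it can be restored afterwards. */
    have_saved = sched_getaffinity(0, sizeof(saved), &saved) == 0;

    for (int i = 0; i < ncpus && ret == 0; i++) {
        CPU_ZERO(&target);
        CPU_SET(cpus[i], &target);
        /* Assumption: best effort. If pinning fails we still run the
         * operations, just without the locality benefit. */
        (void)sched_setaffinity(0, sizeof(target), &target);

        for (int ev = 0; ev < nevents && ret == 0; ev++)
            ret = op(cpus[i], ev, ctx);
    }

    if (have_saved)
        (void)sched_setaffinity(0, sizeof(saved), &saved);
    return ret;
}

/* Demo callback: counts calls and how many ran on the requested CPU. */
struct demo_stats {
    int calls;
    int on_target;
};

static int demo_op(int cpu, int ev, void *ctx)
{
    struct demo_stats *st = ctx;

    (void)ev;
    st->calls++;
    if (sched_getcpu() == cpu)
        st->on_target++;
    return 0;
}
```

A caller would pass the list of target CPUs and the number of events,
for example for_each_cpu_batched(cpus, ncpus, nevents, demo_op, &st).
In a real tool the callback would issue the perf syscall for that
event on that CPU, which then runs locally instead of needing an IPI.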
In addition, perf does a lot of redundant /sys accesses with
many PMUs, which can also be expensive. This is also optimized.
On a large test case (>700 events with many weak groups) on a 94 CPU
system I go from
so shaving ~6 seconds of system time, at slightly more cost
in perf stat itself. On a 4 socket system the savings
are more dramatic:
so 11s difference in the user visible set up time.
Also available in
v1: Initial post.
v2: Rebase. Fix some minor issues.