On Sun, Mar 06, 2022 at 10:36:38PM +0800, Wen Yang wrote:
> > > The perf event generated by the above script A (cycles:D), and the counter
> > > it used changes from #1 to #3. We use perf event in pinned mode, and then
> > > continuously read its value for a long time, but its PMU counter changes
> >
> > Yes, so what?
>
> , so the counter value will also jump.
I fail to see how the counter value will jump when we reprogram the
thing. When we stop we update the value, then reprogram on another
counter and continue. So where does it go sideways?
> 0xffff88b72db85800:
> The perf event generated by the above script A (instructions:D), which has
> always occupied #fixed_instruction.
>
> 0xffff88bf46c34000, 0xffff88bf46c35000, 0xffff88bf46c30000:
> These perf events are generated by the above script B.
> > >
> > > so it will cause unnecessary pmu_stop/start and also cause abnormal cpi.
> >
> > How?!?
> We may refer to the x86_pmu_enable function:
>
>     step1: save events moving to new counters
>     step2: reprogram moved events into new counters
>
> especially:
>
>     static inline int match_prev_assignment(struct hw_perf_event *hwc,
>                                             struct cpu_hw_events *cpuc,
>                                             int i)
>     {
>             return hwc->idx == cpuc->assign[i] &&
>                    hwc->last_cpu == smp_processor_id() &&
>                    hwc->last_tag == cpuc->tags[i];
>     }
I'm not seeing an explanation for how a counter value is not preserved.
> Cloud servers usually continuously monitor the cpi data of some important
> services. This issue affects performance and misleads monitoring.
> > >
> > > The current event scheduling algorithm is more than 10 years old:
> > > commit 1da53e023029 ("perf_events, x86: Improve x86 event scheduling")
> >
> > irrelevant
>
> commit 1da53e023029 ("perf_events, x86: Improve x86 event scheduling")
> This commit is the basis of the perf event scheduling algorithm we
> currently use.
Well yes. But how is the age of it relevant?
> The reason why the counter above changed from #1 to #3 can be found from it:
>
>     The algorithm takes into account the list of counter constraints
>     for each event. It assigns events to counters from the most
>     constrained, i.e., works on only one counter, to the least
>     constrained, i.e., works on any counter.
>
> The NMI watchdog permanently consumes one fixed counter (*cycles*).
> Therefore, when the above shell script obtains *cycles:D* again, it has
> to use a GP, and its weight is 5. But other events (like *cache-misses*)
> have a weight of 4, so the counter used by *cycles:D* will often be
> taken away.
So what?
I mean, it is known the algorithm isn't optimal, but at least it's
bounded. There are event sets that will fail to schedule but could, but
I don't think you're talking about that.
Event migrating to a different counter is not a problem. This is
expected and normal. Code *must* be able to deal with it.
> In addition, we also found that this problem may affect the NMI watchdog
> in the production cluster.
>
> The NMI watchdog also uses a fixed counter, in pinned mode. Usually it is
> the first element of the event_list array, so it usually takes precedence
> and can get a fixed counter.
>
> But if the administrator disables the watchdog and then re-enables it, it
> may end up at the end of the event_list array, so its expected fixed
> counter may be occupied by another perf event, and it can only use a GP.
> In this way, there is a similar issue here: the PMU counter used by the
> NMI watchdog may be disabled/enabled frequently and unnecessarily.
Again, I'm not seeing a problem. If you create more events than we have
hardware counters we'll rotate the list and things will get scheduled in
all sorts of order. This works.
> Any advice or guidance on this would be appreciated.
I'm still not sure what your actual problem is; I suspect you're using
perf wrong.
Are you using rdpmc and not respecting the scheme described in
include/uapi/linux/perf_event.h:perf_event_mmap_page ?
Note that if you're using pinned counters you can simplify that scheme
by ignoring all the timekeeping nonsense. In that case it does become
significantly simpler/faster.
But you cannot use rdpmc without using the mmap page's self-monitoring
data.