Re: [RESEND PATCH 2/2] perf/x86: improve the event scheduling to avoid unnecessary pmu_stop/start

From: Wen Yang
Date: Wed Mar 09 2022 - 22:50:48 EST




在 2022/3/8 下午8:50, Peter Zijlstra 写道:
On Sun, Mar 06, 2022 at 10:36:38PM +0800, Wen Yang wrote:

The perf event generated by the above script A (cycles:D), and the counter
it used changes from #1 to #3. We use perf event in pinned mode, and then
continuously read its value for a long time, but its PMU counter changes

Yes, so what?

, so the counter value will also jump.

I fail to see how the counter value will jump when we reprogram the
thing. When we stop we update the value, then reprogram on another
counter and continue. So where does it go sideways?


0xffff88b72db85800:
The perf event generated by the above script A (instructions:D), which has
always occupied #fixed_instruction.

0xffff88bf46c34000, 0xffff88bf46c35000, 0xffff88bf46c30000:
Theses perf events are generated by the above script B.



so it will cause unnecessary pmu_stop/start and also cause abnormal cpi.

How?!?


We may refer to the x86_pmu_enable function:
step1: save events moving to new counters
step2: reprogram moved events into new counters

especially:

static inline int match_prev_assignment(struct hw_perf_event *hwc,
struct cpu_hw_events *cpuc,
int i)
{
return hwc->idx == cpuc->assign[i] &&
hwc->last_cpu == smp_processor_id() &&
hwc->last_tag == cpuc->tags[i];
}

I'm not seeing an explanation for how a counter value is not preserved.

Cloud servers usually continuously monitor the cpi data of some important
services. This issue affects performance and misleads monitoring.

The current event scheduling algorithm is more than 10 years old:
commit 1da53e023029 ("perf_events, x86: Improve x86 event scheduling")

irrelevant


commit 1da53e023029 ("perf_events, x86: Improve x86 event scheduling")

This commit is the basis of the perf event scheduling algorithm we currently
use.

Well yes. But how is the age of it relevant?

The reason why the counter above changed from #1 to #3 can be found from it:
The algorithm takes into account the list of counter constraints
for each event. It assigns events to counters from the most
constrained, i.e., works on only one counter, to the least
constrained, i.e., works on any counter.

the nmi watchdog permanently consumes one fp (*cycles*).
therefore, when the above shell script obtains *cycles:D*
again, it has to use a GP, and its weight is 5.
but other events (like *cache-misses*) have a weight of 4,
so the counter used by *cycles:D* will often be taken away.

So what?

I mean, it is known the algorithm isn't optimal, but at least it's
bounded. There are event sets that will fail to schedule but could, but
I don't think you're talking about that.

Event migrating to a different counter is not a problem. This is
expected and normal. Code *must* be able to deal with it.

In addition, we also found that this problem may affect NMI watchdog in the
production cluster.
The NMI watchdog also uses a fixed counter in fixed mode. Usually, it is The
first element of the event_list array, so it usually takes precedence and
can get a fixed counter.
But if the administrator closes the watchdog first and then enables it, it
may be at the end of the event_list array, so its expected fixed counter may
be occupied by other perf event, and it can only use one GP. In this way,
there is a similar issue here: the PMU counter used by the NMI watchdog may
be disabled/enabled frequently and unnecessarily.

Again, I'm not seeing a problem. If you create more events than we have
hardware counters we'll rotate the list and things will get scheduled in
all sorts of order. This works.

Any advice or guidance on this would be appreciated.

I'm still not sure what your actual problem is; I suspect you're using
perf wrong.

Are you using rdpmc and not respecting the scheme described in
include/uapi/linux/perf_events.h:perf_event_mmap_page ?

Note that if you're using pinned counters you can simplify that scheme
by ignoring all the timekeeping nonsense. In that case it does become
significantly simpler/faster.

But you cannot use rdpmc without using the mmap page's self-monitoring
data.



The current algorithm is outstanding and has been running stably for so many years, and all existing performance work benefits from it.

We're just trying to optimize a little bit.

As you pointed out, some non-compliant rdpmc can cause problems. But you also know that linux is the foundation of cloud servers, and many third-party programs run on it (we don't have any code for it), and we can only observe that the monitoring data will jitter abnormally (the probability of this issue is not high, about dozens of tens of thousands of machines).

We also see that the existing perf code is also trying to avoid frequent changes of the pmu counter, such as:


/*
* fastpath, try to reuse previous register
*/
for (i = 0; i < n; i++) {
u64 mask;

hwc = &cpuc->event_list[i]->hw;
c = cpuc->event_constraint[i];

/* never assigned */
if (hwc->idx == -1)
break;

/* constraint still honored */
if (!test_bit(hwc->idx, c->idxmsk))
break;

mask = BIT_ULL(hwc->idx);
if (is_counter_pair(hwc))
mask |= mask << 1;

/* not already used */
if (used_mask & mask)
break;

used_mask |= mask;

if (assign)
assign[i] = hwc->idx;
}


But it only works when the perf events are reduced.

and nother example (via the PERF_HES_ARCH flag):

* step1: save events moving to new counters
*/
for (i = 0; i < n_running; i++) {
...
/*
* Ensure we don't accidentally enable a stopped
* counter simply because we rescheduled.
*/
if (hwc->state & PERF_HES_STOPPED)
hwc->state |= PERF_HES_ARCH;

x86_pmu_stop(event, PERF_EF_UPDATE);
}

/*
* step2: reprogram moved events into new counters
*/
for (i = 0; i < cpuc->n_events; i++) {
...

if (hwc->state & PERF_HES_ARCH)
continue;

x86_pmu_start(event, PERF_EF_RELOAD);
}


Now we are just follow this idea.

In the scenario of increasing perf events, we also try to reuse the previous PMU counters.
Finally, if this attempt fails, still fall back to the original algorithm and reassign all PMU counters. eg:

/* slow path */
- if (i != n)
- unsched = _perf_assign_events(constraints, n,
- wmin, wmax, gpmax, assign);
+ if (i != n) {
+ unsched = _perf_assign_events(constraints, n, wmin, wmax, gpmax,
+ msk_counters, msk_event, assign);
+ if (unsched)
+ unsched = _perf_assign_events(constraints, n, wmin, wmax, gpmax,
+ 0, 0, assign);
+ }


We have applied the above optimizations in some small-scale clusters and verified that the abnormal cpi data has indeed disappeared. It might also improve performance a bit.


Your guidance has been particularly helpful.

Best wishes,
Wen