[PATCH] perf/core: fix multiplexing event scheduling issue

From: Stephane Eranian
Date: Thu Oct 17 2019 - 20:29:23 EST


This patch complements the following commit:
7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")

The fix from Song addresses the consequences of the problem but
not the cause. This patch fixes the causes and can sit on top of
Song's patch.

This patch fixes a scheduling problem in the core functions of
perf_events. Under certain conditions, some events would not be
scheduled even though many counters would be available. This
is related to multiplexing and is architecture agnostic and
PMU agnostic (i.e., core or uncore).

This problem can easily be reproduced when you have two perf
stat sessions. The first session does not cause multiplexing,
let's say it is measuring 1 event, E1. While it is measuring,
a second session starts and causes multiplexing. Let's say it
adds 6 events, B1-B6. Now, 7 events compete and are multiplexed.
When the second session terminates, all 6 (B1-B6) events are
removed. Normally, you'd expect the E1 event to continue to run
with no multiplexing. However, the problem is that depending on
the state Of E1 when B1-B6 are removed, it may never be scheduled
again. If E1 was inactive at the time of removal, despite the
multiplexing hrtimer still firing, it will not find any active
events and will not try to reschedule. This is what Song's patch
fixes. It forces the multiplexing code to consider non-active events.
However, the cause is not addressed. The kernel should not rely on
the multiplexing hrtimer to unblock inactive events. That timer
can have abitrary duration in the milliseconds. Until the timer
fires, counters are available, but no measurable events are using
them. We do not want to introduce blind spots of arbitrary durations.

This patch addresses the cause of the problem, by checking that,
when an event is disabled or removed and the context was multiplexing
events, inactive events gets immediately a chance to be scheduled by
calling ctx_resched(). The rescheduling is done on event of equal
or lower priority types. With that in place, as soon as a counter
is freed, schedulable inactive events may run, thereby eliminating
a blind spot.

This can be illustrated as follows using Skylake uncore CHA here:

$ perf stat --no-merge -a -I 1000 -C 28 -e uncore_cha_0/event=0x0/
42.007856531 2,000,291,322 uncore_cha_0/event=0x0/
43.008062166 2,000,399,526 uncore_cha_0/event=0x0/
44.008293244 2,000,473,720 uncore_cha_0/event=0x0/
45.008501847 2,000,423,420 uncore_cha_0/event=0x0/
46.008706558 2,000,411,132 uncore_cha_0/event=0x0/
47.008928543 2,000,441,660 uncore_cha_0/event=0x0/

Adding second sessiont with 4 events for 4s

$ perf stat -a -I 1000 -C 28 --no-merge -e uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/ sleep 4
48.009114643 1,983,129,830 uncore_cha_0/event=0x0/ (99.71%)
49.009279616 1,976,067,751 uncore_cha_0/event=0x0/ (99.30%)
50.009428660 1,974,448,006 uncore_cha_0/event=0x0/ (98.92%)
51.009524309 1,973,083,237 uncore_cha_0/event=0x0/ (98.55%)
52.009673467 1,972,097,678 uncore_cha_0/event=0x0/ (97.11%)

End of 4s, second session is removed. But the first event does not schedule and never will
unless new events force multiplexing again.

53.009815999 <not counted> uncore_cha_0/event=0x0/ (95.28%)
54.009961809 <not counted> uncore_cha_0/event=0x0/ (93.52%)
55.010110972 <not counted> uncore_cha_0/event=0x0/ (91.82%)
56.010217579 <not counted> uncore_cha_0/event=0x0/ (90.18%)

Signed-off-by: Stephane Eranian <eranian@xxxxxxxxxx>
---
kernel/events/core.c | 67 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 67 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9ec0b0bfddbd..578587246ffb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2140,6 +2140,10 @@ group_sched_out(struct perf_event *group_event,

#define DETACH_GROUP 0x01UL

+static void ctx_resched(struct perf_cpu_context *cpuctx,
+ struct perf_event_context *task_ctx,
+ enum event_type_t event_type);
+
/*
* Cross CPU call to remove a performance event
*
@@ -2153,6 +2157,7 @@ __perf_remove_from_context(struct perf_event *event,
void *info)
{
unsigned long flags = (unsigned long)info;
+ int was_necessary = ctx->rotate_necessary;

if (ctx->is_active & EVENT_TIME) {
update_context_time(ctx);
@@ -2171,6 +2176,37 @@ __perf_remove_from_context(struct perf_event *event,
cpuctx->task_ctx = NULL;
}
}
+
+ /*
+ * sanity check that event_sched_out() does not and will not
+ * change the state of ctx->rotate_necessary
+ */
+ WARN_ON(was_necessary != event->ctx->rotate_necessary);
+ /*
+ * if we remove an event AND we were multiplexing then, that means
+ * we had more events than we have counters for, and thus, at least,
+ * one event was in INACTIVE state. Now, that we removed an event,
+ * we need to resched to give a chance to all events to get scheduled,
+ * otherwise some may get stuck.
+ *
+ * By the time this function is called the event is usually in the OFF
+ * state.
+ * Note that this is not a duplicate of the same code in _perf_event_disable()
+ * because the call path are different. Some events may be simply disabled
+ * others removed. There is a way to get removed and not be disabled first.
+ */
+ if (ctx->rotate_necessary && ctx->nr_events) {
+ int type = get_event_type(event);
+ /*
+ * In case we removed a pinned event, then we need to
+ * resched for both pinned and flexible events. The
+ * opposite is not true. A pinned event can never be
+ * inactive due to multiplexing.
+ */
+ if (type & EVENT_PINNED)
+ type |= EVENT_FLEXIBLE;
+ ctx_resched(cpuctx, cpuctx->task_ctx, type);
+ }
}

/*
@@ -2218,6 +2254,8 @@ static void __perf_event_disable(struct perf_event *event,
struct perf_event_context *ctx,
void *info)
{
+ int was_necessary = ctx->rotate_necessary;
+
if (event->state < PERF_EVENT_STATE_INACTIVE)
return;

@@ -2232,6 +2270,35 @@ static void __perf_event_disable(struct perf_event *event,
event_sched_out(event, cpuctx, ctx);

perf_event_set_state(event, PERF_EVENT_STATE_OFF);
+ /*
+ * sanity check that event_sched_out() does not and will not
+ * change the state of ctx->rotate_necessary
+ */
+ WARN_ON_ONCE(was_necessary != event->ctx->rotate_necessary);
+
+ /*
+ * if we disable an event AND we were multiplexing then, that means
+ * we had more events than we have counters for, and thus, at least,
+ * one event was in INACTIVE state. Now, that we disabled an event,
+ * we need to resched to give a chance to all events to be scheduled,
+ * otherwise some may get stuck.
+ *
+ * Note that this is not a duplicate of the same code in
+ * __perf_remove_from_context()
+ * because events can be disabled without being removed.
+ */
+ if (ctx->rotate_necessary && ctx->nr_events) {
+ int type = get_event_type(event);
+ /*
+ * In case we removed a pinned event, then we need to
+ * resched for both pinned and flexible events. The
+ * opposite is not true. A pinned event can never be
+ * inactive due to multiplexing.
+ */
+ if (type & EVENT_PINNED)
+ type |= EVENT_FLEXIBLE;
+ ctx_resched(cpuctx, cpuctx->task_ctx, type);
+ }
}

/*
--
2.23.0.866.gb869b98d4c-goog