Re: [PATCH] perf/core: fix multiplexing event scheduling issue

From: Stephane Eranian
Date: Wed Oct 23 2019 - 03:30:17 EST


On Mon, Oct 21, 2019 at 3:21 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Thu, Oct 17, 2019 at 05:27:46PM -0700, Stephane Eranian wrote:
> > This patch complements the following commit:
> > 7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")
> >
> > The fix from Song addresses the consequences of the problem but
> > not the cause. This patch fixes the causes and can sit on top of
> > Song's patch.
>
> I'm tempted to say the other way around.
>
> Consider the case where you claim fixed2 with a pinned event and then
> have another fixed2 in the flexible list. At that point you're _never_
> going to run any other flexible events (without Song's patch).
>
In that case, there is no deactivation or removal of events, so yes, my patch
will not help that case. I said his patch is still useful. You gave one example,
even though in this case the rotate will not yield a reschedule of that flexible
event because fixed2 is used by a pinned event. So checking for it, will not
really help.

> This patch isn't going to help with that. Similarly, Songs patch helps
> with your situation where it will allow rotation to resume after you
> disable/remove all active events (while you still have pending events).
>
Yes, it will unblock the case where active events are deactivated or
removed. But it will delay the unblocking until the next mux timer
expires. And I am saying this is too far away in many cases. For instance,
we do not run with the 1ms timer for uncore, this is way too much overhead.
Imagine this timer is set to 10ms or event 100ms, just with Song's patch, the
inactive events would have to wait for up to 100ms to be scheduled again.
This is not acceptable for us.

> > This patch fixes a scheduling problem in the core functions of
> > perf_events. Under certain conditions, some events would not be
> > scheduled even though many counters would be available. This
> > is related to multiplexing and is architecture agnostic and
> > PMU agnostic (i.e., core or uncore).
> >
> > This problem can easily be reproduced when you have two perf
> > stat sessions. The first session does not cause multiplexing,
> > let's say it is measuring 1 event, E1. While it is measuring,
> > a second session starts and causes multiplexing. Let's say it
> > adds 6 events, B1-B6. Now, 7 events compete and are multiplexed.
> > When the second session terminates, all 6 (B1-B6) events are
> > removed. Normally, you'd expect the E1 event to continue to run
> > with no multiplexing. However, the problem is that depending on
> > the state Of E1 when B1-B6 are removed, it may never be scheduled
> > again. If E1 was inactive at the time of removal, despite the
> > multiplexing hrtimer still firing, it will not find any active
> > events and will not try to reschedule. This is what Song's patch
> > fixes. It forces the multiplexing code to consider non-active events.
>
> This; so Song's patch fixes the fundamental problem of the rotation not
> working right under certain conditions.
>
> > However, the cause is not addressed. The kernel should not rely on
> > the multiplexing hrtimer to unblock inactive events. That timer
> > can have abitrary duration in the milliseconds. Until the timer
> > fires, counters are available, but no measurable events are using
> > them. We do not want to introduce blind spots of arbitrary durations.
>
> This I disagree with -- you don't get a guarantee other than
> timer_period/n when you multiplex, and idling the counters until the
> next tick doesn't violate that at all.

My take is that if you have free counters and "idling" events, the kernel
should take every effort to schedule them as soon as they become available.
In the situation I described in the patch, once I remove the active
events, there
is no more reasons for multiplexing, all the counters are free (ignore
watchdog).
Now you may be arguing, that it may take more time to ctx_resched() then to
wait for the timer to expire. But I am not sure I buy that. Similarly,
I am not sure
there is code to cancel an active mux hrtimer when we clear rotate_necessary.
Maybe we just let it lapse and clear itself via a ctx_sched_out() in
the rotation code.

>
> > This patch addresses the cause of the problem, by checking that,
> > when an event is disabled or removed and the context was multiplexing
> > events, inactive events gets immediately a chance to be scheduled by
> > calling ctx_resched(). The rescheduling is done on event of equal
> > or lower priority types. With that in place, as soon as a counter
> > is freed, schedulable inactive events may run, thereby eliminating
> > a blind spot.
>
> Disagreed, Song's patch removed the fundamental blind spot of rotation
> completely failing.
Sure it removed the infinite blocking of schedulable events. My patch addresses
the issue of having free counters following a deactivation/removal and
not scheduling
the idling events on them, thereby creating a blind spot where no
event is monitoring.

>
> This just slightly optimizes counter usage -- at a cost of having to
> reprogram the counters more often.
>
Only on deactivation/removal AND multiplexing so it is not every time but
only where there is an opportunity to keep the counters busy.

> Not saying we shouldn't do this, but this justification is just all
> sorts of wrong.

I think the patches are not mutually exclusive. They sort of complement each
other. My problem was that if you have free counters and idling events, you may
not need to do one more rotation and wait for it to resume active
collection, especially
when the timer may be set to a value larger than default. In my case,
I debug my issue
without Song's patch. I discovered it at the time I was ready to
submit my patch to LKML.
My patch by itself solved the problem I was seeing. Song patch is
needed to solve the
problem of the first flexible event not schedulable while other could
be. My patch does
not address that. It would address the case where that ref-cycles
pinned event is removed
and the flexible ref-cycles could take the counter immediately. So, I
am okay with you saying
this is an optimization and I think it is useful.