Re: [RFC 3/6] perf/core: use rb-tree to sched in event groups
From: Mark Rutland
Date: Tue Jan 10 2017 - 11:40:03 EST
Hi,
On Tue, Jan 10, 2017 at 02:24:59AM -0800, David Carrillo-Cisneros wrote:
> During sched in, only events that match the current CPU and cgroup can
> be scheduled in. These events should be added to the PMU in increasing
> timestamp order to guaratee fairness in the event rotation.
>
> In task contexts, events with no CPU affinity or with affinity to the
> current CPU are eligible.
> In CPU contexts, events with no cgroup restriction (CPU events) or with
> a cgroup that is current (either current's task cgroup or one of its
> ancestor) are eligible.
>
> For both context types, we need to iterate over events with different
> rb-tree key groups (different CPUs or different cgroups) in timestamp
> order.
>
> Finding the events in timestamp order is at odds with indexing by
> by CPU/cgroup.
That's a fair point. Sorting by CPU before runtime we'll get subtrees we
won't get fairness unless we sort the events solely by runtime at
sched_in time. If we sort by with runtime before CPU we'll have to skip
events not targeting the current CPU when scheduling task events in. I
note the latter is true today anyhow.
In Peter's original suggestion we didn't sort by cgroup. IIRC there was
some email thread where the cgroup was considered for the sort (maybe
that was *only* for cpu contexts? I'm not too familiar with cgroups),
though I can't find the relevant mail, if it existed. :/
Peter, did you have an idea as to how to handle that, or am I imagining
things here?
Kan, in your per-cpu event list patch you mentioned that you saw a large
overhead in perf_iterate_ctx() when skipping events for other CPUs.
Which callers of perf_iterate_ctx() specifically was that problematic
for? Do those callers only care about the *active* events, for example?
Maybe the overhead of skipping !current_cpu events is ok at sched_in
time in most cases. If the overhead of skipping those only matters for a
subset of perf_iterate_ctx() callers, then maybe we can optimise them in
another fashion (e.g. use the active events lists, or a new list
specific to that iterate user, depending on what they actually need).
That way we can drop cpu from the sort.
> The rb-tree allows us to find events with minimum and
> maximum timestamp for a given CPU/cgroup + flexible type. The list
> ctx->inactive_groups is sorted by timestamp.
>
> We could find a list position for the first event of each CPU/cgroup that
> is to be scheduled and iterate over all of them, selecting events from
> the list's head with the smallest timestampt, but it's too complicated.
>
> A simpler alternative is to find the smallest subinterval of
> ctx->inactive_groups that contains all eligible events. Let's call this
> minimum subinterval S.
>
> S is formed of smaller subintervals, no necessarily exclusive, intervals.
> Each one has all the events that are eligible for a given CPU or cgroup.
> We find S by searching for the start/end of each one of these CPU/cgroup
> subintervals and combining them. The drawback is that there may be
> events in S that are not eligible (since ctx->inactive_group is in stamp
> order).
The other drawback is that this is not fair, since CPU comes before
runtime in the sort order. You'll always try some events before others
(e.g. cpu == -1 before cpu == current), before considering runtime. I
believe this means that events can be permanently starved.
So either we need to fold those together somehow, or drop CPU from the
sort order (assuming that we can, as above).
Thanks,
Mark.