Re: [RFC 3/6] perf/core: use rb-tree to sched in event groups
From: David Carrillo-Cisneros
Date: Tue Jan 10 2017 - 15:52:03 EST
On Tue, Jan 10, 2017 at 8:38 AM, Mark Rutland <mark.rutland@xxxxxxx> wrote:
> Hi,
>
> On Tue, Jan 10, 2017 at 02:24:59AM -0800, David Carrillo-Cisneros wrote:
>> During sched in, only events that match the current CPU and cgroup can
>> be scheduled in. These events should be added to the PMU in increasing
>> timestamp order to guaratee fairness in the event rotation.
>>
>> In task contexts, events with no CPU affinity or with affinity to the
>> current CPU are eligible.
>> In CPU contexts, events with no cgroup restriction (CPU events) or with
>> a cgroup that is current (either current's task cgroup or one of its
>> ancestor) are eligible.
>>
>> For both context types, we need to iterate over events with different
>> rb-tree key groups (different CPUs or different cgroups) in timestamp
>> order.
>>
>> Finding the events in timestamp order is at odds with indexing by
>> by CPU/cgroup.
>
> That's a fair point. Sorting by CPU before runtime we'll get subtrees we
> won't get fairness unless we sort the events solely by runtime at
> sched_in time. If we sort by with runtime before CPU we'll have to skip
> events not targeting the current CPU when scheduling task events in. I
> note the latter is true today anyhow.
That's were ctx->inactive_groups comes in. That list is sorted by runtime
and the rb-tree is used to skip to the part of the list that has the events
that matter.
>
> In Peter's original suggestion we didn't sort by cgroup. IIRC there was
> some email thread where the cgroup was considered for the sort (maybe
> that was *only* for cpu contexts? I'm not too familiar with cgroups),
> though I can't find the relevant mail, if it existed. :/
FWIW, in this approach, we only sort by cgroup in CPU contexts, since cgroup
events are only installed in CPU contexts.
>
> Peter, did you have an idea as to how to handle that, or am I imagining
> things here?
>
> Kan, in your per-cpu event list patch you mentioned that you saw a large
> overhead in perf_iterate_ctx() when skipping events for other CPUs.
> Which callers of perf_iterate_ctx() specifically was that problematic
> for? Do those callers only care about the *active* events, for example?
>
> Maybe the overhead of skipping !current_cpu events is ok at sched_in
> time in most cases. If the overhead of skipping those only matters for a
> subset of perf_iterate_ctx() callers, then maybe we can optimise them in
> another fashion (e.g. use the active events lists, or a new list
> specific to that iterate user, depending on what they actually need).
> That way we can drop cpu from the sort.
>
>> The rb-tree allows us to find events with minimum and
>> maximum timestamp for a given CPU/cgroup + flexible type. The list
>> ctx->inactive_groups is sorted by timestamp.
>>
>> We could find a list position for the first event of each CPU/cgroup that
>> is to be scheduled and iterate over all of them, selecting events from
>> the list's head with the smallest timestampt, but it's too complicated.
>>
>> A simpler alternative is to find the smallest subinterval of
>> ctx->inactive_groups that contains all eligible events. Let's call this
>> minimum subinterval S.
>>
>> S is formed of smaller subintervals, no necessarily exclusive, intervals.
>> Each one has all the events that are eligible for a given CPU or cgroup.
>> We find S by searching for the start/end of each one of these CPU/cgroup
>> subintervals and combining them. The drawback is that there may be
>> events in S that are not eligible (since ctx->inactive_group is in stamp
>> order).
>
> The other drawback is that this is not fair, since CPU comes before
> runtime in the sort order. You'll always try some events before others
> (e.g. cpu == -1 before cpu == current), before considering runtime. I
> believe this means that events can be permanently starved.
>
> So either we need to fold those together somehow, or drop CPU from the
> sort order (assuming that we can, as above).
We do that by iterating over ctx->inactive_groups, that is sorted by timestamp.
That's how we keep the fairness.
>
> Thanks,
> Mark.