Re: [RFC 3/6] perf/core: use rb-tree to sched in event groups
From: David Carrillo-Cisneros
Date: Fri Jan 13 2017 - 03:01:15 EST
On Thu, Jan 12, 2017 at 4:14 AM, Mark Rutland <mark.rutland@xxxxxxx> wrote:
> On Tue, Jan 10, 2017 at 12:51:58PM -0800, David Carrillo-Cisneros wrote:
>> On Tue, Jan 10, 2017 at 8:38 AM, Mark Rutland <mark.rutland@xxxxxxx> wrote:
>> > Hi,
>> >
>> > On Tue, Jan 10, 2017 at 02:24:59AM -0800, David Carrillo-Cisneros wrote:
>> >> During sched in, only events that match the current CPU and cgroup can
>> >> be scheduled in. These events should be added to the PMU in increasing
>> >> timestamp order to guarantee fairness in the event rotation.
>> >>
>> >> In task contexts, events with no CPU affinity or with affinity to the
>> >> current CPU are eligible.
>> >> In CPU contexts, events with no cgroup restriction (CPU events) or with
>> >> a cgroup that is current (either current's task cgroup or one of its
>> >> ancestor) are eligible.
>> >>
>> >> For both context types, we need to iterate over events with different
>> >> rb-tree key groups (different CPUs or different cgroups) in timestamp
>> >> order.
>> >>
>> >> Finding the events in timestamp order is at odds with indexing by
>> >> CPU/cgroup.
>> >
>> > That's a fair point. If we sort by CPU before runtime, we'll get per-CPU
>> > subtrees but we won't get fairness unless we sort the events solely by
>> > runtime at sched_in time. If we sort with runtime before CPU, we'll have
>> > to skip events not targeting the current CPU when scheduling task events
>> > in. I note the latter is true today anyhow.
>>
>> That's where ctx->inactive_groups comes in. That list is sorted by runtime,
>> and the rb-tree is used to skip to the part of the list that has the events
>> that matter.
>
> Is the list only sorted by runtime and not {cpu,runtime}? If it's the
> latter, I'm not sure I follow. If it's the former, sorry for the noise!
The former; the list is sorted only by runtime.
>
> The case I'm worried about is a set of task-bound events that have CPU
> filters. For example, if the user opens a set of task-bound events for
> any CPU:
>
> perf_event_open(attr1, pid, -1, -1, 0);
> perf_event_open(attr2, pid, -1, -1, 0);
>
> ... and also some for the same task, but limited to a specific CPU:
>
> perf_event_open(attr3, pid, 1, -1, 0);
> perf_event_open(attr4, pid, 1, -1, 0);
>
> ... if CPU is before runtime in the sort, one of these groups will
> always be considered first when scheduling, and may starve the other
> group.
>
> Today, the rotation gives us some degree of fairness here that we would
> lose.
Yes, that case is the reason to have both the sorted inactive_groups list and
the rb-tree. I tried to explain this in the changelog of this patch; please
see a new attempt below.
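
As a toy illustration (plain C, made-up struct and numbers, not the patch's
code): with the four events from your example all runnable on CPU 1, ordering
the inactive list by timestamp alone keeps rotating through both the any-CPU
and the CPU-1 events, so neither group starves:

/* Toy model: each round, schedule the event that has waited longest and
 * give it a fresh timestamp.  Names and values are illustrative only. */
#include <stdint.h>
#include <stdio.h>

struct toy_event {
	const char *name;
	uint64_t tstamp;
};

int main(void)
{
	struct toy_event ev[] = {
		{ "attr1 (any CPU)", 1 },
		{ "attr2 (any CPU)", 2 },
		{ "attr3 (CPU 1)",   3 },
		{ "attr4 (CPU 1)",   4 },
	};
	uint64_t now = 4;

	for (int round = 0; round < 8; round++) {
		int oldest = 0;

		/* Pick the smallest timestamp, i.e. the longest waiter. */
		for (int i = 1; i < 4; i++)
			if (ev[i].tstamp < ev[oldest].tstamp)
				oldest = i;

		printf("round %d: sched in %s\n", round, ev[oldest].name);
		ev[oldest].tstamp = ++now;	/* back of the line */
	}
	return 0;
}

It prints attr1..attr4 in round-robin order, round after round.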
>
>> > In Peter's original suggestion we didn't sort by cgroup. IIRC there was
>> > some email thread where the cgroup was considered for the sort (maybe
>> > that was *only* for cpu contexts? I'm not too familiar with cgroups),
>> > though I can't find the relevant mail, if it existed. :/
>>
>> FWIW, in this approach, we only sort by cgroup in CPU contexts, since cgroup
>> events are only installed in CPU contexts.
>
> Sure. However, I think a similar issue to the above applies when
> scheduling events where some are bound to a specific cgroup, and others
> are not.
Yes, it's addressed in the same way.
For both task and cpu contexts the idea is the same:

1. For each "group" of events (either pmu/cpu or cgroup/NULL) that is
   eligible to sched_in on the current cpu/cgroup, find the pair of
   minimum and maximum events in timestamp order, using the rb-tree.
2. Each (min,max) pair is an interval in the ctx->inactive_groups list
   (the list is sorted by timestamp only).
3. Merge the (min,max) pairs of all eligible cpus/cgroups to obtain an
   overall (min,max). That's the interval of ctx->inactive_groups for
   ctx_sched_in to use.

The overall interval may contain ineligible events (i.e. events bound
to a CPU or cgroup that cannot be sched in), but that's ok because:

a) events are filtered using event_filter_match anyway, and
b) since events are sorted by timestamp, events that do not pass the
   filter fall behind in the list on each ctx switch in/out cycle, so
   all ineligible events are left behind after a few cycles of ctx sw
   in/out. This works in amortized linear time in the worst case; the
   worst case occurs after a CPU/cgroup change. The stationary case is
   optimal.
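
To make the steps above concrete, here is a minimal, self-contained userspace
sketch (plain C, not the patch's kernel code): the array stands in for the
timestamp-sorted ctx->inactive_groups list, find_group_min_max() for the
rb-tree min/max lookup, and filter_match() for event_filter_match(). All
names, the data layout and the numbers are made up for illustration.

/*
 * Toy userspace sketch (not kernel code): the array stands in for the
 * timestamp-sorted ctx->inactive_groups list, find_group_min_max() for the
 * rb-tree min/max lookup, and filter_match() for event_filter_match().
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct sketch_event {
	int key;		/* stand-in for the rb-tree key: a CPU or a cgroup */
	uint64_t tstamp;	/* last time the event was (de)scheduled */
};

/*
 * Stand-in for the rb-tree lookup of step 1: among inactive events with the
 * given key, return the smallest and largest timestamps.  The real tree does
 * this with an O(log n) descent; a linear scan keeps the sketch short.
 */
static bool find_group_min_max(const struct sketch_event *ev, int nr, int key,
			       uint64_t *min, uint64_t *max)
{
	bool found = false;

	for (int i = 0; i < nr; i++) {
		if (ev[i].key != key)
			continue;
		if (!found || ev[i].tstamp < *min)
			*min = ev[i].tstamp;
		if (!found || ev[i].tstamp > *max)
			*max = ev[i].tstamp;
		found = true;
	}
	return found;
}

/* Stand-in for event_filter_match(): is this event eligible right now? */
static bool filter_match(const struct sketch_event *e, const int *keys, int nr)
{
	for (int i = 0; i < nr; i++)
		if (e->key == keys[i])
			return true;
	return false;
}

int main(void)
{
	/* Inactive events, kept sorted by timestamp (the inactive list). */
	struct sketch_event list[] = {
		{ .key = -1, .tstamp = 10 },	/* any-CPU event  */
		{ .key =  3, .tstamp = 20 },	/* bound to CPU 3 */
		{ .key =  1, .tstamp = 30 },	/* bound to CPU 1 */
		{ .key = -1, .tstamp = 40 },	/* any-CPU event  */
	};
	int nr = sizeof(list) / sizeof(list[0]);

	/* Groups eligible on this CPU: "any CPU" (-1) and, say, CPU 1. */
	int elig[] = { -1, 1 };
	int nr_elig = sizeof(elig) / sizeof(elig[0]);

	uint64_t omin = UINT64_MAX, omax = 0;
	bool any = false;

	/* Steps 1 and 3: per-group (min,max) merged into one overall interval. */
	for (int i = 0; i < nr_elig; i++) {
		uint64_t gmin, gmax;

		if (!find_group_min_max(list, nr, elig[i], &gmin, &gmax))
			continue;
		if (gmin < omin)
			omin = gmin;
		if (gmax > omax)
			omax = gmax;
		any = true;
	}
	if (!any)
		return 0;

	/*
	 * Step 2 plus the sched_in walk: only the [omin, omax] slice of the
	 * timestamp-sorted list is visited; the ineligible CPU 3 event inside
	 * the interval is skipped by the filter, as in event_filter_match().
	 */
	for (int i = 0; i < nr; i++) {
		if (list[i].tstamp < omin || list[i].tstamp > omax)
			continue;
		if (!filter_match(&list[i], elig, nr_elig))
			continue;
		printf("sched in: key=%d tstamp=%llu\n", list[i].key,
		       (unsigned long long)list[i].tstamp);
	}
	return 0;
}

With these made-up numbers the merged interval is [10,40]; it contains the
CPU 3 event, which the filter skips, while the other three are scheduled in
timestamp order.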
Thanks,
David
>
> Thanks,
> Mark.