Re: [RFC 3/6] perf/core: use rb-tree to sched in event groups
From: Mark Rutland
Date: Thu Jan 12 2017 - 07:21:52 EST
On Tue, Jan 10, 2017 at 12:51:58PM -0800, David Carrillo-Cisneros wrote:
> On Tue, Jan 10, 2017 at 8:38 AM, Mark Rutland <mark.rutland@xxxxxxx> wrote:
> > Hi,
> >
> > On Tue, Jan 10, 2017 at 02:24:59AM -0800, David Carrillo-Cisneros wrote:
> >> During sched in, only events that match the current CPU and cgroup can
> >> be scheduled in. These events should be added to the PMU in increasing
> >> timestamp order to guaratee fairness in the event rotation.
> >>
> >> In task contexts, events with no CPU affinity or with affinity to the
> >> current CPU are eligible.
> >> In CPU contexts, events with no cgroup restriction (CPU events) or with
> >> a cgroup that is current (either current's task cgroup or one of its
> >> ancestor) are eligible.
> >>
> >> For both context types, we need to iterate over events with different
> >> rb-tree key groups (different CPUs or different cgroups) in timestamp
> >> order.
> >>
> >> Finding the events in timestamp order is at odds with indexing by
> >> by CPU/cgroup.
> >
> > That's a fair point. Sorting by CPU before runtime we'll get subtrees we
> > won't get fairness unless we sort the events solely by runtime at
> > sched_in time. If we sort by with runtime before CPU we'll have to skip
> > events not targeting the current CPU when scheduling task events in. I
> > note the latter is true today anyhow.
>
> That's were ctx->inactive_groups comes in. That list is sorted by runtime
> and the rb-tree is used to skip to the part of the list that has the events
> that matter.
Is the list only sorted by runtime and not {cpu,runtime}? If it's the
latter, I'm not sure I follow. If it's the former, sorry for the noise!
The case I'm worried about is a set of task-bound events that have CPU
filters. For example, if the user opens a set of task-bound events for
any CPU:
perf_event_open(attr1, pid, -1, -1, 0);
perf_event_open(attr2, pid, -1, -1, 0);
... and also some for the same task, but limited to a specific CPU:
perf_event_open(attr3, pid, 1, -1, 0);
perf_event_open(attr4, pid, 1, -1, 0);
... if CPU is before runtime in the sort, one of these groups will
always be considered first when scheduling, and may starve the other
group.
Today, the rotation gives us some degree of fairness here that we would
lose.
> > In Peter's original suggestion we didn't sort by cgroup. IIRC there was
> > some email thread where the cgroup was considered for the sort (maybe
> > that was *only* for cpu contexts? I'm not too familiar with cgroups),
> > though I can't find the relevant mail, if it existed. :/
>
> FWIW, in this approach, we only sort by cgroup in CPU contexts, since cgroup
> events are only installed in CPU contexts.
Sure. However, I think a similar issue to the above applies when
scheduling events where some are bound to a specific cgroup, and others
are not.
Thanks,
Mark.