Re: [RFC 2/6] perf/core: add a rb-tree index to inactive_groups

From: David Carrillo-Cisneros
Date: Tue Jan 10 2017 - 15:20:10 EST


On Tue, Jan 10, 2017 at 6:14 AM, Mark Rutland <mark.rutland@xxxxxxx> wrote:
> On Tue, Jan 10, 2017 at 02:24:58AM -0800, David Carrillo-Cisneros wrote:
>> Add a rb-tree that indexes inactive events by {CPU/cgroup,flexible,stamp}.
>>
>> The original idea by Peter Z. was to sort task events in an rb-tree using
>> {pmu,cpu,timestamp} as key.
>>
>> Having the PMU as part of the key gets complicated for contexts that
>> share pmus (i.e. software context) because all events in a context should
>> rotate together irrespective of their pmu. It's also unclear to me that
>> there is any case where a seach by pmu is useful.
>
> Using the PMU as the key is essential for correct operation where
> heterogeneous HW PMUs share a context.
>
> For example, on a big.LITTLE system, big and little CPU PMUs share the
> same context, but their events are mutually incompatible. On big CPUs we
> only want to consider the sub-tree of big events, and on little CPUs we
> only want to consider little events. Hence, we need to be abel to search
> by PMU.

I see it now. So, if PMU were added to the rb-tree keys. How can the
generic code know what's the PMU of the current CPU?

>
> For SW PMUs, pmu::add() should never fail, and regardless of the order
> of the list we should be able to pmu::add() all events. Given that, why
> does the manner in which rotation occurs matter for SW PMUs?
>
>> Another complicatino is that using ctx->time (or timestamp) implies that
>> groups added during the same context switch may not have unique key.
>> This increases the complexity of that finds all events in the rb-tree
>> that are within a time interval.
>
> Could you elaborate on this? I don't understand what the problem is
> here. If we need uniqueness where {pmu,cpu,runtime} are equal, can't we
> extend the comparison to {pmu,cpu,runtime,event pointer}? That way
> everything we need is already implicit in the event, and we don't need
> perf_event::rbtree_key nor do we need
> perf_event_context::nr_inactive_added.

Yes, we could extend the comparison. But I am trying to keep the key a
u64 to speed up things.

I found it easier to simply create a counter and use it as an equivalent to
(timestamp, unique id). Both ways induce the same order of events.

>
> Using the runtime as part of the key was intended to provide fairer
> scheduling, even when scheduling intervals are not uniform, etc. Not
> using the runtime means we lose that fairness.

A counter that increases every time an event is added to the rbtree
induces the same order as a (timestamp, <some unique id>).

>
>> Lastly, it is useful to query pinned and flexible events separately since
>> they are scheduled in at different times.
>
> Using this as part of the sort makes sense to me.
>
>> For the reasons above, I created a rb-tree per context with key
>> {CPU,flexible,stamp} for task contexts and {cgroup,flexible,stamp} for
>> CPU contexts.
>>
>> The "flexible" boolean allows to query pinned or flexible events
>> separately.
>> The stamp is given by a non-decreasing counter: ctx->nr_events_added.
>> It increases every time a new inactive event is inserted. That choice of
>> stamp guarantees unique keys for all events and that events of the same
>> type (same {CPU/cgroup,flexible}) have the same order in the rb-tree.
>
> As above, I think we can use the event pointer for this, without having
> to maintain a new counter.

Same as above. I am trying to have a unique 64 bit key. I'd argue that
the new counter only adds a couple of lines of code and 32 bits of storage
but makes the rb-tree key comparison simpler and faster.

>
>> When events are scheduled in or rotated, all events in the context must be
>> iterated or rotated together, irrespective of the CPU/cgroup. To do that,
>> we add ctx->inactive_groups, a list that "threads" the rb-tree in total
>> ctx->nr_events_added order. Note that this order is the same as timestamp
>> order and ctx->inactive_groups is used for both scheduling and iteration.
>> The rb-tree can be seen as an index over ctx->inactive_groups.
>
> As above, I think we must use runtime as part of the key, and with that
> done, we only need the rb tree, and not the inactive_groups list.

The inactive_groups list makes the rb-tree "threaded" and follows
timestamp (or stamp) order. It makes iterating over events much
faster.

>
> With runtime as part of the key, rotation is not necessary, since
> insertion will sort the list, and then we can pick all the nodes with
> the shortest runtime. Hence, no special rotation code is necessary -- we
> simply make all events inactive, then reschedule.

Right, that's done later in this series.

>
>> Signed-off-by: David Carrillo-Cisneros <davidcc@xxxxxxxxxx>
>> ---
>> include/linux/perf_event.h | 5 +++
>> kernel/events/core.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 87 insertions(+)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 3fa18f05c9b0..fd32ecc37d33 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -564,6 +564,9 @@ struct perf_event {
>> struct list_head group_entry;
>> struct list_head sibling_list;
>>
>> + u64 rbtree_key;
>> + struct rb_node rbtree_node;
>> +
>> /*
>> * We need storage to track the entries in perf_pmu_migrate_context; we
>> * cannot use the event_entry because of RCU and we want to keep the
>> @@ -736,6 +739,8 @@ struct perf_event_context {
>> struct list_head pinned_groups;
>> struct list_head flexible_groups;
>>
>> + struct rb_root rbtree_root;
>> + u32 nr_inactive_added;
>> struct list_head active_pinned_groups;
>> struct list_head active_flexible_groups;
>> struct list_head inactive_groups;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index b744b5a8dbd0..623d81c0ca93 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -1462,19 +1462,98 @@ ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
>> return &ctx->flexible_groups;
>> }
>>
>> +/*
>> + * The bits perf_event::kbtree_key represent:
>> + * - 63:33 an unique identifier for CPU (if a task context) or a cgroup
>> + * (if a CPU context).
>> + * - 32 a boolean to indicate if eventt is flexible (vs pinnned).
>> + * - 31:0 a unique "stamp" that follows the last time the event was
>> + * scheduled.
>> + * The 64 bits value groups event of the same type (CPU/cgroup + flexible)
>> + * together in the rb-tree.
>> + */
>> +#define RBTREE_KEY_STAMP_WIDTH 32
>> +#define RBTREE_KEY_STAMP_MASK GENMASK_ULL(RBTREE_KEY_STAMP_WIDTH - 1, 0)
>> +#define RBTREE_KEY_FLEXIBLE_MASK BIT_ULL(RBTREE_KEY_STAMP_WIDTH)
>> +
>> +static u64 taskctx_rbtree_key(int cpu, bool flexible)
>> +{
>> + /*
>> + * Use CPU only. PMU is never used in schedule in/out and, since some
>> + * contexts share PMU, iterate over them would make things complicated.
>> + * I could not find a case where an ordered iteration over all PMU
>> + * events in one context is useful.
>> + */
>> + return ((u64)cpu << (RBTREE_KEY_STAMP_WIDTH + 1)) |
>> + (flexible ? RBTREE_KEY_FLEXIBLE_MASK : 0);
>> +}
>> +
>> +static u64 cpuctx_rbtree_key(struct perf_cgroup *cgrp, bool flexible)
>> +{
>> + u64 k;
>> +
>> + if (cgrp)
>> + /* A cheap way to obtain an identifier for a cgroup. Suggestions appreciated. */
>> + k = (u64)cgrp->css.id << (RBTREE_KEY_STAMP_WIDTH + 1);
>> + else
>> + k = GENMASK_ULL(63, RBTREE_KEY_STAMP_WIDTH + 1);
>> + return k | (flexible ? RBTREE_KEY_FLEXIBLE_MASK : 0);
>> +}
>
> Can we turn these info {task,cpu}ctx_rbtree_cmp() functions which
> compare two events dynamically?
>
> Given that event->task can distinguish the two cases, we could just have
> a event_rbtree_cmp() (with an assert that event_a->task ==
> event_b->task).

I tried to avoid recreating the key every time there is a comparison.
Currently the key is stored into the u64 event->rbtree_key and only recreated
when the stamp changes.

The cmp wrapper makes sense if we go away from the u64 keys and add
more fields to the comparison.

In summary:

- The stamp (i.e. ctx->nr_inactive_added) induces the same order than the
timestamp at the time the event is added to the rb-tree. But it guarantees
uniqueness and is denser (fits in 32bits).

- The inactive_groups list threads the rb-tree and is sorted in stamp order
(same as timestamp order). This is the list used to iterate during
ctx_sched_{in,out}.
The rb-tree only creates a index (or skiplist) on these list so that we skip
most of the events that do not mach the current CPU/cgroup (or pmu,
optionally).

- I see now why PMU is useful for big.LITTLE. If there is a way for
generic code
to retrieve the PMU of a CPU, I can add it to the key for those PMUs with
PERF_PMU_CAP_HETEROGENEOUS_CPUS capability.

>
> Thanks,
> Mark.