Re: [RFC 2/6] perf/core: add a rb-tree index to inactive_groups

From: Mark Rutland
Date: Tue Jan 10 2017 - 09:19:09 EST


On Tue, Jan 10, 2017 at 02:24:58AM -0800, David Carrillo-Cisneros wrote:
> Add an rb-tree that indexes inactive events by {CPU/cgroup,flexible,stamp}.
>
> The original idea by Peter Z. was to sort task events in an rb-tree using
> {pmu,cpu,timestamp} as key.
>
> Having the PMU as part of the key gets complicated for contexts that
> share pmus (i.e. software context) because all events in a context should
> rotate together irrespective of their pmu. It's also unclear to me that
> there is any case where a search by pmu is useful.

Using the PMU as the key is essential for correct operation where
heterogeneous HW PMUs share a context.

For example, on a big.LITTLE system, big and little CPU PMUs share the
same context, but their events are mutually incompatible. On big CPUs we
only want to consider the sub-tree of big events, and on little CPUs we
only want to consider little events. Hence, we need to be able to search
by PMU.
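
For concreteness, here's a rough (and completely untested) sketch of the
kind of lookup this enables, reusing the rbtree_node field from this
patch, but assuming events are sorted with the pmu pointer as the most
significant key:

/*
 * With the pmu as the primary sort key, all of a PMU's events form a
 * contiguous range in the tree: find the leftmost event for that PMU,
 * then iterate with rb_next() until the pmu no longer matches.
 */
static struct perf_event *
perf_first_event_for_pmu(struct rb_root *root, struct pmu *pmu)
{
	struct rb_node *node = root->rb_node;
	struct perf_event *event, *first = NULL;

	while (node) {
		event = rb_entry(node, struct perf_event, rbtree_node);

		if (pmu < event->pmu) {
			node = node->rb_left;
		} else if (pmu > event->pmu) {
			node = node->rb_right;
		} else {
			/* Match; keep searching left for the leftmost. */
			first = event;
			node = node->rb_left;
		}
	}

	return first;
}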

For SW PMUs, pmu::add() should never fail, and regardless of the order
of the list we should be able to pmu::add() all events. Given that, why
does the manner in which rotation occurs matter for SW PMUs?

> Another complication is that using ctx->time (or timestamp) implies that
> groups added during the same context switch may not have unique keys.
> This increases the complexity of the code that finds all events in the
> rb-tree that are within a given time interval.

Could you elaborate on this? I don't understand what the problem is
here. If we need uniqueness where {pmu,cpu,runtime} are equal, can't we
extend the comparison to {pmu,cpu,runtime,event pointer}? That way
everything we need is already implicit in the event, and we don't need
perf_event::rbtree_key nor do we need
perf_event_context::nr_inactive_added.

Using the runtime as part of the key was intended to provide fairer
scheduling, even when scheduling intervals are not uniform, etc. Not
using the runtime means we lose that fairness.
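
Something like the following, perhaps (completely untested; I'm using
event->total_time_running as a stand-in for whichever runtime figure we
decide to account, and taking the task context case as the example):

/*
 * Sort inactive events by {pmu, cpu, runtime, pointer}. The event
 * pointer carries no meaning; it's only a tie-breaker to keep keys
 * unique. Other sort criteria (e.g. pinned vs flexible) can slot in
 * at the appropriate point.
 */
static int taskctx_rbtree_cmp(struct perf_event *a, struct perf_event *b)
{
	if (a->pmu != b->pmu)
		return a->pmu < b->pmu ? -1 : 1;
	if (a->cpu != b->cpu)
		return a->cpu < b->cpu ? -1 : 1;
	if (a->total_time_running != b->total_time_running)
		return a->total_time_running < b->total_time_running ? -1 : 1;
	if (a != b)
		return a < b ? -1 : 1;
	return 0;
}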

> Lastly, it is useful to query pinned and flexible events separately since
> they are scheduled in at different times.

Using this as part of the sort makes sense to me.

> For the reasons above, I created an rb-tree per context with key
> {CPU,flexible,stamp} for task contexts and {cgroup,flexible,stamp} for
> CPU contexts.
>
> The "flexible" boolean allows to query pinned or flexible events
> separately.
> The stamp is given by a non-decreasing counter: ctx->nr_inactive_added.
> It increases every time a new inactive event is inserted. That choice of
> stamp guarantees unique keys for all events and that events of the same
> type (same {CPU/cgroup,flexible}) have the same order in the rb-tree.

As above, I think we can use the event pointer for this, without having
to maintain a new counter.

> When events are scheduled in or rotated, all events in the context must be
> iterated or rotated together, irrespective of the CPU/cgroup. To do that,
> we add ctx->inactive_groups, a list that "threads" the rb-tree in total
> ctx->nr_inactive_added order. Note that this order is the same as timestamp
> order and ctx->inactive_groups is used for both scheduling and iteration.
> The rb-tree can be seen as an index over ctx->inactive_groups.

As above, I think we must use runtime as part of the key, and with that
done, we only need the rb tree, and not the inactive_groups list.

With runtime as part of the key, rotation is not necessary, since
insertion keeps the tree sorted and we can simply pick the nodes with
the shortest runtime. Hence, no special rotation code is needed -- we
simply make all events inactive, then reschedule.
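
Roughly (untested, and eliding the removal of now-active events from
the tree, locking, and the pinned vs flexible distinction):

/*
 * With runtime as part of the key, an in-order walk visits the
 * least-serviced groups first, so "rotation" falls out of simply
 * rescheduling from an empty state.
 */
static void ctx_sched_in_inactive(struct perf_event_context *ctx,
				  struct perf_cpu_context *cpuctx)
{
	struct perf_event *event;
	struct rb_node *node;

	for (node = rb_first(&ctx->rbtree_root); node; node = rb_next(node)) {
		event = rb_entry(node, struct perf_event, rbtree_node);

		if (!event_filter_match(event))
			continue;

		/* Stop once the PMU has no more room. */
		if (group_sched_in(event, cpuctx, ctx))
			break;
	}
}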

> Signed-off-by: David Carrillo-Cisneros <davidcc@xxxxxxxxxx>
> ---
> include/linux/perf_event.h | 5 +++
> kernel/events/core.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 87 insertions(+)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 3fa18f05c9b0..fd32ecc37d33 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -564,6 +564,9 @@ struct perf_event {
> struct list_head group_entry;
> struct list_head sibling_list;
>
> + u64 rbtree_key;
> + struct rb_node rbtree_node;
> +
> /*
> * We need storage to track the entries in perf_pmu_migrate_context; we
> * cannot use the event_entry because of RCU and we want to keep the
> @@ -736,6 +739,8 @@ struct perf_event_context {
> struct list_head pinned_groups;
> struct list_head flexible_groups;
>
> + struct rb_root rbtree_root;
> + u32 nr_inactive_added;
> struct list_head active_pinned_groups;
> struct list_head active_flexible_groups;
> struct list_head inactive_groups;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index b744b5a8dbd0..623d81c0ca93 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1462,19 +1462,98 @@ ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
> return &ctx->flexible_groups;
> }
>
> +/*
> + * The bits of perf_event::rbtree_key represent:
> + * - 63:33 a unique identifier for the CPU (if a task context) or a cgroup
> + * (if a CPU context).
> + * - 32 a boolean to indicate if the event is flexible (vs pinned).
> + * - 31:0 a unique "stamp" that follows the last time the event was
> + * scheduled.
> + * The 64 bit value groups events of the same type (CPU/cgroup + flexible)
> + * together in the rb-tree.
> + */
> +#define RBTREE_KEY_STAMP_WIDTH 32
> +#define RBTREE_KEY_STAMP_MASK GENMASK_ULL(RBTREE_KEY_STAMP_WIDTH - 1, 0)
> +#define RBTREE_KEY_FLEXIBLE_MASK BIT_ULL(RBTREE_KEY_STAMP_WIDTH)
> +
> +static u64 taskctx_rbtree_key(int cpu, bool flexible)
> +{
> + /*
> + * Use CPU only. The PMU is never used in schedule in/out and, since some
> + * contexts share PMUs, iterating over them would make things complicated.
> + * I could not find a case where an ordered iteration over all of a PMU's
> + * events in one context is useful.
> + */
> + return ((u64)cpu << (RBTREE_KEY_STAMP_WIDTH + 1)) |
> + (flexible ? RBTREE_KEY_FLEXIBLE_MASK : 0);
> +}
> +
> +static u64 cpuctx_rbtree_key(struct perf_cgroup *cgrp, bool flexible)
> +{
> + u64 k;
> +
> + if (cgrp)
> + /* A cheap way to obtain an identifier for a cgroup. Suggestions appreciated. */
> + k = (u64)cgrp->css.id << (RBTREE_KEY_STAMP_WIDTH + 1);
> + else
> + k = GENMASK_ULL(63, RBTREE_KEY_STAMP_WIDTH + 1);
> + return k | (flexible ? RBTREE_KEY_FLEXIBLE_MASK : 0);
> +}

Can we turn these into {task,cpu}ctx_rbtree_cmp() functions which
compare two events dynamically?

Given that event->task can distinguish the two cases, we could just have
an event_rbtree_cmp() (with an assert that event_a->task ==
event_b->task).
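
i.e. something like the below (untested; I've used event->ctx->task to
distinguish the two cases, though any field which reliably does so
would work, with taskctx_rbtree_cmp() as sketched above, and a
cpuctx_rbtree_cmp() comparing on the cgroup rather than the CPU):

static int event_rbtree_cmp(struct perf_event *a, struct perf_event *b)
{
	/* Events in the same tree must hang off the same context. */
	WARN_ON_ONCE(a->ctx != b->ctx);

	if (a->ctx->task)
		return taskctx_rbtree_cmp(a, b);

	return cpuctx_rbtree_cmp(a, b);
}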

Thanks,
Mark.