Re: [RFC] perf: a different approach to perf_rotate_context()

From: Jiri Olsa
Date: Sat Mar 03 2018 - 08:52:13 EST


On Thu, Mar 01, 2018 at 11:53:21AM -0800, Song Liu wrote:
> When there are more perf_events than hardware PMCs, perf rotates events
> so that all events get a chance to run. Currently, the rotation works as:
> sched_out flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
> rotate_left flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
> try sched_in flexible_groups in cpuctx->ctx;
> try sched_in flexible_groups in cpuctx->task_ctx.
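
just to recap, the current perf_rotate_context() path is roughly the
below.. simplified sketch, locking and the hrtimer rearm left out:

	/* 1) unschedule flexible groups in both contexts */
	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
	if (ctx)
		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);

	/* 2) move the first flexible group to the tail of each list */
	rotate_ctx(&cpuctx->ctx);
	if (ctx)
		rotate_ctx(ctx);

	/* 3) re-schedule: cpuctx flexible groups first, then task_ctx */
	perf_event_sched_in(cpuctx, ctx, current);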
>
> This approach has some potential issues:
> 1. if different rotations of flexible_groups in cpuctx->ctx occupy
> all hardware PMCs, flexible_groups in cpuctx->task_ctx cannot run
> at all.
> 2. if pinned_groups occupy all hardware PMCs, the rotation still triggers
> every perf_event_mux_interval_ms, but it cannot schedule any events.
> 3. since flexible_groups in cpuctx->ctx and cpuctx->task_ctx are
> rotated separately, there are N x M possible combinations. It is
> difficult to remember and reuse all of these combinations. As a
> result, it is necessary to try to sched_in the flexible_groups on
> each rotation.
>
> This patch tries to do the rotation differently. Each perf_event in the
> cpuctx (ctx and task_ctx) is assigned a rotation_id. The rotation_id's
> are assigned during the first few rotations after any changes in
> perf_events attached to the cpuctx. Once rotation_id's are assigned
> for all events in the cpuctx, perf_rotate_context() simply
> picks the next rotation to use, so there is no more "try to sched_in"
> for future rotations.
>
> Special rotation_id's are introduced to handle the issues above.
> flexible_groups that conflict with pinned_groups are marked as
> ALWAYS_OFF, so they are not rotated (fixes issue 2). flexible_groups
> in cpuctx->ctx and cpuctx->task_ctx are rotated together, so they all get
> an equal chance to run (improves issue 1).
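
so IIUC the rotation_id space ends up something like below.. names are
taken from the hunks, the actual values/encoding are my guess:

	/* special ids (values are a guess, only the names are from the patch) */
	#define PERF_ROTATION_ID_NOT_ASSGINED	-1	/* not placed in any rotation yet */
	#define PERF_ROTATION_ID_ALWAYS_ON	-2	/* software groups, scheduled in every rotation */
	#define PERF_ROTATION_ID_ALWAYS_OFF	-3	/* conflicts with pinned groups, never scheduled */

	/*
	 * ids 0 .. ctx->num_rotations-1 are the regular rotation slots,
	 * ctx->curr_rotation selects the slot that runs in this interval
	 */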

hum, so the improvement is that cpuctx could eventually give
up some space for task_ctx events, but both ctxs still rotate
separately no? you keep rotation ID per single context..

>
> With this approach, we only do complex scheduling of flexible_groups
> once. This enables us to do more complex scheduling, for example, sharing
> PMU counters across compatible events:
> https://lkml.org/lkml/2017/12/1/410.

how could this code do that?

>
> There are also some potential downsides of this approach.
>
> First, it gives all flexible_groups exactly the same chance to run, so it
> may waste some PMC cycles. For example, if 5 groups, ABCDE, are assigned
> to two rotations: rotation-0: ABCD and rotation-1: E, this approach will
> NOT try any of ABCD in rotation-1.
>
> Second, flexible_groups in cpuctx->ctx and cpuctx->task_ctx now have
> exactly the same priority and an equal chance to run. I am not sure
> whether this will change the behavior in some use cases.
>
> Please kindly let me know whether this approach makes sense.

SNIP

> +/*
> + * identify always_on and always_off groups in flexible_groups, call
> + * group_sched_in() for always_on groups
> + */
> +static void ctx_pick_always_on_off_groups(struct perf_event_context *ctx,
> + struct perf_cpu_context *cpuctx)
> +{
> + struct perf_event *event;
> +
> + list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
> + if (event->group_caps & PERF_EV_CAP_SOFTWARE) {
> + event->rotation_id = PERF_ROTATION_ID_ALWAYS_ON;
> + ctx->nr_sched++;
> + WARN_ON(group_sched_in(event, cpuctx, ctx));
> + continue;
> + }
> + if (group_sched_in(event, cpuctx, ctx)) {
> + event->rotation_id = PERF_ROTATION_ID_ALWAYS_OFF;
> + ctx->nr_sched++;

should ctx->nr_sched be incremented in the 'else' leg?

> + }
> + group_sched_out(event, cpuctx, ctx);
> + }
> +}
> +
> +/* add unassigned flexible_groups to new rotation_id */
> +static void ctx_add_rotation(struct perf_event_context *ctx,
> + struct perf_cpu_context *cpuctx)
> {
> struct perf_event *event;
> + int group_added = 0;
> int can_add_hw = 1;
>
> + ctx->curr_rotation = ctx->num_rotations;
> + ctx->num_rotations++;
> +
> list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
> /* Ignore events in OFF or ERROR state */
> if (event->state <= PERF_EVENT_STATE_OFF)
> @@ -3034,13 +3099,77 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
> if (!event_filter_match(event))
> continue;
>
> + if (event->rotation_id != PERF_ROTATION_ID_NOT_ASSGINED)
> + continue;
> +
> if (group_can_go_on(event, cpuctx, can_add_hw)) {
> if (group_sched_in(event, cpuctx, ctx))
> can_add_hw = 0;
> + else {
> + event->rotation_id = ctx->curr_rotation;
> + ctx->nr_sched++;
> + group_added++;

group_added is not used

SNIP

> static int perf_rotate_context(struct perf_cpu_context *cpuctx)
> {
> - struct perf_event_context *ctx = NULL;
> + struct perf_event_context *ctx = cpuctx->task_ctx;
> int rotate = 0;
> + u64 now;
>
> - if (cpuctx->ctx.nr_events) {
> - if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
> - rotate = 1;
> - }
> -
> - ctx = cpuctx->task_ctx;
> - if (ctx && ctx->nr_events) {
> - if (ctx->nr_events != ctx->nr_active)
> - rotate = 1;
> - }
> + if (!flexible_sched_done(cpuctx) ||
> + cpuctx->ctx.num_rotations > 1)
> + rotate = 1;
>
> if (!rotate)
> goto done;
> @@ -3382,15 +3492,35 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
> perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> perf_pmu_disable(cpuctx->ctx.pmu);
>
> - cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
> + update_context_time(&cpuctx->ctx);
> if (ctx)
> - ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
> + update_context_time(ctx);
> + update_cgrp_time_from_cpuctx(cpuctx);
>
> - rotate_ctx(&cpuctx->ctx);
> + ctx_switch_rotation_out(&cpuctx->ctx, cpuctx);
> if (ctx)
> - rotate_ctx(ctx);
> + ctx_switch_rotation_out(ctx, cpuctx);
>
> - perf_event_sched_in(cpuctx, ctx, current);
> + if (flexible_sched_done(cpuctx)) {
> + /* simply repeat previous calculated rotations */
> + ctx_switch_rotation_in(&cpuctx->ctx, cpuctx);
> + if (ctx)
> + ctx_switch_rotation_in(ctx, cpuctx);
> + } else {
> + /* create new rotation */
> + ctx_add_rotation(&cpuctx->ctx, cpuctx);
> + if (ctx)
> + ctx_add_rotation(ctx, cpuctx);
> + }

that seems messy.. could this be done just by setting the rotation ID
and letting perf_event_sched_in skip over different IDs and sched in /
assign un-assigned events?
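
something along these lines (untested, just to illustrate)..
ctx_flexible_sched_in() filters on the rotation id and pins un-assigned
groups to the current rotation, so perf_rotate_context() can keep the
original sched_out/rotate/sched_in shape and only bump curr_rotation:

	/* hypothetical sched_in loop, filtering by rotation id */
	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
		if (event->state <= PERF_EVENT_STATE_OFF)
			continue;
		if (!event_filter_match(event))
			continue;

		/* skip groups that belong to a different rotation */
		if (event->rotation_id != PERF_ROTATION_ID_NOT_ASSGINED &&
		    event->rotation_id != PERF_ROTATION_ID_ALWAYS_ON &&
		    event->rotation_id != ctx->curr_rotation)
			continue;

		if (group_can_go_on(event, cpuctx, can_add_hw)) {
			if (group_sched_in(event, cpuctx, ctx))
				can_add_hw = 0;
			else if (event->rotation_id == PERF_ROTATION_ID_NOT_ASSGINED)
				/* first successful sched_in pins it here */
				event->rotation_id = ctx->curr_rotation;
		}
	}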

jirka