RE: [RFC 0/6] optimize ctx switch with rb-tree

From: Budankov, Alexey
Date: Tue Apr 25 2017 - 14:12:27 EST


Hi David,

I profiled a single-threaded STREAM benchmark pinned to core #1 on a KNL machine running kernel 3.10.0-327.13.1.el7.x86_64 (from the MPSS distribution). I ran system-wide and per-process profiling for the same list of events and got a ~5x slowdown in the per-process scenario.
At the same time I collected ftrace logs for both cases and found a large slowdown in the handling of the multiplexing timer interrupt in the per-process case compared to the system-wide one:

begin=>stream-22736 [001] d.h. 5409.272489: perf_cpu_hrtimer_handler <-__hrtimer_run_queues
stream-22736 [001] d.h. 5409.272490: perf_pmu_disable <-perf_cpu_hrtimer_handler
stream-22736 [001] d.h. 5409.272491: intel_pmu_disable_all <-x86_pmu_disable
stream-22736 [001] d.h. 5409.272492: intel_pmu_pebs_disable_all <-intel_pmu_disable_all
stream-22736 [001] d.h. 5409.272492: intel_pmu_lbr_disable_all <-intel_pmu_disable_all
stream-22736 [001] d.h. 5409.272494: perf_pmu_disable <-ctx_sched_out_2.constprop.85
stream-22736 [001] d.h. 5409.272494: perf_pmu_enable <-ctx_sched_out_2.constprop.85
stream-22736 [001] d.h. 5409.272495: perf_pmu_disable <-ctx_sched_out_2.constprop.85
stream-22736 [001] d.h. 5409.272506: perf_pmu_disable <-event_sched_out.isra.72
stream-22736 [001] d.h. 5409.272507: intel_pmu_disable_event <-x86_pmu_stop
stream-22736 [001] d.h. 5409.272509: intel_pmu_pebs_disable <-intel_pmu_disable_event
stream-22736 [001] d.h. 5409.272511: perf_event_update_userpage <-x86_pmu_del
=>stream-22736 [001] d.h. 5409.272511: perf_pmu_enable <-event_sched_out.isra.72
+2ms=>stream-22736 [001] d.h. 5409.274445: perf_pmu_enable <-ctx_sched_out_2.constprop.85
stream-22736 [001] d.h. 5409.274446: rotate_ctx.part.41 <-perf_cpu_hrtimer_handler
stream-22736 [001] d.h. 5409.274447: rotate_ctx.part.41 <-perf_cpu_hrtimer_handler
stream-22736 [001] d.h. 5409.274454: perf_pmu_disable <-x86_pmu_start_txn
stream-22736 [001] d.h. 5409.274455: perf_pmu_disable <-event_sched_in.isra.74
stream-22736 [001] d.h. 5409.274457: perf_pmu_enable <-event_sched_in.isra.74
stream-22736 [001] d.h. 5409.274459: intel_pmu_event_map <-intel_get_event_constraints
stream-22736 [001] d.h. 5409.274460: intel_pmu_event_map <-intel_get_event_constraints
stream-22736 [001] d.h. 5409.274461: intel_pmu_event_map <-intel_get_event_constraints
stream-22736 [001] d.h. 5409.274462: intel_pmu_event_map <-intel_get_event_constraints
stream-22736 [001] d.h. 5409.274462: intel_pmu_event_map <-intel_get_event_constraints
stream-22736 [001] d.h. 5409.274463: perf_pmu_enable <-x86_pmu_commit_txn
stream-22736 [001] d.h. 5409.274491: perf_pmu_disable <-x86_pmu_start_txn
stream-22736 [001] d.h. 5409.274492: perf_pmu_disable <-event_sched_in.isra.74
stream-22736 [001] d.h. 5409.274492: perf_pmu_enable <-event_sched_in.isra.74
stream-22736 [001] d.h. 5409.274493: perf_pmu_enable <-x86_pmu_cancel_txn
=>stream-22736 [001] d.h. 5409.274493: perf_cpu_hrtimer_restart <-group_sched_in
+600us=>stream-22736 [001] d.h. 5409.275146: perf_pmu_enable <-perf_cpu_hrtimer_handler
stream-22736 [001] d.h. 5409.275148: perf_event_update_userpage <-x86_perf_event_set_period
stream-22736 [001] d.h. 5409.275148: intel_pmu_enable_event <-x86_pmu_start
stream-22736 [001] d.h. 5409.275149: intel_pmu_pebs_enable <-intel_pmu_enable_event
stream-22736 [001] d.h. 5409.275150: perf_event_update_userpage <-x86_pmu_start
stream-22736 [001] d.h. 5409.275151: intel_pmu_enable_all <-x86_pmu_enable
stream-22736 [001] d.h. 5409.275152: intel_pmu_pebs_enable_all <-intel_pmu_enable_all
end=>stream-22736 [001] d.h. 5409.275152: intel_pmu_lbr_enable_all <-intel_pmu_enable_all
stream-22736 [001] d.h. 5409.275176: perf_event_task_tick <-scheduler_tick
stream-22736 [001] .... 5409.275217: perf_event_mmap_output <-perf_event_aux_ctx
stream-22736 [001] d.h. 5409.275233: intel_pmu_handle_irq <-perf_event_nmi_handler
stream-22736 [001] d.h. 5409.275234: intel_pmu_disable_all <-intel_pmu_handle_irq

I looked at the source code and see that in the per-process case the whole list of allocated events (29 x 272) is traversed twice on every timer interrupt, which accounts for the 2ms and 600us gaps marked above. In the system-wide case, however, only the 29 per-cpu events are iterated:

kernel/events/core.c:2799:

static void ctx_sched_out(struct perf_event_context *ctx,
                          struct perf_cpu_context *cpuctx,
                          enum event_type_t event_type)
{
#ifdef PERF___ENABLE_CTX_ROT    /* local macro I use to compile the group traversal in/out */
        struct perf_event *event;
        int is_active = ctx->is_active;
#endif

        ctx->is_active &= ~event_type;
        if (likely(!ctx->nr_events))
                return;

        update_context_time(ctx);
        update_cgrp_time_from_cpuctx(cpuctx);
        if (!ctx->nr_active)
                return;

        perf_pmu_disable(ctx->pmu);
#ifdef PERF___ENABLE_CTX_ROT
        if ((is_active & EVENT_PINNED) && (event_type & EVENT_PINNED)) {
                list_for_each_entry(event, &ctx->pinned_groups, group_entry)
                        group_sched_out(event, cpuctx, ctx);
        }

        if ((is_active & EVENT_FLEXIBLE) && (event_type & EVENT_FLEXIBLE)) {
                list_for_each_entry(event, &ctx->flexible_groups, group_entry)
                        group_sched_out(event, cpuctx, ctx);
        }
#endif
        perf_pmu_enable(ctx->pmu);
}

kernel/events/core.c:2384

static void
ctx_sched_in(struct perf_event_context *ctx,
             struct perf_cpu_context *cpuctx,
             enum event_type_t event_type,
             struct task_struct *task)
{
        u64 now;
#ifdef PERF___ENABLE_CTX_ROT
        int is_active = ctx->is_active;
        ctx->is_active |= event_type;
#endif
        if (likely(!ctx->nr_events))
                return;

        now = perf_clock();
        ctx->timestamp = now;
        perf_cgroup_set_timestamp(task, ctx);
        /*
         * First go through the list and put on any pinned groups
         * in order to give them the best chance of going on.
         */
#ifdef PERF___ENABLE_CTX_ROT
        if (!(is_active & EVENT_PINNED) && (event_type & EVENT_PINNED))
                ctx_pinned_sched_in(ctx, cpuctx);

        /* Then walk through the lower prio flexible groups */
        if (!(is_active & EVENT_FLEXIBLE) && (event_type & EVENT_FLEXIBLE))
                ctx_flexible_sched_in(ctx, cpuctx);
#endif
}

If I disable this traversal in the per-process case (i.e. build with PERF___ENABLE_CTX_ROT undefined), the overhead disappears.

In the system-wide case, the ctx->pinned_groups and ctx->flexible_groups lists are part of the per-cpu perf_cpu_context object, so the number of iterations stays small (#events == 29).
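
For illustration, here is a minimal userspace sketch of why the per-process task context gets so large. It assumes the tool opens one fd per (event, CPU) pair when monitoring a task across all CPUs, so the 29 events become 29 x 272 = 7888 perf events hanging off a single task perf_event_context; the event type, PID and helper name below are placeholders, not the exact invocation I used:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

/* Placeholder wrapper around the raw perf_event_open(2) syscall. */
static long perf_open(struct perf_event_attr *attr, pid_t pid, int cpu)
{
        return syscall(__NR_perf_event_open, attr, pid, cpu, -1, 0UL);
}

int main(void)
{
        struct perf_event_attr attr;
        pid_t pid = 22736;      /* the monitored task, e.g. stream */
        int ev, cpu;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES; /* placeholder for each of the 29 events */

        /*
         * One event per (event, CPU) pair: 29 * 272 = 7888 events,
         * all attached to the same task perf_event_context.
         */
        for (ev = 0; ev < 29; ev++)
                for (cpu = 0; cpu < 272; cpu++)
                        perf_open(&attr, pid, cpu);

        return 0;
}

All of those events end up on the pinned/flexible group lists that ctx_sched_out()/ctx_sched_in() walk above, which matches the 2ms and 600us gaps in the trace.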

Thanks,
Alexey

-----Original Message-----
From: David Carrillo-Cisneros [mailto:davidcc@xxxxxxxxxx]
Sent: Tuesday, April 25, 2017 8:49 PM
To: Liang, Kan <kan.liang@xxxxxxxxx>
Cc: linux-kernel@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx; Ingo Molnar <mingo@xxxxxxxxxx>; Thomas Gleixner <tglx@xxxxxxxxxxxxx>; Andi Kleen <ak@xxxxxxxxxxxxxxx>; Peter Zijlstra <peterz@xxxxxxxxxxxxx>; Borislav Petkov <bp@xxxxxxx>; Srinivas Pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx>; Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>; Vikas Shivappa <vikas.shivappa@xxxxxxxxxxxxxxx>; Mark Rutland <mark.rutland@xxxxxxx>; Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>; Vince Weaver <vince@xxxxxxxxxx>; Paul Turner <pjt@xxxxxxxxxx>; Stephane Eranian <eranian@xxxxxxxxxx>; Budankov, Alexey <alexey.budankov@xxxxxxxxx>
Subject: Re: [RFC 0/6] optimize ctx switch with rb-tree

Hi Kan,

It's still on my list, but I won't have time to work on it for at least another month.

What issues did you encounter?

Thanks,
David

On Tue, Apr 25, 2017 at 10:27 AM, Liang, Kan <kan.liang@xxxxxxxxx> wrote:
> Hi David,
>
> Is there any update about the patch series?
>
> We recently encountered another performance issue on KNL. I think the RB-tree solution also has benefits for it.
>
> Thanks,
> Kan
>
>> Subject: [RFC 0/6] optimize ctx switch with rb-tree
>>
>> Following the discussion in:
>> https://patchwork.kernel.org/patch/9420035/
>>
>> This is an early version of a series of perf context switch optimizations.
>>
>> The main idea is to create and maintain a list of inactive events
>> sorted by timestamp, plus an rb-tree index over it. The rb-tree's
>> keys are {cpu,flexible,stamp} for task contexts and
>> {cgroup,flexible,stamp} for CPU contexts.
>>
>> The rb-tree provides functions to find intervals in the inactive
>> event list so that ctx_sched_in only has to visit the events that
>> can potentially be scheduled (i.e. it avoids iterating over events
>> bound to CPUs or cgroups that are not current).
>>
>> Since the inactive list is sorted by timestamp, rotation can be done
>> by simply scheduling the events out and back in. This implies that on
>> each timer interrupt the events will rotate by q events (where q is
>> the number of hardware counters). This changes the current rotation
>> behavior.
>> Feedback welcome!
>>
>> I haven't profiled the new approach. I am only assuming it will be
>> superior when the number of per-cpu or distinct cgroup events is large.
>>
>> The last patch shows how perf_iterate_ctx can use the new rb-tree
>> index to reduce the number of visited events. I haven't looked
>> carefully at whether locking and other things are correct.
>>
>> If these changes are in the right direction, a next version could
>> remove some existing code; specifically, the ctx->pinned_groups and
>> ctx->flexible_groups lists could be removed. Also, event_filter_match
>> could be simplified when called on event groups filtered using the
>> rb-tree, since both perform similar checks.
>>
>> David Carrillo-Cisneros (6):
>> perf/core: create active and inactive event groups
>> perf/core: add a rb-tree index to inactive_groups
>> perf/core: use rb-tree to sched in event groups
>> perf/core: avoid rb-tree traversal when no inactive events
>> perf/core: rotation no longer neccesary. Behavior has changed. Beware
>> perf/core: use rb-tree index to optimize filtered perf_iterate_ctx
>>
>> include/linux/perf_event.h |  13 ++
>> kernel/events/core.c       | 466 +++++++++++++++++++++++++++++++++++++++------
>> 2 files changed, 426 insertions(+), 53 deletions(-)
>>
>> --
>> 2.11.0.390.gc69c2f50cf-goog
>
