Re: [PATCH v4 4/4] perf/core,x86: synchronize PMU task contexts on optimized context switches

From: Alexey Budankov
Date: Tue Oct 22 2019 - 08:49:16 EST



On 22.10.2019 12:43, Peter Zijlstra wrote:
> On Tue, Oct 22, 2019 at 09:01:11AM +0300, Alexey Budankov wrote:
>
>> 	swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
>>
>> +	/*
>> +	 * PMU specific parts of the task perf context can require
>> +	 * additional synchronization, which makes sense only if
>> +	 * both next_ctx->task_ctx_data and ctx->task_ctx_data
>> +	 * pointers are allocated. For an example of such
>> +	 * synchronization, see the implementation details of
>> +	 * Intel LBR call stack data profiling;
>> +	 */
>> +	if (ctx->task_ctx_data && next_ctx->task_ctx_data)
>> +		pmu->sync_task_ctx(next_ctx->task_ctx_data,
>> +				   ctx->task_ctx_data);
>
> This still does not check if pmu->sync_task_ctx is set. If any other
> arch ever uses task_ctx_data without then also supplying this method
> things will go *bang*.
>
> Also, I think I prefer the variant I gave you yesterday:
>
> https://lkml.kernel.org/r/20191021103745.GF1800@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> 	if (pmu->swap_task_ctx)
> 		pmu->swap_task_ctx(ctx, next_ctx);
> 	else
> 		swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
>
> That also unconfuses the argument order in your above patch (where you
> have to undo the swap).
>
> Alternatively, since there currently is no other arch using
> task_ctx_data, we can make the pmu::swap_task_ctx() thing mandatory when
> having it and completely replace the swap(), write it like so:
>
>
> - swap(ctx->task_ctx_data, next_ctx->task_ctx_data);

The data still has to be swapped unconditionally. Thus, the swap becomes
part of the architecture-specific implementation:

void intel_pmu_lbr_sync_task_ctx(struct x86_perf_task_context **prev,
				 struct x86_perf_task_context **next)
{
	if (*prev && *next) {
		swap((*prev)->lbr_callstack_users, (*next)->lbr_callstack_users);
		...
	}
	swap(*prev, *next);
}


> + if (pmu->swap_task_ctx)
> + pmu->swap_task_ctx(ctx, next_ctx);
>
> Hmm?

The option above looks attractive because it pushes the complexity down
into the architecture-specific implementation.

However, to keep performance at the current level, the
if (ctx->task_ctx_data && next_ctx->task_ctx_data) check has to stay
as close to the top layer as possible. So the fastest version appears
to be:

	swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
	if (ctx->task_ctx_data && next_ctx->task_ctx_data && pmu->sync_task_ctx)
		pmu->sync_task_ctx(next_ctx->task_ctx_data, ctx->task_ctx_data);

If some architecture needs specific synchronization, it is enough to
implement sync_task_ctx() without changing the core.

~Alexey