Re: [PATCH 1/2] perf: Add sched_task callback during ctx reschedule

From: Peter Zijlstra
Date: Thu Apr 06 2023 - 09:11:05 EST


On Tue, Mar 28, 2023 at 03:27:34PM -0700, kan.liang@xxxxxxxxxxxxxxx wrote:
> From: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
>
> Several similar kernel warnings can be triggered,
>
> [56605.607840] CPU0 PEBS record size 0, expected 32, config 0
> cpuc->record_size=208
>
> when the below commands are running in parallel for a while on SPR.
>
> while true; do perf record --no-buildid -a --intr-regs=AX -e
> cpu/event=0xd0,umask=0x81/pp -c 10003 -o /dev/null ./triad; done &
>
> while true; do perf record -o /tmp/out -W -d -e
> '{ld_blocks.store_forward:period=1000000,
> MEM_TRANS_RETIRED.LOAD_LATENCY:u:precise=2:ldlat=4}'
> -c 1037 ./triad; done
> *The triad program is just the generation of loads/stores.
>
> The warnings are triggered when an unexpected PEBS record (with a
> different config and size) is found.
>
> In a context switch, different events may be applied to the old task and
> the new task. The sched_task callback is used to switch the PMU-specific
> data. For the PEBS, the callback flushes the DS area and parses the PEBS
> records from the old task when schedule out. The new task never sees the
> stale PEBS records.
>
> However, the exec() doesn't have similar support. The new command may
> see the PEBS records from the old command. In the perf_event_exec(), the
> ctx_resched() is invoked to reschedule the context. It may updates the
> active events, which may change the global PEBS configuration. The PEBS
> record from the old command may have a different config and size from
> the new command. The warning is triggered.
>
> The sched_task callback should be invoked in all the places where the
> context can be changed. Add the sched_task callback in the ctx_resched()
> as well.
>
> Reported-by: Stephane Eranian <eranian@xxxxxxxxxx>
> Signed-off-by: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
> ---
> kernel/events/core.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f79fd8b87f75..0c49183656fd 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2642,6 +2642,8 @@ static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
> ctx_sched_in(ctx, EVENT_FLEXIBLE);
> }
>
> +static void perf_ctx_sched_task_cb(struct perf_event_context *ctx, bool sched_in);
> +
> /*
> * We want to maintain the following priority of scheduling:
> * - CPU pinned (EVENT_CPU | EVENT_PINNED)
> @@ -2680,6 +2682,7 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
> perf_ctx_disable(&cpuctx->ctx);
> if (task_ctx) {
> perf_ctx_disable(task_ctx);
> + perf_ctx_sched_task_cb(task_ctx, false);
> task_ctx_sched_out(task_ctx, event_type);
> }
>
> @@ -2698,8 +2701,10 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
> perf_event_sched_in(cpuctx, task_ctx);
>
> perf_ctx_enable(&cpuctx->ctx);
> - if (task_ctx)
> + if (task_ctx) {
> + perf_ctx_sched_task_cb(task_ctx, true);
> perf_ctx_enable(task_ctx);
> + }
> }
>

Urgh... yuck.. Also I think you missed perf_rotate_context() which has
an open coded resched.

But I think this is a very fragile approach. Why can't the x86 pmu/pebs
driver not flush when it's programming changes -- it, as no other, knows
when this happens.