Re: [PATCH 2/2] perf/x86/intel/ds: Use the size from each PEBS record
From: Peter Zijlstra
Date: Thu Apr 06 2023 - 09:14:07 EST
On Tue, Mar 28, 2023 at 03:27:35PM -0700, kan.liang@xxxxxxxxxxxxxxx wrote:
> From: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
>
> The kernel warning for the unexpected PEBS record can also be observed
> during a context switch, when the below commands are running in parallel
> for a while on SPR.
>
> while true; do perf record --no-buildid -a --intr-regs=AX -e
> cpu/event=0xd0,umask=0x81/pp -c 10003 -o /dev/null ./triad; done &
>
> while true; do perf record -o /tmp/out -W -d -e
> '{ld_blocks.store_forward:period=1000000,
> MEM_TRANS_RETIRED.LOAD_LATENCY:u:precise=2:ldlat=4}'
> -c 1037 ./triad; done
> *The triad program is just the generation of loads/stores.
>
> The current PEBS code assumes that all the PEBS records in the DS buffer
> have the same size, i.e. cpuc->pebs_record_size. That is true in most
> cases, since the DS buffer is flushed on every context switch.
>
> However, there is a corner case that breaks the assumption.
> A system-wide PEBS event with the large PEBS config may be enabled
> during a context switch. Some PEBS records for the system-wide PEBS may
> be generated while the old task is sched out but the new one hasn't been
> sched in yet. When the new task is sched in, the cpuc->pebs_record_size
> may be updated for the per-task PEBS events. So the existing system-wide
> PEBS records have a different size from the later PEBS records.
>
> Two methods were considered to fix the issue.
> One is to flush the DS buffer for the system-wide PEBS right before the
> new task sched in. It has to be done in the generic code via the
> sched_task() callback. However, sched_task() is shared among
> different ARCHs. Moving it may impact other ARCHs, e.g., AMD BRS
> requires that sched_task() be called after the PMU has started on a
> ctxswin. The method is dropped.
>
> The other method is implemented here. It doesn't assume that all the
> PEBS records have the same size any more. The size from each PEBS record
> is used to parse the record. For earlier platforms (PEBS format < 4),
> which don't support adaptive PEBS, nothing changes.
Same as with the other; why can't we flush the buffer when we reprogram
the hardware?
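
For reference, the per-record parsing that the quoted changelog describes
can be sketched roughly as below. This is only an illustration, not the
code from the patch: rec_size(), put_header() and count_records() are
hypothetical names, and the sketch assumes that each record carries its
own size in bits 63:48 of its first 64-bit word. The walk advances by
each record's own size rather than by one cached size, and bails out if
a size does not fit in what is left of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the record size sits in bits 63:48 of the first qword. */
#define REC_SIZE_SHIFT 48

/* Read the size that a record advertises in its own header. */
static size_t rec_size(const unsigned char *p)
{
	uint64_t format_size;

	memcpy(&format_size, p, sizeof(format_size));
	return (size_t)(format_size >> REC_SIZE_SHIFT);
}

/* Hypothetical helper for building test buffers: write a header word. */
static void put_header(unsigned char *p, uint16_t size)
{
	uint64_t format_size = (uint64_t)size << REC_SIZE_SHIFT;

	memcpy(p, &format_size, sizeof(format_size));
}

/*
 * Walk the records in [base, top), stepping by each record's own size.
 * Returns the number of records, or -1 if a record is malformed
 * (too small to hold a header, or running past the end of the buffer).
 */
static int count_records(const unsigned char *base, const unsigned char *top)
{
	const unsigned char *at = base;
	int n = 0;

	while (at < top) {
		size_t left = (size_t)(top - at);
		size_t size;

		if (left < sizeof(uint64_t))
			return -1;

		size = rec_size(at);
		if (size < sizeof(uint64_t) || size > left)
			return -1;

		at += size;
		n++;
	}

	return n;
}
```

With a buffer holding a 64-byte record followed by a 24-byte one, a walk
that assumes a fixed size of 64 would step past the end of the second
record, while the walk above visits both.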