Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

From: Kim Phillips
Date: Wed Mar 18 2020 - 13:36:04 EST


Hi Maddy,

On 3/17/20 1:50 AM, maddy wrote:
> On 3/13/20 4:08 AM, Kim Phillips wrote:
>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>>> AMD[3] provides similar information.
>>>>>>>>>
>>>>>>>>> Implementation detail:
>>>>>>>>>
>>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>>> into generic format:
>>>>>>>>>
>>>>>>>>>      struct perf_pipeline_haz_data {
>>>>>>>>>             /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>>             __u8    itype;
>>>>>>>>>             /* Instruction Cache source */
>>>>>>>>>             __u8    icache;
>>>>>>>>>             /* Instruction suffered hazard in pipeline stage */
>>>>>>>>>             __u8    hazard_stage;
>>>>>>>>>             /* Hazard reason */
>>>>>>>>>             __u8    hazard_reason;
>>>>>>>>>             /* Instruction suffered stall in pipeline stage */
>>>>>>>>>             __u8    stall_stage;
>>>>>>>>>             /* Stall reason */
>>>>>>>>>             __u8    stall_reason;
>>>>>>>>>             __u16   pad;
>>>>>>>>>      };
>>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>>> It's not really 1:1: we don't have these separations of stages
>>>>>> and reasons. We have 'missed in L2 cache', for example.
>>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>>> IBM's AFAICT.
>>>>> AMD IBS captures pipeline latency data in the case of Fetch sampling, like the
>>>>> Fetch latency, tag to retire latency, completion to retire latency and
>>>>> so on. Yes, Ops sampling does provide more data on load/store centric
>>>>> information. But it also captures more detailed data for Branch instructions.
>>>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>>>> data and latency information.
>>>>>
>>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>>> bad events. IBS' PPR descriptions have one occurrence of the
>>>>>> word stall, and no penalty. The way I read IBS is it's just
>>>>>> reporting more sample data than just the precise IP: things like
>>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>>> like 'extended', or the 'auxiliary' already used today even
>>>>>> are more appropriate for IBS, although I'm the last person to
>>>>>> bikeshed.
>>>>> We are thinking of using "pipeline" word instead of Hazard.
>>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>>> NP. We thought pipeline is generic hw term so we proposed "pipeline"
>>> word. We are open to term which can be generic enough.
>>>
>>>> I realize there are a couple of core pipeline-specific pieces
>>>> of information coming out of it, but the vast majority
>>>> are addresses, latencies of various components in the memory
>>>> hierarchy, and various component hit/miss bits.
>>> Yes, we should capture core pipeline specific details. For example,
>>> IBS generates Branch unit information (IbsOpData1) and Icache related
>>> data (IbsFetchCtl) which is something that shouldn't be extended as
>>> part of perf-mem, IMO.
>> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
>> should populate perf_mem_data_src fields, just like POWER9 can:
>>
>> union perf_mem_data_src {
>> ...
>>                 __u64   mem_rsvd:24,
>>                         mem_snoopx:2,   /* snoop mode, ext */
>>                         mem_remote:1,   /* remote */
>>                         mem_lvl_num:4,  /* memory hierarchy level number */
>>                         mem_dtlb:7,     /* tlb access */
>>                         mem_lock:2,     /* lock instr */
>>                         mem_snoop:5,    /* snoop mode */
>>                         mem_lvl:14,     /* memory hierarchy level */
>>                         mem_op:5;       /* type of opcode */
>>
>>
>> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
>> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
>> 'mem_lock', and the Reload Bus Source Encoding bits can
>> be used to populate mem_snoop, right?
> Hi Kim,
>
> Yes. We do expose these data as part of perf-mem for POWER.

OK, I see relevant PERF_MEM_S bits in arch/powerpc/perf/isa207-common.c:
isa207_find_source now, thanks.

>> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
>> used for the ld/st target addresses, too.
>>
>>>> What's needed here is a vendor-specific extended
>>>> sample information that all these technologies gather,
>>>> of which things like e.g., 'L1 TLB cycle latency' we
>>>> all should have in common.
>>> Yes. We will include fields to capture the latency cycles (like Issue
>>> latency, Instruction completion latency etc..) along with other pipeline
>>> details in the proposed structure.
>> Latency figures are just an example, and from what I
>> can tell, struct perf_sample_data already has a 'weight' member,
>> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
>> transfer memory access latency figures. Granted, that's
>> a bad name given all other vendors don't call latency
>> 'weight'.
>>
>> I didn't see any latency figures coming out of POWER9,
>> and do not expect this patchseries to implement those
>> of other vendors, e.g., AMD's IBS; leave each vendor
>> to amend perf to suit their own h/w output please.
>
> Reference structure proposed in this patchset did not have members
> to capture latency info for that exact reason. But the idea here is to
> abstract as much vendor specific data as possible. So if we include a u16 array,
> then this format can also capture data from IBS since it provides
> a few latency details.

OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
struct presented in this patchset.
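
To make the size difference concrete, here is a rough userspace sketch
(mine, not from the patchset). The first struct mirrors the layout as
posted; the second adds a u16 latency array the way I understand the
suggestion. HAZ_NR_LAT, lat[], and both struct names are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as posted in the RFC: six u8 fields plus one u16 pad. */
struct haz_data_rfc {
	uint8_t  itype;
	uint8_t  icache;
	uint8_t  hazard_stage;
	uint8_t  hazard_reason;
	uint8_t  stall_stage;
	uint8_t  stall_reason;
	uint16_t pad;
};

/* HYPOTHETICAL: the number of latency slots is an assumption. */
#define HAZ_NR_LAT 4

/* HYPOTHETICAL variant: same fields, followed by a u16 latency array. */
struct haz_data_with_lat {
	uint8_t  itype;
	uint8_t  icache;
	uint8_t  hazard_stage;
	uint8_t  hazard_reason;
	uint8_t  stall_stage;
	uint8_t  stall_reason;
	uint16_t pad;
	uint16_t lat[HAZ_NR_LAT];	/* one slot per latency figure */
};

/* The posted layout packs into a single 64-bit word. */
static_assert(sizeof(struct haz_data_rfc) == 8,
	      "RFC layout should be 8 bytes");

/* Each added latency slot costs two more bytes per sample. */
static_assert(sizeof(struct haz_data_with_lat) ==
	      sizeof(struct haz_data_rfc) + HAZ_NR_LAT * sizeof(uint16_t),
	      "latency array should follow the RFC fields without padding");
```

So every latency slot adds two bytes to each sample, and someone has
to decide what each slot means for each vendor.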

IBS Ops can report e.g.:

15 tag-to-retire cycles bits,
15 completion to retire count bits,
15 L1 DTLB refill latency bits,
15 DC miss latency bits,
5 outstanding memory requests on mem refill bits, and so on.

IBS Fetch reports 15 bits of fetch latency, and another 16
for iTLB latency, among others.

Some of these may/may not be valid simultaneously, and
there are IBS specific rules to establish validity.
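
As a sketch of what "valid only under certain rules" means for a
consumer: a latency field gets extracted from a raw register value
only when its qualifying bit is set. The bit positions and EX_* names
below are invented for the example and are NOT the real IBS register
layout; see the PPR for that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Extract `width` bits starting at bit `shift` of a raw register value. */
static inline uint64_t reg_field(uint64_t raw, unsigned int shift,
				 unsigned int width)
{
	assert(width > 0 && width < 64 && shift + width <= 64);
	return (raw >> shift) & ((UINT64_C(1) << width) - 1);
}

/* HYPOTHETICAL layout: a 15-bit latency qualified by a separate bit. */
#define EX_DC_MISS_VALID_SHIFT	7
#define EX_DC_MISS_LAT_SHIFT	32
#define EX_DC_MISS_LAT_WIDTH	15

struct ex_latency {
	bool     valid;		/* false: cycles must be ignored */
	uint16_t cycles;
};

/* Return the latency only if the hardware marked it as meaningful. */
static inline struct ex_latency ex_dc_miss_latency(uint64_t raw)
{
	struct ex_latency lat = { false, 0 };

	if (reg_field(raw, EX_DC_MISS_VALID_SHIFT, 1)) {
		lat.valid = true;
		lat.cycles = (uint16_t)reg_field(raw, EX_DC_MISS_LAT_SHIFT,
						 EX_DC_MISS_LAT_WIDTH);
	}
	return lat;
}
```

A 15-bit figure fits in a u16 slot, but the validity bit has to travel
with it somehow, which a flat u16 array doesn't express by itself.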

>> My main point there, however, was that each vendor should
>> use streamlined record-level code to just copy the data
>> in the proprietary format that their hardware produces,
>> and then perf tooling can synthesize the events
>> from the raw data at report/script/etc. time.
>>
>>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>>> either. Can we use PERF_SAMPLE_AUX instead?
>>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended when
>>> large volume of data needs to be captured as part of perf.data without
>>> frequent PMIs. But proposed type is to address the capture of pipeline
>> SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
>> PMIs are, even though it may be used in those environments.
>>
>>> information on each sample using PMI at periodic intervals. Hence proposing
>>> PERF_SAMPLE_PIPELINE_HAZ.
>> And that's fine for any extra bits that POWER9 has to convey
>> to its users beyond things already represented by other sample
>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>> and other vendor e.g., AMD IBS data can be made vendor-independent
>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>> what IBS currently uses.
>
> My bad. Not sure what you mean by this. We are trying to abstract
> as much vendor specific data as possible with this (like perf-mem).

Perhaps if I say it this way: instead of doing all the
isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
in patch 4/11, just put the raw SIER value in a
PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
Specific SIER capabilities can be written as part of the perf.data
header. Then synthesize the true pipe events from the raw SIER
values later, in userspace.

I guess it's technically optional, but I think that's how
I'd do it in IBS, since it minimizes the record-time overhead.
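
Roughly, the split I have in mind looks like the userspace sketch
below. Everything in it is hypothetical: the ex_* names, the vendor
ids, the record format, and both bit layouts are invented to show the
shape of the idea, and none of it is existing perf code or a real
register format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_MAX_REGS 4

/* HYPOTHETICAL vendor ids, standing in for the sample identifier. */
enum ex_vendor { EX_VENDOR_A = 1, EX_VENDOR_B = 2 };

/* HYPOTHETICAL raw record: register values copied as-is. */
struct ex_raw_sample {
	uint32_t vendor;
	uint32_t nr_regs;
	uint64_t regs[EX_MAX_REGS];
};

/* Generic event, only produced at report time. */
struct ex_event {
	uint8_t  is_load;
	uint8_t  l2_miss;
	uint16_t latency;
};

/* Record time: copy up to EX_MAX_REGS register values, no decoding. */
static inline void ex_record(struct ex_raw_sample *s, uint32_t vendor,
			     const uint64_t *regs, uint32_t nr)
{
	assert(s && regs);
	if (nr > EX_MAX_REGS)
		nr = EX_MAX_REGS;
	memset(s, 0, sizeof(*s));
	s->vendor = vendor;
	s->nr_regs = nr;
	memcpy(s->regs, regs, nr * sizeof(regs[0]));
}

/* Report time: vendor-specific decode. Returns 0 on success, else -1. */
static inline int ex_synthesize(const struct ex_raw_sample *s,
				struct ex_event *ev)
{
	uint64_t r;

	memset(ev, 0, sizeof(*ev));
	if (s->nr_regs < 1)
		return -1;
	r = s->regs[0];

	switch (s->vendor) {
	case EX_VENDOR_A:	/* made-up layout: bit 0, bit 1, bits 16-30 */
		ev->is_load = (uint8_t)(r & 1);
		ev->l2_miss = (uint8_t)((r >> 1) & 1);
		ev->latency = (uint16_t)((r >> 16) & 0x7fff);
		return 0;
	case EX_VENDOR_B:	/* made-up layout: bit 63, bit 62, bits 0-15 */
		ev->is_load = (uint8_t)((r >> 63) & 1);
		ev->l2_miss = (uint8_t)((r >> 62) & 1);
		ev->latency = (uint16_t)(r & 0xffff);
		return 0;
	default:
		return -1;	/* unknown vendor: leave the sample raw */
	}
}
```

The record-time side is a bounded copy, and all the per-vendor
knowledge sits in the report-time switch.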

Thanks,

Kim

> Maddy
>>
>>>> Take a look at
>>>> commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
>>>> definitions". The sample identifier can be used to determine
>>>> which vendor's sampling IP's data is in it, and events can
>>>> be recorded just by copying the content of the SIER, etc.
>>>> registers, and then events get synthesized from the aux
>>>> sample at report/inject/annotate etc. time. This allows
>>>> for less sample recording overhead, and moves all the vendor
>>>> specific decoding and common event conversions for userspace
>>>> to figure out.
>>> When AUX buffer data is structured, tool side changes added to present the
>>> pipeline data can be re-used.
>> Not sure I understand: AUX data would be structured on
>> each vendor's raw h/w register formats.
>>
>> Thanks,
>>
>> Kim
>>
>>>>>>> Also worth considering is the support of ARM SPE (Statistical
>>>>>>> Profiling Extension) which is their version of IBS.
>>>>>>> Whatever gets added needs to cover all three with no limitations.
>>>>>> I thought Intel's various LBR, PEBS, and PT supported providing
>>>>>> similar sample data in perf already, like with perf mem/c2c?
>>>>> perf-mem is more data-centric in my opinion. It is more towards
>>>>> memory profiling. So proposal here is to expose pipeline related
>>>>> details like stalls and latencies.
>>>> Like I said, I don't see it that way, I see it as 'any particular
>>>> vendor's event's extended details', and these pipeline details
>>>> have overlap with existing infrastructure within perf, e.g., L2
>>>> cache misses.
>>>>
>>>> Kim
>>>>
>