Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

From: Kim Phillips
Date: Thu Mar 26 2020 - 15:49:50 EST




On 3/26/20 5:19 AM, maddy wrote:
>
>
> On 3/18/20 11:05 PM, Kim Phillips wrote:
>> Hi Maddy,
>>
>> On 3/17/20 1:50 AM, maddy wrote:
>>> On 3/13/20 4:08 AM, Kim Phillips wrote:
>>>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>>>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>>>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>>>>> AMD[3] provides similar information.
>>>>>>>>>>>
>>>>>>>>>>> Implementation detail:
>>>>>>>>>>>
>>>>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>>>>> into generic format:
>>>>>>>>>>>
>>>>>>>>>>>     struct perf_pipeline_haz_data {
>>>>>>>>>>>            /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>>>>            __u8    itype;
>>>>>>>>>>>            /* Instruction Cache source */
>>>>>>>>>>>            __u8    icache;
>>>>>>>>>>>            /* Instruction suffered hazard in pipeline stage */
>>>>>>>>>>>            __u8    hazard_stage;
>>>>>>>>>>>            /* Hazard reason */
>>>>>>>>>>>            __u8    hazard_reason;
>>>>>>>>>>>            /* Instruction suffered stall in pipeline stage */
>>>>>>>>>>>            __u8    stall_stage;
>>>>>>>>>>>            /* Stall reason */
>>>>>>>>>>>            __u8    stall_reason;
>>>>>>>>>>>            __u16   pad;
>>>>>>>>>>>     };
>>>>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>>>>> It's not really 1:1; we don't have these separations of stages
>>>>>>>> and reasons. For example, we have 'missed in L2 cache'.
>>>>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>>>>> IBM's AFAICT.
>>>>>>> AMD IBS captures pipeline latency data in case of Fetch sampling, like the
>>>>>>> Fetch latency, tag to retire latency, completion to retire latency and
>>>>>>> so on. Yes, Ops sampling does provide more data on load/store centric
>>>>>>> information. But it also captures more detailed data for Branch instructions.
>>>>>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>>>>>> data and latency information.
>>>>>>>
>>>>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>>>>> bad events. IBS' PPR descriptions have one occurrence of the
>>>>>>>> word stall, and no penalty. The way I read IBS is it's just
>>>>>>>> reporting more sample data than just the precise IP: things like
>>>>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>>>>> like 'extended', or the 'auxiliary' already used today even
>>>>>>>> are more appropriate for IBS, although I'm the last person to
>>>>>>>> bikeshed.
>>>>>>> We are thinking of using "pipeline" word instead of Hazard.
>>>>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>>>>> NP. We thought pipeline is a generic hw term so we proposed the "pipeline"
>>>>> word. We are open to any term that is generic enough.
>>>>>
>>>>>> I realize there are a couple of core pipeline-specific pieces
>>>>>> of information coming out of it, but the vast majority
>>>>>> are addresses, latencies of various components in the memory
>>>>>> hierarchy, and various component hit/miss bits.
>>>>> Yes. we should capture core pipeline specific details. For example,
>>>>> IBS generates Branch unit information (IbsOpData1) and Icache related
>>>>> data(IbsFetchCtl) which is something that shouldn't be extended as
>>>>> part of perf-mem, IMO.
>>>> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
>>>> should populate perf_mem_data_src fields, just like POWER9 can:
>>>>
>>>> union perf_mem_data_src {
>>>> ...
>>>>                   __u64   mem_rsvd:24,
>>>>                           mem_snoopx:2,   /* snoop mode, ext */
>>>>                           mem_remote:1,   /* remote */
>>>>                           mem_lvl_num:4,  /* memory hierarchy level number */
>>>>                           mem_dtlb:7,     /* tlb access */
>>>>                           mem_lock:2,     /* lock instr */
>>>>                           mem_snoop:5,    /* snoop mode */
>>>>                           mem_lvl:14,     /* memory hierarchy level */
>>>>                           mem_op:5;       /* type of opcode */
>>>>
>>>>
>>>> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
>>>> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
>>>> 'mem_lock', and the Reload Bus Source Encoding bits can
>>>> be used to populate mem_snoop, right?
>>> Hi Kim,
>>>
>>> Yes. We do expose these data as part of perf-mem for POWER.
>> OK, I see relevant PERF_MEM_S bits in arch/powerpc/perf/isa207-common.c:
>> isa207_find_source now, thanks.
>>
>>>> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
>>>> used for the ld/st target addresses, too.
>>>>
>>>>>> What's needed here is a vendor-specific extended
>>>>>> sample information that all these technologies gather,
>>>>>> of which things like e.g., 'L1 TLB cycle latency' we
>>>>>> all should have in common.
>>>>> Yes. We will include fields to capture the latency cycles (like Issue
>>>>> latency, Instruction completion latency etc..) along with other pipeline
>>>>> details in the proposed structure.
>>>> Latency figures are just an example, and from what I
>>>> can tell, struct perf_sample_data already has a 'weight' member,
>>>> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
>>>> transfer memory access latency figures. Granted, that's
>>>> a bad name given all other vendors don't call latency
>>>> 'weight'.
>>>>
>>>> I didn't see any latency figures coming out of POWER9,
>>>> and do not expect this patchseries to implement those
>>>> of other vendors, e.g., AMD's IBS; leave each vendor
>>>> to amend perf to suit their own h/w output please.
>>> Reference structure proposed in this patchset did not have members
>>> to capture latency info for that exact reason. But idea here is to
>>> abstract as vendor specific as possible. So if we include u16 array,
>>> then this format can also capture data from IBS since it provides
>>> few latency details.
>> OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
>> struct presented in this patchset.
>>
>> IBS Ops can report e.g.:
>>
>> 15 tag-to-retire cycles bits,
>> 15 completion to retire count bits,
>> 15 L1 DTLB refill latency bits,
>> 15 DC miss latency bits,
>> 5 outstanding memory requests on mem refill bits, and so on.
>>
>> IBS Fetch reports 15 bits of fetch latency, and another 16
>> for iTLB latency, among others.
>>
>> Some of these may/may not be valid simultaneously, and
>> there are IBS specific rules to establish validity.
>>
>>>> My main point there, however, was that each vendor should
>>>> use streamlined record-level code to just copy the data
>>>> in the proprietary format that their hardware produces,
>>>> and then perf tooling can synthesize the events
>>>> from the raw data at report/script/etc. time.
>>>>
>>>>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>>>>> either. Can we use PERF_SAMPLE_AUX instead?
>>>>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended when
>>>>> large volume of data needs to be captured as part of perf.data without
>>>>> frequent PMIs. But proposed type is to address the capture of pipeline
>>>> SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
>>>> PMIs are, even though it may be used in those environments.
>>>>
>>>>> information on each sample using PMI at periodic intervals. Hence proposing
>>>>> PERF_SAMPLE_PIPELINE_HAZ.
>>>> And that's fine for any extra bits that POWER9 has to convey
>>>> to its users beyond things already represented by other sample
>>>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>>>> and other vendor e.g., AMD IBS data can be made vendor-independent
>>>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>>>> what IBS currently uses.
>>> My bad. Not sure what you mean by this. We are trying to abstract
>>> as much vendor specific data as possible with this (like perf-mem).
>> Perhaps if I say it this way: instead of doing all the
>> isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
>> in patch 4/11, just put the raw sier value in a
>> PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
>> Specific SIER capabilities can be written as part of the perf.data
>> header. Then synthesize the true pipe events from the raw SIER
>> values later, and in userspace.
>
> Hi Kim,
>
> Would like to stay away from SAMPLE_RAW type because of these comments in perf_event.h:
>
> *      #
> *      # The RAW record below is opaque data wrt the ABI
> *      #
> *      # That is, the ABI doesn't make any promises wrt to
> *      # the stability of its content, it may vary depending
> *      # on event, hardware, kernel version and phase of
> *      # the moon.
> *      #
> *      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
> *      #

The "it may vary depending on ... hardware" clause makes it sound
appropriate for the use-case where the raw hardware register contents
are copied directly into the user buffer.
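
To illustrate the raw-copy approach, here is a minimal userspace sketch.
It assumes a made-up bit layout: the RAW_* field positions, struct
pipe_info, and decode_raw_sample() are all hypothetical and do not
reflect the real SIER or IBS encodings. The idea is only that the kernel
copies the 64-bit register value into the sample as-is, and tooling
decodes it at report/script time:

```c
#include <stdint.h>

/*
 * Hypothetical bit layout, for illustration only.
 * This is NOT the real SIER (or IBS) register encoding.
 */
#define RAW_ITYPE_SHIFT         0
#define RAW_ITYPE_MASK          0x7ULL  /* 3 bits: instruction type */
#define RAW_STALL_STAGE_SHIFT   3
#define RAW_STALL_STAGE_MASK    0x7ULL  /* 3 bits: stage that stalled */
#define RAW_STALL_REASON_SHIFT  6
#define RAW_STALL_REASON_MASK   0xfULL  /* 4 bits: stall reason */

/* Decoded view that a report tool could build from one raw sample. */
struct pipe_info {
	uint8_t itype;
	uint8_t stall_stage;
	uint8_t stall_reason;
};

/* Extract one bit field from the raw register value. */
static inline uint64_t raw_field(uint64_t raw, unsigned int shift,
				 uint64_t mask)
{
	return (raw >> shift) & mask;
}

/*
 * Userspace-side decode: the record path only stored 'raw';
 * all interpretation happens here, after the fact.
 */
static struct pipe_info decode_raw_sample(uint64_t raw)
{
	struct pipe_info p;

	p.itype = (uint8_t)raw_field(raw, RAW_ITYPE_SHIFT, RAW_ITYPE_MASK);
	p.stall_stage = (uint8_t)raw_field(raw, RAW_STALL_STAGE_SHIFT,
					   RAW_STALL_STAGE_MASK);
	p.stall_reason = (uint8_t)raw_field(raw, RAW_STALL_REASON_SHIFT,
					    RAW_STALL_REASON_MASK);
	return p;
}
```

With this split, a different vendor's register layout would only need
its own decode routine in the tool, while the record-time path stays a
plain copy.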

> Secondly, sorry I didn't understand your suggestion about using PERF_SAMPLE_AUX.
> IIUC, SAMPLE_AUX will go to AUX ring buffer, which is more memory and more
> challenging when correlating and presenting the pipeline details for each IP.
> IMO, having a new sample type can be useful to capture the pipeline data
> both in perf_sample_data and if _AUX is enabled, can be made to push to
> AUX buffer.

OK, I didn't think SAMPLE_AUX and the aux ring buffer were
interdependent, sorry.

Thanks,

Kim