Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information
From: Kim Phillips
Date: Thu Mar 12 2020 - 18:38:51 EST
On 3/11/20 11:00 AM, Ravi Bangoria wrote:
> Hi Kim,

Hi Ravi,

> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>> Modern processors export such hazard data in Performance
>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>> AMD[3] provides similar information.
>>>>>>>
>>>>>>> Implementation detail:
>>>>>>>
>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>> into generic format:
>>>>>>>
>>>>>>>     struct perf_pipeline_haz_data {
>>>>>>>            /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>            __u8    itype;
>>>>>>>            /* Instruction Cache source */
>>>>>>>            __u8    icache;
>>>>>>>            /* Instruction suffered hazard in pipeline stage */
>>>>>>>            __u8    hazard_stage;
>>>>>>>            /* Hazard reason */
>>>>>>>            __u8    hazard_reason;
>>>>>>>            /* Instruction suffered stall in pipeline stage */
>>>>>>>            __u8    stall_stage;
>>>>>>>            /* Stall reason */
>>>>>>>            __u8    stall_reason;
>>>>>>>            __u16   pad;
>>>>>>>     };
>>>>>>
>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>
>>>> It's not really 1:1: we don't have these separations of stages
>>>> and reasons. For example, we have "missed in L2 cache".
>>>> So IBS output is flatter, with more cycle latency figures than
>>>> IBM's AFAICT.
>>>
>>> AMD IBS captures pipeline latency data in the case of Fetch sampling, like
>>> Fetch latency, tag to retire latency, completion to retire latency and
>>> so on. Yes, Ops sampling does provide more load/store-centric
>>> information. But it also captures more detailed data for Branch instructions.
>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>> data and latency information.
>>>
>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>
>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>> bad events. IBS' PPR descriptions have one occurrence of the
>>>> word stall, and none of penalty. The way I read IBS is it's just
>>>> reporting more sample data than just the precise IP: things like
>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>> like 'extended', or even the 'auxiliary' already used today,
>>>> are more appropriate for IBS, although I'm the last person to
>>>> bikeshed.
>>>
>>> We are thinking of using the word "pipeline" instead of "hazard".
>>
>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>
> NP. We thought pipeline is a generic hw term, so we proposed the word
> "pipeline". We are open to any term that is generic enough.
>
>>
>> I realize there are a couple of core pipeline-specific pieces
>> of information coming out of it, but the vast majority
>> are addresses, latencies of various components in the memory
>> hierarchy, and various component hit/miss bits.
>
> Yes, we should capture core pipeline specific details. For example,
> IBS generates Branch unit information (IbsOpData1) and Icache related
> data (IbsFetchCtl), which is something that shouldn't be extended as
> part of perf-mem, IMO.

Sure, IBS Op-side output is more 'perf mem' friendly, and so it
should populate perf_mem_data_src fields, just like POWER9 can:

union perf_mem_data_src {
...
                __u64   mem_rsvd:24,
                        mem_snoopx:2,   /* snoop mode, ext */
                        mem_remote:1,   /* remote */
                        mem_lvl_num:4,  /* memory hierarchy level number */
                        mem_dtlb:7,     /* tlb access */
                        mem_lock:2,     /* lock instr */
                        mem_snoop:5,    /* snoop mode */
                        mem_lvl:14,     /* memory hierarchy level */
                        mem_op:5;       /* type of opcode */

E.g., SIER[LDST] and SIER[A_XLATE_SRC] can be used to populate
mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
'mem_lock', and the Reload Bus Source Encoding bits can
be used to populate mem_snoop, right?

For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
used for the ld/st target addresses, too.
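
To make the data_src mapping concrete, here is a minimal sketch that packs
already-decoded fields into one 64-bit word, using the bit widths from the
union quoted above with mem_op in the least significant bits. The struct
decoded_sample and both helper names are hypothetical, and the step that
derives these values from SIER or IBS registers is deliberately left out,
since it is vendor-specific.

```c
#include <assert.h>
#include <stdint.h>

/* Bit widths taken from the union quoted above, least significant first. */
enum {
	OP_BITS = 5, LVL_BITS = 14, SNOOP_BITS = 5,
	LOCK_BITS = 2, DTLB_BITS = 7, LVL_NUM_BITS = 4
};

/* Hypothetical, already-decoded view of one vendor sample. */
struct decoded_sample {
	uint64_t op;		/* type of opcode */
	uint64_t lvl;		/* memory hierarchy level */
	uint64_t snoop;		/* snoop mode */
	uint64_t lock;		/* lock instr */
	uint64_t dtlb;		/* tlb access */
	uint64_t lvl_num;	/* memory hierarchy level number */
};

/* OR one field into the word at *shift, then advance *shift past it. */
static uint64_t put_field(uint64_t word, unsigned *shift, unsigned bits,
			  uint64_t val)
{
	assert(val < (UINT64_C(1) << bits));	/* value must fit its width */
	word |= val << *shift;
	*shift += bits;
	return word;
}

/* Pack the decoded fields in the order op, lvl, snoop, lock, dtlb, lvl_num. */
static uint64_t pack_data_src(const struct decoded_sample *s)
{
	unsigned shift = 0;
	uint64_t w = 0;

	w = put_field(w, &shift, OP_BITS, s->op);
	w = put_field(w, &shift, LVL_BITS, s->lvl);
	w = put_field(w, &shift, SNOOP_BITS, s->snoop);
	w = put_field(w, &shift, LOCK_BITS, s->lock);
	w = put_field(w, &shift, DTLB_BITS, s->dtlb);
	w = put_field(w, &shift, LVL_NUM_BITS, s->lvl_num);
	return w;
}
```

The remaining fields (mem_remote, mem_snoopx) would follow the same pattern;
they are omitted only to keep the sketch short.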
>> What's needed here is a vendor-specific extended
>> sample information that all these technologies gather,
>> of which things like e.g., 'L1 TLB cycle latency' we
>> all should have in common.
>
> Yes. We will include fields to capture the latency cycles (like Issue
> latency, Instruction completion latency etc..) along with other pipeline
> details in the proposed structure.

Latency figures are just an example, and from what I
can tell, struct perf_sample_data already has a 'weight' member,
used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
transfer memory access latency figures. Granted, that's
a bad name given all other vendors don't call latency
'weight'.
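
As a rough illustration of how much is already covered by existing sample
types, the sketch below fills a perf_event_attr that asks for the data
address, the weight (latency) value and the data source on every sample.
The helper name init_sample_attr is hypothetical; the event type and config
are left zeroed for the caller to set, since they are vendor-specific, and
nothing here calls perf_event_open().

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/* Hypothetical helper: request address, weight and data source per sample. */
static void init_sample_attr(struct perf_event_attr *attr,
			     unsigned long long period)
{
	assert(attr != NULL);

	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->sample_period = period;
	attr->sample_type = PERF_SAMPLE_IP |
			    PERF_SAMPLE_ADDR |		/* ld/st target address */
			    PERF_SAMPLE_WEIGHT |	/* latency, despite the name */
			    PERF_SAMPLE_DATA_SRC;	/* perf_mem_data_src word */
}
```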
I didn't see any latency figures coming out of POWER9,
and do not expect this patch series to implement those
of other vendors, e.g., AMD's IBS; leave each vendor
to amend perf to suit their own h/w output please.

My main point there, however, was that each vendor should
use streamlined record-level code to just copy the data
in the proprietary format that their hardware produces,
and then perf tooling can synthesize the events
from the raw data at report/script/etc. time.
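
The split between cheap record-time copying and report-time decoding can be
sketched roughly as follows. Everything in it is hypothetical: the vendor
ids, the raw_sample layout, and in particular the bit positions used in
synthesize() are invented for illustration and do not correspond to SIER,
IBS or SPE register layouts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum raw_vendor { RAW_VENDOR_A = 1, RAW_VENDOR_B = 2 };	/* hypothetical ids */

#define RAW_MAX_REGS 8

/* What gets written at record time: a tag plus untouched register values. */
struct raw_sample {
	uint32_t vendor;
	uint32_t nr_regs;
	uint64_t regs[RAW_MAX_REGS];
};

/* What tooling produces later, in a vendor-neutral form. */
struct synth_event {
	int is_load;
	uint32_t latency_cycles;
};

/* Record time: copy only, no interpretation of the register contents. */
static void record_raw(struct raw_sample *out, uint32_t vendor,
		       const uint64_t *regs, uint32_t nr)
{
	assert(nr <= RAW_MAX_REGS);
	memset(out, 0, sizeof(*out));
	out->vendor = vendor;
	out->nr_regs = nr;
	memcpy(out->regs, regs, nr * sizeof(regs[0]));
}

/* Report time: per-vendor decode; returns 0 on success, -1 otherwise. */
static int synthesize(const struct raw_sample *in, struct synth_event *ev)
{
	switch (in->vendor) {
	case RAW_VENDOR_A:	/* invented layout: bit 0 = load, bits 16..31 = latency */
		if (in->nr_regs < 1)
			return -1;
		ev->is_load = (int)(in->regs[0] & 1);
		ev->latency_cycles = (uint32_t)((in->regs[0] >> 16) & 0xffff);
		return 0;
	case RAW_VENDOR_B:	/* invented layout: reg 1 bit 63 = load, reg 0 low 16 = latency */
		if (in->nr_regs < 2)
			return -1;
		ev->is_load = (int)((in->regs[1] >> 63) & 1);
		ev->latency_cycles = (uint32_t)(in->regs[0] & 0xffff);
		return 0;
	default:
		return -1;	/* unknown vendor: leave the sample undecoded */
	}
}
```

The point of the split is that record_raw() stays the same for every vendor,
while all format knowledge lives in the decode step that runs in userspace.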
>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>> either. Can we use PERF_SAMPLE_AUX instead?
>
> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended for when
> a large volume of data needs to be captured as part of perf.data without
> frequent PMIs. But the proposed type is to address the capture of pipeline

SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
PMIs are, even though it may be used in those environments.

> information on each sample using PMI at periodic intervals. Hence proposing
> PERF_SAMPLE_PIPELINE_HAZ.

And that's fine for any extra bits that POWER9 has to convey
to its users beyond things already represented by other sample
types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
and other vendors' data (e.g., AMD IBS) can be made vendor-independent
at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
what IBS currently uses.

>> Take a look at
>> commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
>> definitions". The sample identifier can be used to determine
>> which vendor's sampling IP's data is in it, and events can
>> be recorded just by copying the content of the SIER, etc.
>> registers, and then events get synthesized from the aux
>> sample at report/inject/annotate etc. time. This allows
>> for less sample recording overhead, and moves all the vendor
>> specific decoding and common event conversions for userspace
>> to figure out.
>
> When AUX buffer data is structured, tool side changes added to present the
> pipeline data can be re-used.

Not sure I understand: AUX data would be structured on
each vendor's raw h/w register formats.

Thanks,

Kim

>>>>> Also worth considering is the support of ARM SPE (Statistical
>>>>> Profiling Extension) which is their version of IBS.
>>>>> Whatever gets added needs to cover all three with no limitations.
>>>>
>>>> I thought Intel's various LBR, PEBS, and PT supported providing
>>>> similar sample data in perf already, like with perf mem/c2c?
>>>
>>> perf-mem is more data-centric in my opinion. It is more towards
>>> memory profiling. So the proposal here is to expose pipeline related
>>> details like stalls and latencies.
>>
>> Like I said, I don't see it that way, I see it as "any particular
>> vendor's event's extended details", and these pipeline details
>> have overlap with existing infrastructure within perf, e.g., L2
>> cache misses.
>>
>> Kim
>>
>