Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

From: maddy
Date: Tue Mar 17 2020 - 02:50:50 EST




On 3/13/20 4:08 AM, Kim Phillips wrote:
On 3/11/20 11:00 AM, Ravi Bangoria wrote:
Hi Kim,
Hi Ravi,

On 3/6/20 3:36 AM, Kim Phillips wrote:
On 3/3/20 3:55 AM, Kim Phillips wrote:
On 3/2/20 2:21 PM, Stephane Eranian wrote:
On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
Modern processors export such hazard data in Performance
Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
AMD[3] provides similar information.

Implementation detail:

A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
If it's set, kernel converts arch specific hazard information
into generic format:

    struct perf_pipeline_haz_data {
            /* Instruction/Opcode type: Load, Store, Branch .... */
            __u8    itype;
            /* Instruction Cache source */
            __u8    icache;
            /* Instruction suffered hazard in pipeline stage */
            __u8    hazard_stage;
            /* Hazard reason */
            __u8    hazard_reason;
            /* Instruction suffered stall in pipeline stage */
            __u8    stall_stage;
            /* Stall reason */
            __u8    stall_reason;
            __u16   pad;
    };
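
For readers following along, here is a minimal userspace sketch of the layout proposed above. It is not part of the patchset: uint8_t/uint16_t stand in for __u8/__u16, and haz_from_bytes() is a hypothetical helper name used only for illustration. It just checks that the six one-byte fields plus the pad add up to 8 bytes, and copies one such record out of a sample buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the proposed struct (fixed-width stand-ins for __u8/__u16). */
struct perf_pipeline_haz_data {
	uint8_t  itype;         /* Instruction/Opcode type */
	uint8_t  icache;        /* Instruction Cache source */
	uint8_t  hazard_stage;  /* Pipeline stage where the hazard occurred */
	uint8_t  hazard_reason; /* Hazard reason */
	uint8_t  stall_stage;   /* Pipeline stage where the stall occurred */
	uint8_t  stall_reason;  /* Stall reason */
	uint16_t pad;
};

/* Six u8 fields followed by one u16: 8 bytes, with pad at offset 6. */
static_assert(sizeof(struct perf_pipeline_haz_data) == 8, "unexpected size");
static_assert(offsetof(struct perf_pipeline_haz_data, pad) == 6, "unexpected pad offset");

/* Hypothetical helper: copy one record out of a raw sample payload. */
static struct perf_pipeline_haz_data haz_from_bytes(const void *buf)
{
	struct perf_pipeline_haz_data d;

	/* memcpy avoids alignment and aliasing problems with the source buffer. */
	memcpy(&d, buf, sizeof(d));
	return d;
}
```

The meaning of each field value is arch specific in the proposal, so the sketch deliberately does not interpret them.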
Kim, does this format indeed work for AMD IBS?
It's not really 1:1, we don't have these separations of stages
and reasons; for example, we have 'missed in L2 cache'.
So IBS output is flatter, with more cycle latency figures than
IBM's AFAICT.
AMD IBS captures pipeline latency data in the case of Fetch sampling, like the
Fetch latency, tag-to-retire latency, completion-to-retire latency and
so on. Yes, Ops sampling does provide more load/store-centric
information. But it also captures more detailed data for Branch instructions.
And we also looked at ARM SPE, which also captures more detailed pipeline
data and latency information.

Personally, I don't like the term hazard. This is too IBM Power
specific. We need to find a better term, maybe stall or penalty.
Right, IBS doesn't have a filter to only count stalled or otherwise
bad events. IBS' PPR descriptions have one occurrence of the
word stall, and no penalty. The way I read IBS is it's just
reporting more sample data than just the precise IP: things like
hits, misses, cycle latencies, addresses, types, etc., so words
like 'extended', or the 'auxiliary' already used today even
are more appropriate for IBS, although I'm the last person to
bikeshed.
We are thinking of using the word "pipeline" instead of "hazard".
Hm, the word 'pipeline' occurs 0 times in IBS documentation.
NP. We thought pipeline is a generic hw term, so we proposed the word
"pipeline". We are open to any term that is generic enough.

I realize there are a couple of core pipeline-specific pieces
of information coming out of it, but the vast majority
are addresses, latencies of various components in the memory
hierarchy, and various component hit/miss bits.
Yes. We should capture core pipeline-specific details. For example,
IBS generates Branch unit information (IbsOpData1) and Icache-related
data (IbsFetchCtl), which is something that shouldn't be extended as
part of perf-mem, IMO.
Sure, IBS Op-side output is more 'perf mem' friendly, and so it
should populate perf_mem_data_src fields, just like POWER9 can:

union perf_mem_data_src {
...
        __u64   mem_rsvd:24,
                mem_snoopx:2,   /* snoop mode, ext */
                mem_remote:1,   /* remote */
                mem_lvl_num:4,  /* memory hierarchy level number */
                mem_dtlb:7,     /* tlb access */
                mem_lock:2,     /* lock instr */
                mem_snoop:5,    /* snoop mode */
                mem_lvl:14,     /* memory hierarchy level */
                mem_op:5;       /* type of opcode */


E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
'mem_lock', and the Reload Bus Source Encoding bits can
be used to populate mem_snoop, right?
Hi Kim,

Yes. We do expose these data as part of perf-mem for POWER.


For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
used for the ld/st target addresses, too.

What's needed here is a vendor-specific extended
sample information that all these technologies gather,
of which things like e.g., 'L1 TLB cycle latency' we
all should have in common.
Yes. We will include fields to capture the latency cycles (like issue
latency, instruction completion latency, etc.) along with other pipeline
details in the proposed structure.
Latency figures are just an example, and from what I
can tell, struct perf_sample_data already has a 'weight' member,
used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
transfer memory access latency figures. Granted, that's
a bad name given all other vendors don't call latency
'weight'.

I didn't see any latency figures coming out of POWER9,
and do not expect this patchseries to implement those
of other vendors, e.g., AMD's IBS; leave each vendor
to amend perf to suit their own h/w output please.

The reference structure proposed in this patchset did not have members
to capture latency info for that exact reason. But the idea here is to
abstract as much vendor-specific data as possible. So if we include a u16
array, then this format can also capture data from IBS, since it provides
a few latency details.



My main point there, however, was that each vendor should
use streamlined record-level code to just copy the data
in the proprietary format that their hardware produces,
and then perf tooling can synthesize the events
from the raw data at report/script/etc. time.

I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
either. Can we use PERF_SAMPLE_AUX instead?
We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended for when
a large volume of data needs to be captured as part of perf.data without
frequent PMIs. But the proposed type is to address the capture of pipeline
SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
PMIs are, even though it may be used in those environments.

information on each sample using PMI at periodic intervals. Hence proposing
PERF_SAMPLE_PIPELINE_HAZ.
And that's fine for any extra bits that POWER9 has to convey
to its users beyond things already represented by other sample
types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
and other vendor e.g., AMD IBS data can be made vendor-independent
at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
what IBS currently uses.

My bad. Not sure what you mean by this. We are trying to abstract
as much vendor specific data as possible with this (like perf-mem).


Maddy

Take a look at
commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
definitions". The sample identifier can be used to determine
which vendor's sampling IP's data is in it, and events can
be recorded just by copying the content of the SIER, etc.
registers, and then events get synthesized from the aux
sample at report/inject/annotate etc. time. This allows
for less sample recording overhead, and leaves all the vendor-
specific decoding and common event conversions for userspace
to figure out.
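
A rough sketch of that split, for illustration only: the record side copies register contents verbatim, and the report side picks a decoder by vendor. Nothing here is perf's actual code. The vendor tags, struct names, function names and bit positions are all made up, and a real implementation would key off the sample identifier and each vendor's documented register layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical vendor tags. */
enum raw_vendor { RAW_VENDOR_A = 1, RAW_VENDOR_B = 2 };

/* Record-time payload: register contents copied as-is, no decoding. */
struct raw_sample {
	uint32_t vendor;   /* selects the report-time decoder */
	uint32_t nr_regs;  /* number of valid entries in regs[] */
	uint64_t regs[4];
};
static_assert(sizeof(struct raw_sample) == 40, "fixed-size payload");

/* Generic event synthesized at report time (hypothetical fields). */
struct synth_event {
	uint8_t  is_load;
	uint16_t latency;
};

/* Record side: a bounded copy and nothing else. */
static void record_raw(struct raw_sample *out, uint32_t vendor,
		       const uint64_t *regs, uint32_t nr)
{
	if (nr > 4)
		nr = 4;
	out->vendor = vendor;
	out->nr_regs = nr;
	memset(out->regs, 0, sizeof(out->regs));
	memcpy(out->regs, regs, nr * sizeof(regs[0]));
}

/* Report side: invented bit layouts, one decoder per vendor. */
static int decode_vendor_a(const struct raw_sample *s, struct synth_event *ev)
{
	if (s->nr_regs < 1)
		return -1;
	ev->is_load = (uint8_t)(s->regs[0] & 1);
	ev->latency = (uint16_t)((s->regs[0] >> 16) & 0xffff);
	return 0;
}

static int decode_vendor_b(const struct raw_sample *s, struct synth_event *ev)
{
	if (s->nr_regs < 2)
		return -1;
	ev->is_load = (uint8_t)((s->regs[0] >> 63) & 1);
	ev->latency = (uint16_t)(s->regs[1] & 0xffff);
	return 0;
}

/* Dispatch on the vendor tag; unknown vendors are rejected. */
static int synthesize(const struct raw_sample *s, struct synth_event *ev)
{
	switch (s->vendor) {
	case RAW_VENDOR_A: return decode_vendor_a(s, ev);
	case RAW_VENDOR_B: return decode_vendor_b(s, ev);
	default:           return -1;
	}
}
```

The point of the split is that record_raw() stays cheap and vendor-neutral, while supporting a new vendor only means adding another decoder on the tool side.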
When AUX buffer data is structured, tool side changes added to present the
pipeline data can be re-used.
Not sure I understand: AUX data would be structured on
each vendor's raw h/w register formats.

Thanks,

Kim

Also worth considering is the support of ARM SPE (Statistical
Profiling Extension) which is their version of IBS.
Whatever gets added needs to cover all three with no limitations.
I thought Intel's various LBR, PEBS, and PT supported providing
similar sample data in perf already, like with perf mem/c2c?
perf-mem is more data-centric, in my opinion. It is geared more towards
memory profiling. So the proposal here is to expose pipeline-related
details like stalls and latencies.
Like I said, I don't see it that way, I see it as "any particular
vendor's event's extended details", and these pipeline details
have overlap with existing infrastructure within perf, e.g., L2
cache misses.

Kim