On 3/27/20 1:18 AM, Kim Phillips wrote:
On 3/26/20 5:19 AM, maddy wrote:
The "it may vary depending on ... hardware" clause makes it sound
On 3/18/20 11:05 PM, Kim Phillips wrote:
Hi Maddy,
Hi Kim,
On 3/17/20 1:50 AM, maddy wrote:
On 3/13/20 4:08 AM, Kim Phillips wrote:
OK, I see relevant PERF_MEM_S bits in arch/powerpc/perf/isa207-common.c:
On 3/11/20 11:00 AM, Ravi Bangoria wrote:
Hi Kim,
On 3/6/20 3:36 AM, Kim Phillips wrote:
Sure, IBS Op-side output is more 'perf mem' friendly, and so it
On 3/3/20 3:55 AM, Kim Phillips wrote:
Hm, the word 'pipeline' occurs 0 times in IBS documentation.
NP. We thought pipeline is generic hw term so we proposed "pipeline"
On 3/2/20 2:21 PM, Stephane Eranian wrote:
AMD IBS captures pipeline latency data in case of Fetch sampling like the
On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
Kim, does this format indeed work for AMD IBS?
It's not really 1:1, we don't have these separations of stages
On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
Modern processors export such hazard data in Performance
Monitoring Unit (PMU) registers. E.g., 'Sampled Instruction Event
Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
AMD[3] provide similar information.
Implementation detail:
A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
If it's set, kernel converts arch specific hazard information
into generic format:
        struct perf_pipeline_haz_data {
                /* Instruction/Opcode type: Load, Store, Branch .... */
                __u8    itype;
                /* Instruction Cache source */
                __u8    icache;
                /* Instruction suffered hazard in pipeline stage */
                __u8    hazard_stage;
                /* Hazard reason */
                __u8    hazard_reason;
                /* Instruction suffered stall in pipeline stage */
                __u8    stall_stage;
                /* Stall reason */
                __u8    stall_reason;
                __u16   pad;
        };
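To make the proposed layout concrete, here is a minimal user-space sketch of it. It mirrors the fields above with <stdint.h> types and checks that six u8 fields plus the u16 pad occupy 8 bytes. The struct name pipeline_haz_data, the haz_itype enum values, and haz_itype_name() are assumptions made up for illustration; the patchset's actual encodings are not shown in this thread.

```c
#include <stdint.h>

/* User-space mirror of the proposed record; not taken from a merged header. */
struct pipeline_haz_data {
	uint8_t  itype;          /* instruction/opcode type */
	uint8_t  icache;         /* instruction cache source */
	uint8_t  hazard_stage;   /* pipeline stage where the hazard occurred */
	uint8_t  hazard_reason;
	uint8_t  stall_stage;    /* pipeline stage where the stall occurred */
	uint8_t  stall_reason;
	uint16_t pad;
};

/* Six u8 fields plus one u16 pad should give exactly 8 bytes. */
_Static_assert(sizeof(struct pipeline_haz_data) == 8, "unexpected padding");

/* Hypothetical itype encoding, for illustration only. */
enum haz_itype {
	HAZ_ITYPE_NONE = 0,
	HAZ_ITYPE_LOAD,
	HAZ_ITYPE_STORE,
	HAZ_ITYPE_BRANCH,
};

/* Map an itype value to a printable name; unknown values are not an error. */
static const char *haz_itype_name(uint8_t itype)
{
	switch (itype) {
	case HAZ_ITYPE_LOAD:   return "load";
	case HAZ_ITYPE_STORE:  return "store";
	case HAZ_ITYPE_BRANCH: return "branch";
	default:               return "unknown";
	}
}
```

A tool would call haz_itype_name(sample.itype) when printing a sample; the stage and reason fields would need similar tables per architecture.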
and reasons: we have "missed in L2 cache", for example.
So IBS output is flatter, with more cycle latency figures than
IBM's AFAICT.
Fetch latency, tag to retire latency, completion to retire latency and
so on. Yes, Ops sampling does provide more load/store centric
information. But it also captures more detailed data for Branch instructions.
And we also looked at ARM SPE, which also captures more detailed pipeline
data and latency information.
We are thinking of using the word "pipeline" instead of "hazard".

Personally, I don't like the term hazard. This is too IBM Power
specific. We need to find a better term, maybe stall or penalty.

Right, IBS doesn't have a filter to only count stalled or otherwise
bad events. IBS' PPR descriptions have one occurrence of the
word stall, and no penalty. The way I read IBS is it's just
reporting more sample data than just the precise IP: things like
hits, misses, cycle latencies, addresses, types, etc., so words
like 'extended', or even the 'auxiliary' already used today,
are more appropriate for IBS, although I'm the last person to
bikeshed.
word. We are open to a term that is generic enough.
I realize there are a couple of core pipeline-specific pieces
of information coming out of it, but the vast majority
are addresses, latencies of various components in the memory
hierarchy, and various component hit/miss bits.

Yes. We should capture core pipeline specific details. For example,
IBS generates Branch unit information (IbsOpData1) and Icache related
data (IbsFetchCtl), which is something that shouldn't be extended as
part of perf-mem, IMO.
should populate perf_mem_data_src fields, just like POWER9 can:
union perf_mem_data_src {
...
                __u64   mem_rsvd:24,
                        mem_snoopx:2,   /* snoop mode, ext */
                        mem_remote:1,   /* remote */
                        mem_lvl_num:4,  /* memory hierarchy level number */
                        mem_dtlb:7,     /* tlb access */
                        mem_lock:2,     /* lock instr */
                        mem_snoop:5,    /* snoop mode */
                        mem_lvl:14,     /* memory hierarchy level */
                        mem_op:5;       /* type of opcode */
E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
'mem_lock', and the Reload Bus Source Encoding bits can
be used to populate mem_snoop, right?
Yes. We do expose these data as part of perf-mem for POWER.
isa207_find_source now, thanks.
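As a rough illustration of populating perf_mem_data_src from hardware bits, the sketch below builds the 64-bit value from an opcode type and a hit level. The PERF_MEM_* constants are copied here from include/uapi/linux/perf_event.h so the example stands alone; struct hw_sample and build_data_src() are hypothetical and do not reflect how isa207_find_source decodes SIER.

```c
#include <stdint.h>

/* Values mirrored from include/uapi/linux/perf_event.h. */
#define PERF_MEM_OP_LOAD    0x02
#define PERF_MEM_OP_STORE   0x04
#define PERF_MEM_OP_SHIFT   0
#define PERF_MEM_LVL_HIT    0x02
#define PERF_MEM_LVL_L1     0x08
#define PERF_MEM_LVL_L2     0x20
#define PERF_MEM_LVL_SHIFT  5

/* Hypothetical, already-decoded view of one hardware sample. */
struct hw_sample {
	int is_store;   /* non-zero for a store, zero for a load */
	int hit_level;  /* 1 = hit in L1, anything else = hit in L2 */
};

/* Build the data_src value: mem_op in bits 0-4, mem_lvl in bits 5-18. */
static uint64_t build_data_src(const struct hw_sample *s)
{
	uint64_t op  = s->is_store ? PERF_MEM_OP_STORE : PERF_MEM_OP_LOAD;
	uint64_t lvl = PERF_MEM_LVL_HIT;

	lvl |= (s->hit_level == 1) ? PERF_MEM_LVL_L1 : PERF_MEM_LVL_L2;

	return (op << PERF_MEM_OP_SHIFT) | (lvl << PERF_MEM_LVL_SHIFT);
}
```

Using shifts on the 64-bit value rather than the bitfield members sidesteps the endian-dependent field order visible in the union above.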
For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
used for the ld/st target addresses, too.

What's needed here is a vendor-specific extended
sample information that all these technologies gather,
of which things like e.g., 'L1 TLB cycle latency' we
all should have in common.

Yes. We will include fields to capture the latency cycles (like Issue
latency, Instruction completion latency etc.) along with other pipeline
details in the proposed structure.

Latency figures are just an example, and from what I
can tell, struct perf_sample_data already has a 'weight' member,
used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
transfer memory access latency figures. Granted, that's
a bad name given all other vendors don't call latency
'weight'.

I didn't see any latency figures coming out of POWER9,
and do not expect this patch series to implement those
of other vendors, e.g., AMD's IBS; leave each vendor
to amend perf to suit their own h/w output please.

Reference structure proposed in this patchset did not have members
to capture latency info for that exact reason. But the idea here is to
abstract as much vendor-specific data as possible. So if we include a
u16 array, then this format can also capture data from IBS, since it
provides a few latency details.

OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
struct presented in this patchset.
IBS Ops can report e.g.:
15 tag-to-retire cycles bits,
15 completion to retire count bits,
15 L1 DTLB refill latency bits,
15 DC miss latency bits,
5 outstanding memory requests on mem refill bits, and so on.
IBS Fetch reports 15 bits of fetch latency, and another 16
for iTLB latency, among others.
Some of these may/may not be valid simultaneously, and
there are IBS specific rules to establish validity.
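One way to picture the "u16 array" idea together with the validity caveat is a record that carries a valid bitmask next to the latency slots. This is only a sketch: struct lat_record, the lat_slot names, and the helper functions are invented for illustration, and the bit offsets passed to field() would have to come from each vendor's documentation.

```c
#include <stdint.h>

/* Hypothetical latency slots, loosely named after the list above. */
enum lat_slot {
	LAT_TAG_TO_RETIRE,
	LAT_COMP_TO_RETIRE,
	LAT_DTLB_REFILL,
	LAT_DC_MISS,
	LAT_NR
};

struct lat_record {
	uint16_t valid;            /* bit i set => cycles[i] is meaningful */
	uint16_t cycles[LAT_NR];
};

/* Extract `width` bits (width <= 16) starting at bit `lo` of a raw register. */
static uint16_t field(uint64_t reg, unsigned lo, unsigned width)
{
	return (uint16_t)((reg >> lo) & ((1ULL << width) - 1));
}

/* Store a latency and mark its slot as valid. */
static void lat_set(struct lat_record *r, enum lat_slot slot, uint16_t cycles)
{
	r->cycles[slot] = cycles;
	r->valid |= (uint16_t)(1u << slot);
}

/* Return 1 and fill *out if the slot is valid, otherwise return 0. */
static int lat_get(const struct lat_record *r, enum lat_slot slot, uint16_t *out)
{
	if (!(r->valid & (1u << slot)))
		return 0;
	*out = r->cycles[slot];
	return 1;
}
```

A consumer that checks lat_get() before printing never reports a slot the hardware did not fill in for that sample.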
My main point there, however, was that each vendor should
use streamlined record-level code to just copy the data
in the proprietary format that their hardware produces,
and then perf tooling can synthesize the events
from the raw data at report/script/etc. time.

My bad. Not sure what you mean by this. We are trying to abstract
as much vendor specific data as possible with this (like perf-mem).

Perhaps if I say it this way: instead of doing all the
isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
in patch 4/11, just put the raw SIER value in a
PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
Specific SIER capabilities can be written as part of the perf.data
header. Then synthesize the true pipe events from the raw SIER
values later, and in userspace.

I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
either. Can we use PERF_SAMPLE_AUX instead?

We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended when
large volume of data needs to be captured as part of perf.data without
frequent PMIs.

SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
PMIs are, even though it may be used in those environments.

But proposed type is to address the capture of pipeline
information on each sample using PMI at periodic intervals. Hence proposing
PERF_SAMPLE_PIPELINE_HAZ.

And that's fine for any extra bits that POWER9 has to convey
to its users beyond things already represented by other sample
types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
and other vendor e.g., AMD IBS data can be made vendor-independent
at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
what IBS currently uses.
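The "copy the raw register at record time, synthesize in userspace" suggestion could look roughly like the sketch below on the tool side. Everything here is an assumption for illustration: struct raw_record stands in for the size-plus-bytes payload of a PERF_SAMPLE_RAW sample, and the RAW_* shifts and masks are made-up field positions, not the real SIER layout.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a raw sample payload: a byte count followed by the bytes. */
struct raw_record {
	uint32_t size;
	const unsigned char *data;
};

/* Hypothetical field positions; real SIER layouts differ per CPU generation. */
#define RAW_TYPE_SHIFT   0
#define RAW_TYPE_MASK    0x7ULL
#define RAW_STAGE_SHIFT  3
#define RAW_STAGE_MASK   0xfULL

/* Event synthesized at report time from one raw register value. */
struct synth_event {
	uint8_t itype;
	uint8_t stage;
};

/* Decode one raw record; returns 0 on success, -1 if the payload is short. */
static int synthesize(const struct raw_record *rec, struct synth_event *ev)
{
	uint64_t reg;

	if (rec->size < sizeof(reg))
		return -1;

	/* memcpy avoids unaligned loads from the sample buffer. */
	memcpy(&reg, rec->data, sizeof(reg));

	ev->itype = (uint8_t)((reg >> RAW_TYPE_SHIFT) & RAW_TYPE_MASK);
	ev->stage = (uint8_t)((reg >> RAW_STAGE_SHIFT) & RAW_STAGE_MASK);
	return 0;
}
```

With this split, the kernel side only copies the register, and the masks can be chosen in the tool based on capabilities recorded in the perf.data header.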
Would like to stay away from the SAMPLE_RAW type because of these comments in perf_event.h:
*      #
*      # The RAW record below is opaque data wrt the ABI
*      #
*      # That is, the ABI doesn't make any promises wrt to
*      # the stability of its content, it may vary depending
*      # on event, hardware, kernel version and phase of
*      # the moon.
*      #
*      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
*      #
appropriate for the use-case where the raw hardware register contents
are copied directly into the user buffer.
Hi Kim,
Sorry for the delayed response.
But the perf tool side needs infrastructure to handle the raw sample
data from cpu-pmu (used by tracepoints). I am not sure whether
this is the approach we should look at here.
peterz, any comments?
Secondly, sorry I didn't understand your suggestion about using PERF_SAMPLE_AUX.
IIUC, SAMPLE_AUX will go to the AUX ring buffer, which is more memory and more
challenging when correlating and presenting the pipeline details for each IP.
IMO, having a new sample type can be useful to capture the pipeline data
in perf_sample_data and, if _AUX is enabled, it can also be made to push to
the AUX buffer.

OK, I didn't think SAMPLE_AUX and the aux ring buffer were
interdependent, sorry.
Thanks,
Kim