Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

From: Ravi Bangoria
Date: Thu Mar 05 2020 - 00:06:51 EST


Hi Paul,

Sorry for the slightly late reply.

On 3/3/20 2:38 AM, Paul Clarke wrote:
On 3/1/20 11:23 PM, Ravi Bangoria wrote:
Most modern microprocessors employ complex instruction execution
pipelines such that many instructions can be 'in flight' at any
given point in time. Various factors affect this pipeline and
hazards are the primary among them. Different types of hazards
exist - Data hazards, Structural hazards and Control hazards.
A data hazard occurs when data dependencies exist between
instructions in different stages of the pipeline. A structural
hazard occurs when the same processor hardware is needed by more
than one instruction in flight at the same time. Control hazards
mostly arise from branch mispredictions.

Information about these hazards is critical for analyzing
performance issues and for tuning software to overcome such
issues. Modern processors export such hazard data in Performance
Monitoring Unit (PMU) registers. For example, the 'Sampled Instruction
Event Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling'
on AMD[3] provide similar information.

Implementation detail:

A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
If it's set, the kernel converts arch-specific hazard information
into a generic format:

struct perf_pipeline_haz_data {
	/* Instruction/Opcode type: Load, Store, Branch .... */
	__u8 itype;

At the risk of bike-shedding (in an RFC, no less), "itype" doesn't convey enough meaning to me. "inst_type"? I see in 03/11, you use "perf_inst_type".

I was thinking of renaming itype to operation_type or op_type, because
AMD IBS and ARM SPE observe micro-ops, and op_type also fits better
with pipeline terminology.


	/* Instruction Cache source */
	__u8 icache;

Possibly same here, and you use "perf_inst_cache" in 03/11.

Sure.


	/* Instruction suffered hazard in pipeline stage */
	__u8 hazard_stage;
	/* Hazard reason */
	__u8 hazard_reason;
	/* Instruction suffered stall in pipeline stage */
	__u8 stall_stage;
	/* Stall reason */
	__u8 stall_reason;
	__u16 pad;
};

... which the user can read from the mmap() ring buffer. With this
approach, a sample perf report in hazard mode looks like this (on IBM
PowerPC):

# ./perf record --hazard ./ebizzy
# ./perf report --hazard
Overhead  Symbol          Shared  Instruction Type  Hazard Stage  Hazard Reason         Stall Stage  Stall Reason  ICache access
  36.58%  [.] thread_run  ebizzy  Load              LSU           Mispredict            LSU          Load fin      L1 hit
   9.46%  [.] thread_run  ebizzy  Load              LSU           Mispredict            LSU          Dcache_miss   L1 hit
   1.76%  [.] thread_run  ebizzy  Fixed point       -             -                     -            -             L1 hit
   1.31%  [.] thread_run  ebizzy  Load              LSU           ERAT Miss             LSU          Load fin      L1 hit
   1.27%  [.] thread_run  ebizzy  Load              LSU           Mispredict            -            -             L1 hit
   1.16%  [.] thread_run  ebizzy  Fixed point       -             -                     FXU          Fixed cycle   L1 hit
   0.50%  [.] thread_run  ebizzy  Fixed point       ISU           Source Unavailable    FXU          Fixed cycle   L1 hit
   0.30%  [.] thread_run  ebizzy  Load              LSU           LMQ Full, DERAT Miss  LSU          Load fin      L1 hit
   0.24%  [.] thread_run  ebizzy  Load              LSU           ERAT Miss             -            -             L1 hit
   0.08%  [.] thread_run  ebizzy  -                 -             -                     BRU          Fixed cycle   L1 hit
   0.05%  [.] thread_run  ebizzy  Branch            -             -                     BRU          Fixed cycle   L1 hit
   0.04%  [.] thread_run  ebizzy  Fixed point       ISU           Source Unavailable    -            -             L1 hit
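
For anyone who wants to poke at the proposed record from user space, here is a minimal sketch of its layout. It only mirrors the field order from the struct quoted above, using <stdint.h> types instead of the kernel's __u8/__u16. The names pipeline_haz_sketch and haz_decode() are made up for illustration and are not part of the series, and how the record is located inside a perf sample in the ring buffer is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustrative copy of the proposed record (assumption: same field
 * order and widths as struct perf_pipeline_haz_data in the RFC).
 */
struct pipeline_haz_sketch {
	uint8_t  itype;          /* instruction/opcode type */
	uint8_t  icache;         /* instruction cache source */
	uint8_t  hazard_stage;   /* pipeline stage where the hazard occurred */
	uint8_t  hazard_reason;
	uint8_t  stall_stage;    /* pipeline stage where the stall occurred */
	uint8_t  stall_reason;
	uint16_t pad;
};

/* Six one-byte fields plus a 16-bit pad give a fixed 8-byte record. */
static_assert(sizeof(struct pipeline_haz_sketch) == 8,
	      "hazard record is expected to be 8 bytes");

/*
 * Copy one record out of a raw byte buffer. memcpy() avoids any
 * alignment assumptions about where the record sits in the buffer.
 */
static struct pipeline_haz_sketch haz_decode(const unsigned char *buf)
{
	struct pipeline_haz_sketch h;

	memcpy(&h, buf, sizeof(h));
	return h;
}
```

Given a pointer to the eight record bytes, haz_decode(ptr).stall_reason yields the raw stall-reason code; mapping the codes to names is up to the arch-specific tables in the series.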

How are these to be interpreted? This is great information, but is it possible to make it more readable for non-experts?

For the RFC proposal we just pulled the details from the spec. But yes, will
look into this.

If each of these map 1:1 with hardware events, should you emit the name of the event here, so that can be used to look up further information? For example, does the first line map to PM_CMPLU_STALL_LSU_FIN?

I'm using the PM_MRK_INST_CMPL event in perf record, and SIER provides
all of this information.

What was "Mispredict[ed]"? (Is it different from a branch misprediction?) And how does this relate to "L1 hit"?

I'm not 100% sure. I'll check with the hw folks about it.

Can we emit "Load finish" instead of "Load fin" for easier reading? 03/11 also has "Marked fin before NTC".
Nit: why does "Dcache_miss" have an underscore and none of the others?

Sure. Will change it.
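
One way to keep the names consistent would be a single lookup table. The sketch below is only an illustration: the enum values and identifiers are hypothetical (the real encoding is in patch 03/11, which is not reproduced here), and the strings are the spelled-out forms suggested above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical codes, for illustration only; not the encoding from 03/11. */
enum stall_reason_sketch {
	STALL_SKETCH_NONE = 0,
	STALL_SKETCH_LOAD_FINISH,
	STALL_SKETCH_DCACHE_MISS,
	STALL_SKETCH_FIXED_CYCLE,
	STALL_SKETCH_STORE,
	STALL_SKETCH_MAX
};

/* One table for all names: spelled out, and no underscores anywhere. */
static const char *const stall_reason_names[] = {
	[STALL_SKETCH_NONE]        = "-",
	[STALL_SKETCH_LOAD_FINISH] = "Load finish",
	[STALL_SKETCH_DCACHE_MISS] = "Dcache miss",
	[STALL_SKETCH_FIXED_CYCLE] = "Fixed cycle",
	[STALL_SKETCH_STORE]       = "Store",
};

/* Catch a code added to the enum without a matching name. */
static_assert(sizeof(stall_reason_names) / sizeof(stall_reason_names[0]) ==
	      STALL_SKETCH_MAX, "name table out of sync with enum");

/* Return a printable name; unknown codes fall back to "-". */
static const char *stall_reason_str(uint8_t code)
{
	if (code >= STALL_SKETCH_MAX)
		return "-";
	return stall_reason_names[code];
}
```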


Also perf annotate with hazard data:

      │ static int
      │ compare(const void *p1, const void *p2)
      │ {
33.23 │ std r31,-8(r1)
      │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
      │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
      │ {haz_stage: LSU, haz_reason: Load Hit Store, stall_stage: LSU, stall_reason: -, icache: L3 hit}
      │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
      │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
      │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
 0.84 │ stdu r1,-64(r1)
      │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
 0.24 │ mr r31,r1
      │ {haz_stage: -, haz_reason: -, stall_stage: -, stall_reason: -, icache: L1 hit}
21.18 │ std r3,32(r31)
      │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
      │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
      │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
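
The per-sample lines in the annotate output above all have the same shape, so they could be produced by a single formatter. This is only a sketch of that idea: haz_annotate_line() and dash_if_empty() are hypothetical names, not functions from the perf tool or from this series, and the field strings are assumed to have been resolved already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Print "-" for a field that has no value, as in the output above. */
static const char *dash_if_empty(const char *s)
{
	return (s && s[0] != '\0') ? s : "-";
}

/*
 * Format one hazard annotation line into buf.
 * Returns what snprintf() returns: the length the full line needs,
 * which is >= len if the output was truncated.
 */
static int haz_annotate_line(char *buf, size_t len,
			     const char *haz_stage, const char *haz_reason,
			     const char *stall_stage, const char *stall_reason,
			     const char *icache)
{
	return snprintf(buf, len,
			"{haz_stage: %s, haz_reason: %s, stall_stage: %s, "
			"stall_reason: %s, icache: %s}",
			dash_if_empty(haz_stage), dash_if_empty(haz_reason),
			dash_if_empty(stall_stage), dash_if_empty(stall_reason),
			dash_if_empty(icache));
}
```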


Patches:
- Patch #1 is a simple cleanup patch
- Patches #2, #3, #4 implement generic and arch-specific kernel
infrastructure
- Patch #5 enables perf record and script with hazard mode
- Patches #6, #7, #8 enable perf report with hazard mode
- Patches #9, #10, #11 enable perf annotate with hazard mode

Note:
- This series is based on the talk by Madhavan at LPC 2018[4]. This is
just an early RFC to get comments about the approach and is not intended
to be merged yet.
- I've prepared the series based on v5.6-rc3. But it depends on generic
perf annotate fixes [5][6] which are already merged by Arnaldo in
perf/urgent and perf/core.

[1]: Book III, Section 9.4.10:
https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
[2]: https://wiki.raptorcs.com/w/images/6/6b/POWER9_PMU_UG_v12_28NOV2018_pub.pdf#G9.1106986

This document is also available from the "IBM Portal for OpenPOWER" under the "All IBM Material for OpenPOWER" https://www-355.ibm.com/systems/power/openpower/tgcmDocumentRepository.xhtml?aliasId=OpenPOWER, under each of the individual modules. (Well hidden, it might be said, and not a simple link like you have here.)

Thanks for pointing it out :)
Ravi