Re: [PATCH 2/6] perf: Support branch events logging
From: Peter Zijlstra
Date: Fri Apr 14 2023 - 10:53:45 EST
On Fri, Apr 14, 2023 at 09:35:37AM -0400, Liang, Kan wrote:
>
>
> On 2023-04-14 6:38 a.m., Peter Zijlstra wrote:
> > On Mon, Apr 10, 2023 at 01:43:48PM -0700, kan.liang@xxxxxxxxxxxxxxx wrote:
> >> From: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
> >>
> >> With the cycle time information between branches, stalls can be easily
> >> observed. But it's difficult to explain what causes the long delay.
> >>
> >> Add a new field to collect the occurrences of events since the last
> >> branch entry, which can be used to provide some causality information
> >> for the cycle time values currently recorded in branches.
> >>
> >> Add a new branch sample type to indicate whether to include occurrences
> >> of events in branch info.
> >>
> >> Only support up to 4 events with saturating at value 3.
> >> In the current kernel, the events are ordered by either the counter
> >> index or the enabling sequence. But none of the order information is
> >> available to the user space tool.
> >> Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, and generic
> >> support to dump the event IDs of the branch events.
> >> Add a helper function to detect the branch event flag.
> >> These will be used in the following patch.
> >
> > I'm having trouble reverse engineering this. Can you more coherently
> > explain this feature and how you've implemented it?
>
> Sorry for that.
>
> The feature is an enhancement of ARCH LBR. It adds new fields in the
> LBR_INFO MSRs to log the occurrences of events on the first 4 GP
> counters. Together with the previous timed LBR feature, the user can
> understand not only the latency between two LBR blocks, but also which
> events cause the stall.
>
> The spec can be found in the latest Intel® Architecture Instruction Set
> Extensions and Future Features, v048, Chapter 8.4.
> https://cdrdv2.intel.com/v1/dl/getContent/671368
Oh gawd; that's terse. Why can't these people write comprehensible
things :/ It's almost as if they don't want this stuff to be used.
So IA32_LBR_x_INFO is extended:
  [0:15]   CYC_CNT
  [16:31]  undefined
+ [32:33]  PMC0_CNT
+ [34:35]  PMC1_CNT
+ [36:37]  PMC2_CNT
+ [38:39]  PMC3_CNT
+ [40:41]  PMC4_CNT
+ [42:43]  PMC5_CNT
+ [44:45]  PMC6_CNT
+ [46:47]  PMC7_CNT
  [48:55]  undefined
  [56:59]  BR_TYPE
  [60]     CYC_CNT_VALID
  [61]     TSX_ABORT
PMCs. And we'll run out of bits if we get more than 12 PMCs. Is SMT=n
PMC merging still a thing?
And for some reason this counting is enabled in PERFEVTSELx[35] instead
of in LBR_CTL somewhere :/
> To support the feature, there are three main changes in ABIs.
> - A new branch sample type, PERF_SAMPLE_BRANCH_EVENT, is used as a knob
> to enable the feature.
> - Extend the struct perf_branch_entry layout, because we have to save
> and pass the occurrences of events to user space. Since it's only
> available for 4 counters and saturating at value 3, it only occupies 8
> bits. For the current Intel implementation, the order is the order of
> counters.
Only for 4? Where does it say that? If it were to only support 4, then
we're in counter scheduling constraint hell again and we need to somehow
group all these things together with the LBR event.
@@ -1410,6 +1423,10 @@ union perf_mem_data_src {
  *    cycles: cycles from last branch (or 0 if not supported)
  *      type: branch type
  *      spec: branch speculation info (or 0 if not supported)
+ *    events: occurrences of events since the last branch entry.
+ *            The fields can store up to 4 events with saturating
+ *            at value 3.
+ *            (or 0 if not supported)
  */
 struct perf_branch_entry {
 	__u64	from;
@@ -1423,7 +1440,8 @@ struct perf_branch_entry {
 		spec:2,     /* branch speculation info */
 		new_type:4, /* additional branch type */
 		priv:3,     /* privilege level */
-		reserved:31;
+		events:8,   /* occurrences of events since the last branch entry */
+		reserved:23;
 };

 union perf_sample_weight {
This seems properly terrible from an interface pov. What if the next
generation of silicon extends this to all 8 PMCs or another architecture
comes along that does this with 3 bits per counter etc...
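For illustration, this is roughly what a user space consumer of the proposed events:8 field would have to look like. It is only a sketch: the names are hypothetical, and the patch does not say which end of the field holds the first event, so the low-bits-first ordering here is an assumption.

```c
#include <stdint.h>

/* Hypothetical constants: both the 4 and the 2 get baked into the ABI. */
#define BR_EVENTS_NR    4   /* events per branch entry */
#define BR_EVENTS_BITS  2   /* bits per event */
#define BR_EVENTS_SAT   3   /* saturation value, also the per-event mask */

struct br_event_cnt {
	unsigned int count;     /* 0..3 */
	int saturated;          /* non-zero when count means "3 or more" */
};

/*
 * Extract the count for event 'idx' from the packed 8-bit field.
 * Assumes event 0 lives in the low two bits (not stated in the patch).
 */
static inline struct br_event_cnt br_events_get(uint8_t events, unsigned int idx)
{
	struct br_event_cnt c = { 0, 0 };

	if (idx >= BR_EVENTS_NR)
		return c;

	c.count = (events >> (idx * BR_EVENTS_BITS)) & BR_EVENTS_SAT;
	c.saturated = (c.count == BR_EVENTS_SAT);
	return c;
}
```

Every constant in there would have to change for 8 counters or for 3 bits per counter, which is the problem with encoding the current hardware limits directly in struct perf_branch_entry.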
> - Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, to dump
> the order information. User space tool doesn't understand the order of
> counters. So it cannot map the new fields in struct perf_branch_entry to
> a specific event. We have to dump the order information.
Sorry; I can't parse this.