Re: Issue of metrics for multiple uncore PMUs (was Re: [RFC PATCH v2 23/23] perf metricgroup: remove duped metric group events)

From: John Garry
Date: Mon Oct 05 2020 - 06:06:44 EST


On 02/10/2020 21:46, Ian Rogers wrote:
On Fri, Oct 2, 2020 at 5:00 AM John Garry <john.garry@xxxxxxxxxx> wrote:

On 07/05/2020 15:08, Ian Rogers wrote:

Hi Ian,

I was wondering if you ever tested commit 2440689d62e9 ("perf
metricgroup: Remove duped metric group events") for when we have a
metric which aliases multiple instances of the same uncore PMU in the
system?

Sorry for this, I hadn't tested such a metric and wasn't aware of how
the aliasing worked. I sent a fix for this issue here:
https://lore.kernel.org/lkml/20200917201807.4090224-1-irogers@xxxxxxxxxx/
Could you see if this addresses the issue for you? I don't see the
change in Arnaldo's trees yet.

Unfortunately this does not seem to fix my issue.

So for that patch, you say you fix metric expression for DRAM_BW_Use, which is:

{
"BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
"MetricExpr": "( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) / 1000000000 ) / duration_time",
"MetricGroup": "Memory_BW",
"MetricName": "DRAM_BW_Use"
},

But this metric expression does not include any alias events; rather I think it is just cas_count_write + cas_count_read event count for PMU uncore_imc / duration_time.

When I say alias, I mean - as an example, we have event:

{
"BriefDescription": "write requests to memory controller. Derived from unc_m_cas_count.wr",
"Counter": "0,1,2,3",
"EventCode": "0x4",
"EventName": "LLC_MISSES.MEM_WRITE",
"PerPkg": "1",
"ScaleUnit": "64Bytes",
"UMask": "0xC",
"Unit": "iMC"
},

And then reference LLC_MISSES.MEM_WRITE in a metric expression:

"MetricExpr": "LLC_MISSES.MEM_WRITE / duration_time",

This is what seems to be broken for when the alias matches > 1 PMU.

Please check this.

Thanks,
John


Thanks,
Ian

I have been rebasing some of my arm64 perf work to v5.9-rc7, and find an
issue where find_evsel_group() fails for the uncore metrics under the
condition mentioned above.

Unfortunately I don't have an x86 machine to which this test applies.
However, as an experiment, I added a test metric to my broadwell JSON:

diff --git a/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json
b/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json
index 8cdc7c13dc2a..fc6d9adf996a 100644
--- a/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json
@@ -348,5 +348,11 @@
"MetricExpr": "(cstate_pkg@c7\\-residency@ / msr@tsc@) * 100",
"MetricGroup": "Power",
"MetricName": "C7_Pkg_Residency"
+ },
+ {
+ "BriefDescription": "test metric",
+ "MetricExpr": "UNC_CBO_XSNP_RESPONSE.MISS_XCORE *
UNC_CBO_XSNP_RESPONSE.MISS_EVICTION",
+ "MetricGroup": "Test",
+ "MetricName": "test_metric_inc"
}
]


And get this:

john@localhost:~/linux/tools/perf> sudo ./perf stat -v -M
test_metric_inc sleep 1
Using CPUID GenuineIntel-6-3D-4
metric expr unc_cbo_xsnp_response.miss_xcore *
unc_cbo_xsnp_response.miss_eviction for test_metric_inc
found event unc_cbo_xsnp_response.miss_eviction
found event unc_cbo_xsnp_response.miss_xcore
adding
{unc_cbo_xsnp_response.miss_eviction,unc_cbo_xsnp_response.miss_xcore}:W
unc_cbo_xsnp_response.miss_eviction -> uncore_cbox_1/umask=0x81,event=0x22/
unc_cbo_xsnp_response.miss_eviction -> uncore_cbox_0/umask=0x81,event=0x22/
unc_cbo_xsnp_response.miss_xcore -> uncore_cbox_1/umask=0x41,event=0x22/
unc_cbo_xsnp_response.miss_xcore -> uncore_cbox_0/umask=0x41,event=0x22/
Cannot resolve test_metric_inc: unc_cbo_xsnp_response.miss_xcore *
unc_cbo_xsnp_response.miss_eviction
task-clock: 688876 688876 688876
context-switches: 2 688876 688876
cpu-migrations: 0 688876 688876
page-faults: 69 688876 688876
cycles: 2101719 695690 695690
instructions: 1180534 695690 695690
branches: 249450 695690 695690
branch-misses: 10815 695690 695690

Performance counter stats for 'sleep 1':

0.69 msec task-clock # 0.001 CPUs
utilized
2 context-switches # 0.003 M/sec

0 cpu-migrations # 0.000 K/sec

69 page-faults # 0.100 M/sec

2,101,719 cycles # 3.051 GHz

1,180,534 instructions # 0.56 insn per
cycle
249,450 branches # 362.112 M/sec

10,815 branch-misses # 4.34% of all
branches

1.001177693 seconds time elapsed

0.001149000 seconds user
0.000000000 seconds sys


john@localhost:~/linux/tools/perf>


Any idea what is going wrong here, before I have to dive in? The issue
seems to be this named commit.

Thanks,
John

A metric group contains multiple metrics. These metrics may use the same
events. If metrics use separate events then it leads to more
multiplexing and overall metric counts fail to sum to 100%.
Modify how metrics are associated with events so that if the events in
an earlier group satisfy the current metric, the same events are used.
A record of used events is kept and at the end of processing unnecessary
events are eliminated.

Before:
.