Re: [PATCH 49/49] perf evsel: Adjust hybrid event and global event mixed group

From: Jin, Yao
Date: Mon Feb 08 2021 - 19:50:03 EST


Hi Arnaldo,

On 2/9/2021 3:12 AM, Arnaldo Carvalho de Melo wrote:
Em Mon, Feb 08, 2021 at 07:25:46AM -0800, kan.liang@xxxxxxxxxxxxxxx escreveu:
From: Jin Yao <yao.jin@xxxxxxxxxxxxxxx>

A group that mixes hybrid events and global events is allowed. For example,
the group leader is 'cpu-clock' and the group member is 'cpu_atom/cycles/'.

e.g.
perf stat -e '{cpu-clock,cpu_atom/cycles/}' -a

The challenge is that their available cpus do not fully match.
For example, 'cpu-clock' is available on CPU0-CPU23, but 'cpu_atom/cycles/'
is available only on CPU16-CPU23.

When getting the group id for a group member, we must be very careful
because the cpu for 'cpu-clock' is not the same as the cpu for
'cpu_atom/cycles/'. The cpu here is actually an index into evsel->core.cpus,
not the real CPU ID.
e.g. cpu0 for 'cpu-clock' is CPU0, but cpu0 for 'cpu_atom/cycles/' is CPU16.
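
The index-versus-CPU-ID distinction can be illustrated with a small
standalone sketch. This is not the perf code: struct cpu_map and the three
helpers below are hypothetical stand-ins, assuming a cpu map is simply a
list of real CPU IDs addressed by index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cpu map: a list of real CPU IDs. */
struct cpu_map {
	int nr;
	const int *ids;
};

/* Return the real CPU ID stored at index idx, or -1 if out of range. */
static int cpu_map_cpu(const struct cpu_map *map, int idx)
{
	assert(map != NULL);
	if (idx < 0 || idx >= map->nr)
		return -1;
	return map->ids[idx];
}

/* Return the index of real CPU ID cpu in the map, or -1 if absent. */
static int cpu_map_idx(const struct cpu_map *map, int cpu)
{
	assert(map != NULL);
	for (int i = 0; i < map->nr; i++) {
		if (map->ids[i] == cpu)
			return i;
	}
	return -1;
}

/*
 * Translate an index in 'from' into the index of the same real CPU
 * in 'to'. Returns -1 when that CPU is not present in 'to'.
 */
static int cpu_idx_translate(const struct cpu_map *from,
			     const struct cpu_map *to, int idx)
{
	int cpu = cpu_map_cpu(from, idx);

	if (cpu < 0)
		return -1;
	return cpu_map_idx(to, cpu);
}
```

With a global map covering CPU0-CPU23 and an atom map covering CPU16-CPU23,
translating atom index 0 into the global map would give index 16, while
translating global index 0 into the atom map would give -1.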

Another challenge is the group read. The events in a group may not be
available on all cpus. For example, the leader is a software event that is
available on CPU0-CPU1, but the group member is a hybrid event that is
only available on CPU1. For CPU0 we have only one event, but for CPU1
we have two events. So we need to change the read size according to
the real number of events on that cpu.
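
As a rough illustration of why the read size depends on the number of
events, here is a minimal sketch modeled on the group read layout described
in perf_event_open(2). The FMT_* names and group_read_size() are
hypothetical local definitions (the bit values are assumed to mirror the
PERF_FORMAT_* flags), not the helper the patch calls.

```c
#include <stddef.h>
#include <stdint.h>

/* Local copies of the read_format bits; assumed to mirror PERF_FORMAT_*. */
enum {
	FMT_TOTAL_TIME_ENABLED = 1U << 0,
	FMT_TOTAL_TIME_RUNNING = 1U << 1,
	FMT_ID                 = 1U << 2,
	FMT_GROUP              = 1U << 3,
};

/*
 * Size in bytes of one read() on a leader fd when nr_events events
 * of the group exist on that cpu.
 */
static size_t group_read_size(uint64_t read_format, int nr_events)
{
	size_t entry = sizeof(uint64_t);	/* per-event value */
	size_t size = 0;
	int nr = 1;

	if (read_format & FMT_TOTAL_TIME_ENABLED)
		size += sizeof(uint64_t);
	if (read_format & FMT_TOTAL_TIME_RUNNING)
		size += sizeof(uint64_t);
	if (read_format & FMT_ID)
		entry += sizeof(uint64_t);	/* per-event id */
	if (read_format & FMT_GROUP) {
		nr = nr_events;
		size += sizeof(uint64_t);	/* leading nr field */
	}

	return size + entry * (size_t)nr;
}
```

With read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP, a cpu with
only the leader needs 40 bytes, while a cpu with leader plus one member
needs 56 bytes.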

Let's look at some examples:

root@otcpl-adl-s-2:~# ./perf stat -e '{cpu-clock,cpu_atom/cycles/}' -a -vvv -- sleep 1
Control descriptor is not initialized
------------------------------------------------------------
perf_event_attr:
type 1
size 120
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
disabled 1
inherit 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 = 3
sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8 = 4
sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8 = 5
sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8 = 7
sys_perf_event_open: pid -1 cpu 4 group_fd -1 flags 0x8 = 8
sys_perf_event_open: pid -1 cpu 5 group_fd -1 flags 0x8 = 9
sys_perf_event_open: pid -1 cpu 6 group_fd -1 flags 0x8 = 10
sys_perf_event_open: pid -1 cpu 7 group_fd -1 flags 0x8 = 11
sys_perf_event_open: pid -1 cpu 8 group_fd -1 flags 0x8 = 12
sys_perf_event_open: pid -1 cpu 9 group_fd -1 flags 0x8 = 13
sys_perf_event_open: pid -1 cpu 10 group_fd -1 flags 0x8 = 14
sys_perf_event_open: pid -1 cpu 11 group_fd -1 flags 0x8 = 15
sys_perf_event_open: pid -1 cpu 12 group_fd -1 flags 0x8 = 16
sys_perf_event_open: pid -1 cpu 13 group_fd -1 flags 0x8 = 17
sys_perf_event_open: pid -1 cpu 14 group_fd -1 flags 0x8 = 18
sys_perf_event_open: pid -1 cpu 15 group_fd -1 flags 0x8 = 19
sys_perf_event_open: pid -1 cpu 16 group_fd -1 flags 0x8 = 20
sys_perf_event_open: pid -1 cpu 17 group_fd -1 flags 0x8 = 21
sys_perf_event_open: pid -1 cpu 18 group_fd -1 flags 0x8 = 22
sys_perf_event_open: pid -1 cpu 19 group_fd -1 flags 0x8 = 23
sys_perf_event_open: pid -1 cpu 20 group_fd -1 flags 0x8 = 24
sys_perf_event_open: pid -1 cpu 21 group_fd -1 flags 0x8 = 25
sys_perf_event_open: pid -1 cpu 22 group_fd -1 flags 0x8 = 26
sys_perf_event_open: pid -1 cpu 23 group_fd -1 flags 0x8 = 27
------------------------------------------------------------
perf_event_attr:
type 6
size 120
config 0xa00000000
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
inherit 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid -1 cpu 16 group_fd 20 flags 0x8 = 28
sys_perf_event_open: pid -1 cpu 17 group_fd 21 flags 0x8 = 29
sys_perf_event_open: pid -1 cpu 18 group_fd 22 flags 0x8 = 30
sys_perf_event_open: pid -1 cpu 19 group_fd 23 flags 0x8 = 31
sys_perf_event_open: pid -1 cpu 20 group_fd 24 flags 0x8 = 32
sys_perf_event_open: pid -1 cpu 21 group_fd 25 flags 0x8 = 33
sys_perf_event_open: pid -1 cpu 22 group_fd 26 flags 0x8 = 34
sys_perf_event_open: pid -1 cpu 23 group_fd 27 flags 0x8 = 35
cpu-clock: 0: 1001661765 1001663044 1001663044
cpu-clock: 1: 1001659407 1001659885 1001659885
cpu-clock: 2: 1001646087 1001647302 1001647302
cpu-clock: 3: 1001645168 1001645550 1001645550
cpu-clock: 4: 1001645052 1001646102 1001646102
cpu-clock: 5: 1001643719 1001644472 1001644472
cpu-clock: 6: 1001641893 1001642859 1001642859
cpu-clock: 7: 1001640524 1001641036 1001641036
cpu-clock: 8: 1001637596 1001638076 1001638076
cpu-clock: 9: 1001638121 1001638200 1001638200
cpu-clock: 10: 1001635825 1001636915 1001636915
cpu-clock: 11: 1001633722 1001634276 1001634276
cpu-clock: 12: 1001687133 1001686941 1001686941
cpu-clock: 13: 1001693663 1001693317 1001693317
cpu-clock: 14: 1001693381 1001694407 1001694407
cpu-clock: 15: 1001691865 1001692321 1001692321
cpu-clock: 16: 1001696621 1001696550 1001696550
cpu-clock: 17: 1001699963 1001699822 1001699822
cpu-clock: 18: 1001701938 1001701850 1001701850
cpu-clock: 19: 1001699298 1001699214 1001699214
cpu-clock: 20: 1001691550 1001691026 1001691026
cpu-clock: 21: 1001688348 1001688212 1001688212
cpu-clock: 22: 1001684907 1001684799 1001684799
cpu-clock: 23: 1001680840 1001680780 1001680780
cycles: 0: 28175 1001696550 1001696550
cycles: 1: 403323 1001699822 1001699822
cycles: 2: 35905 1001701850 1001701850
cycles: 3: 36755 1001699214 1001699214
cycles: 4: 33757 1001691026 1001691026
cycles: 5: 37146 1001688212 1001688212
cycles: 6: 35483 1001684799 1001684799
cycles: 7: 38600 1001680780 1001680780
cpu-clock: 24040038386 24040046956 24040046956
cycles: 649144 8013542253 8013542253

Performance counter stats for 'system wide':

24,040.04 msec cpu-clock # 23.976 CPUs utilized
649,144 cycles [cpu_atom] # 0.027 M/sec

1.002683706 seconds time elapsed

For cpu_atom/cycles/, cpu16-cpu23 are opened with a valid group fd
(cpu-clock's fd on that cpu). In the counting results, cpu-clock is
aggregated over 24 cpus and cpu_atom/cycles/ over 8 cpus. That's expected.

But if the event order is changed, e.g. '{cpu_atom/cycles/,cpu-clock}',
more work remains to be done.

root@otcpl-adl-s-2:~# ./perf stat -e '{cpu_atom/cycles/,cpu-clock}' -a -vvv -- sleep 1
Control descriptor is not initialized
------------------------------------------------------------
perf_event_attr:
type 6
size 120
config 0xa00000000
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
disabled 1
inherit 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid -1 cpu 16 group_fd -1 flags 0x8 = 3
sys_perf_event_open: pid -1 cpu 17 group_fd -1 flags 0x8 = 4
sys_perf_event_open: pid -1 cpu 18 group_fd -1 flags 0x8 = 5
sys_perf_event_open: pid -1 cpu 19 group_fd -1 flags 0x8 = 7
sys_perf_event_open: pid -1 cpu 20 group_fd -1 flags 0x8 = 8
sys_perf_event_open: pid -1 cpu 21 group_fd -1 flags 0x8 = 9
sys_perf_event_open: pid -1 cpu 22 group_fd -1 flags 0x8 = 10
sys_perf_event_open: pid -1 cpu 23 group_fd -1 flags 0x8 = 11
------------------------------------------------------------
perf_event_attr:
type 1
size 120
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
inherit 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 = 12
sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8 = 13
sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8 = 14
sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8 = 15
sys_perf_event_open: pid -1 cpu 4 group_fd -1 flags 0x8 = 16
sys_perf_event_open: pid -1 cpu 5 group_fd -1 flags 0x8 = 17
sys_perf_event_open: pid -1 cpu 6 group_fd -1 flags 0x8 = 18
sys_perf_event_open: pid -1 cpu 7 group_fd -1 flags 0x8 = 19
sys_perf_event_open: pid -1 cpu 8 group_fd -1 flags 0x8 = 20
sys_perf_event_open: pid -1 cpu 9 group_fd -1 flags 0x8 = 21
sys_perf_event_open: pid -1 cpu 10 group_fd -1 flags 0x8 = 22
sys_perf_event_open: pid -1 cpu 11 group_fd -1 flags 0x8 = 23
sys_perf_event_open: pid -1 cpu 12 group_fd -1 flags 0x8 = 24
sys_perf_event_open: pid -1 cpu 13 group_fd -1 flags 0x8 = 25
sys_perf_event_open: pid -1 cpu 14 group_fd -1 flags 0x8 = 26
sys_perf_event_open: pid -1 cpu 15 group_fd -1 flags 0x8 = 27
sys_perf_event_open: pid -1 cpu 16 group_fd 3 flags 0x8 = 28
sys_perf_event_open: pid -1 cpu 17 group_fd 4 flags 0x8 = 29
sys_perf_event_open: pid -1 cpu 18 group_fd 5 flags 0x8 = 30
sys_perf_event_open: pid -1 cpu 19 group_fd 7 flags 0x8 = 31
sys_perf_event_open: pid -1 cpu 20 group_fd 8 flags 0x8 = 32
sys_perf_event_open: pid -1 cpu 21 group_fd 9 flags 0x8 = 33
sys_perf_event_open: pid -1 cpu 22 group_fd 10 flags 0x8 = 34
sys_perf_event_open: pid -1 cpu 23 group_fd 11 flags 0x8 = 35
cycles: 0: 422260 1001993637 1001993637
cycles: 1: 631309 1002039934 1002039934
cycles: 2: 309501 1002018065 1002018065
cycles: 3: 119279 1002040811 1002040811
cycles: 4: 89389 1002039312 1002039312
cycles: 5: 155437 1002054794 1002054794
cycles: 6: 92420 1002051141 1002051141
cycles: 7: 96017 1002073659 1002073659
cpu-clock: 0: 0 0 0
cpu-clock: 1: 0 0 0
cpu-clock: 2: 0 0 0
cpu-clock: 3: 0 0 0
cpu-clock: 4: 0 0 0
cpu-clock: 5: 0 0 0
cpu-clock: 6: 0 0 0
cpu-clock: 7: 0 0 0
cpu-clock: 8: 0 0 0
cpu-clock: 9: 0 0 0
cpu-clock: 10: 0 0 0
cpu-clock: 11: 0 0 0
cpu-clock: 12: 0 0 0
cpu-clock: 13: 0 0 0
cpu-clock: 14: 0 0 0
cpu-clock: 15: 0 0 0
cpu-clock: 16: 1001997706 1001993637 1001993637
cpu-clock: 17: 1002040524 1002039934 1002039934
cpu-clock: 18: 1002018570 1002018065 1002018065
cpu-clock: 19: 1002041360 1002040811 1002040811
cpu-clock: 20: 1002044731 1002039312 1002039312
cpu-clock: 21: 1002055355 1002054794 1002054794
cpu-clock: 22: 1002051659 1002051141 1002051141
cpu-clock: 23: 1002074150 1002073659 1002073659
cycles: 1915612 8016311353 8016311353
cpu-clock: 8016324055 8016311353 8016311353

Performance counter stats for 'system wide':

1,915,612 cycles [cpu_atom] # 0.239 M/sec

I suggested having something like this in a previous patch, when
creating two 'instructions', etc. events, one for cpu_core and the other
for cpu_atom, perhaps even using the PMU style, i.e.

1,915,612 cpu_atom/cycles/ # 0.239 M/sec


OK, I will move this function to the previous patch.

For the "cycles [cpu_atom]" style, we don't need more code; we just set
'stat_config.no_merge = true'.

For the "cpu_atom/cycles/" style, please let me think about it.

Thanks
Jin Yao

8,016.32 msec cpu-clock # 7.996 CPUs utilized

1.002545027 seconds time elapsed

For cpu-clock, cpu16-cpu23 are opened with a valid group fd
(cpu_atom/cycles/'s fd on that cpu). In the counting results,
cpu_atom/cycles/ is aggregated over 8 cpus, which is correct. But cpu-clock
is also aggregated over only 8 cpus (cpu16-cpu23, not all cpus), so the code
should be improved. For now a warning is displayed:
"WARNING: for cpu-clock, some CPU counts not read".
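
The asymmetry in the second log can be sketched as follows: a global member
only finds a leader fd on the cpus where the hybrid leader exists, and
elsewhere it is opened with group_fd -1, so it is not part of any group
read. This is a simplified illustration, not the patch's get_group_fd();
struct evsel_fds and leader_fd_for_cpu() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-event table: real CPU IDs and the fd opened on each. */
struct evsel_fds {
	int nr;
	const int *cpus;
	const int *fds;
};

/*
 * Return the leader's fd on real CPU ID cpu, or -1 when the leader
 * was not opened on that cpu (the member is then opened ungrouped).
 */
static int leader_fd_for_cpu(const struct evsel_fds *leader, int cpu)
{
	assert(leader != NULL);
	for (int i = 0; i < leader->nr; i++) {
		if (leader->cpus[i] == cpu)
			return leader->fds[i];
	}
	return -1;
}
```

Using the fds from the log above (leader cpu_atom/cycles/ on cpu16-cpu23
with fds 3, 4, 5, 7, 8, 9, 10, 11), the lookup for cpu 19 gives 7 and the
lookup for cpu 0 gives -1, matching the group_fd values shown for cpu-clock.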

Reviewed-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
Signed-off-by: Jin Yao <yao.jin@xxxxxxxxxxxxxxx>
---
tools/perf/util/evsel.c | 105 +++++++++++++++++++++++++++++++++++++++++++++---
tools/perf/util/stat.h | 1 +
2 files changed, 101 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 61508cf..65c8cfc8 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1453,15 +1453,26 @@ static void evsel__set_count(struct evsel *counter, int cpu, int thread, u64 val
perf_counts__set_loaded(counter->counts, cpu, thread, true);
}
-static int evsel__process_group_data(struct evsel *leader, int cpu, int thread, u64 *data)
+static int evsel_cpuid_match(struct evsel *evsel1, struct evsel *evsel2,
+ int cpu)
+{
+ int cpuid;
+
+ cpuid = perf_cpu_map__cpu(evsel1->core.cpus, cpu);
+ return perf_cpu_map__idx(evsel2->core.cpus, cpuid);
+}
+
+static int evsel__process_group_data(struct evsel *leader, int cpu, int thread,
+ u64 *data, int nr_members)
{
u64 read_format = leader->core.attr.read_format;
struct sample_read_value *v;
u64 nr, ena = 0, run = 0, i;
+ int idx;
nr = *data++;
- if (nr != (u64) leader->core.nr_members)
+ if (nr != (u64) nr_members)
return -EINVAL;
if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
@@ -1481,24 +1492,85 @@ static int evsel__process_group_data(struct evsel *leader, int cpu, int thread,
if (!counter)
return -EINVAL;
- evsel__set_count(counter, cpu, thread, v[i].value, ena, run);
+ if (evsel__is_hybrid_event(counter) ||
+ evsel__is_hybrid_event(leader)) {
+ idx = evsel_cpuid_match(leader, counter, cpu);
+ if (idx == -1)
+ return -EINVAL;
+ } else
+ idx = cpu;
+
+ evsel__set_count(counter, idx, thread, v[i].value, ena, run);
}
return 0;
}
+static int hybrid_read_size(struct evsel *leader, int cpu, int *nr_members)
+{
+ struct evsel *pos;
+ int nr = 1, back, new_size = 0, idx;
+
+ for_each_group_member(pos, leader) {
+ idx = evsel_cpuid_match(leader, pos, cpu);
+ if (idx != -1)
+ nr++;
+ }
+
+ if (nr != leader->core.nr_members) {
+ back = leader->core.nr_members;
+ leader->core.nr_members = nr;
+ new_size = perf_evsel__read_size(&leader->core);
+ leader->core.nr_members = back;
+ }
+
+ *nr_members = nr;
+ return new_size;
+}
+
static int evsel__read_group(struct evsel *leader, int cpu, int thread)
{
struct perf_stat_evsel *ps = leader->stats;
u64 read_format = leader->core.attr.read_format;
int size = perf_evsel__read_size(&leader->core);
+ int new_size, nr_members;
u64 *data = ps->group_data;
if (!(read_format & PERF_FORMAT_ID))
return -EINVAL;
- if (!evsel__is_group_leader(leader))
+ if (!evsel__is_group_leader(leader)) {
+ if (evsel__is_hybrid_event(leader->leader) &&
+ !evsel__is_hybrid_event(leader)) {
+ /*
+ * The group leader is hybrid event and it's
+ * only available on part of cpus. But the group
+ * member are available on all cpus. TODO:
+ * read the counts on the rest of cpus for group
+ * member.
+ */
+ WARN_ONCE(1, "WARNING: for %s, some CPU counts "
+ "not read\n", leader->name);
+ return 0;
+ }
return -EINVAL;
+ }
+
+ /*
+ * For example the leader is a software event and it's available on
+ * cpu0-cpu1, but the group member is a hybrid event and it's only
+ * available on cpu1. For cpu0, we have only one event, but for cpu1
+ * we have two events. So we need to change the read size according to
+ * the real number of events on a given cpu.
+ */
+ new_size = hybrid_read_size(leader, cpu, &nr_members);
+ if (new_size)
+ size = new_size;
+
+ if (ps->group_data && ps->group_data_size < size) {
+ zfree(&ps->group_data);
+ data = NULL;
+ }
if (!data) {
data = zalloc(size);
@@ -1506,6 +1578,7 @@ static int evsel__read_group(struct evsel *leader, int cpu, int thread)
return -ENOMEM;
ps->group_data = data;
+ ps->group_data_size = size;
}
if (FD(leader, cpu, thread) < 0)
@@ -1514,7 +1587,7 @@ static int evsel__read_group(struct evsel *leader, int cpu, int thread)
if (readn(FD(leader, cpu, thread), data, size) <= 0)
return -errno;
- return evsel__process_group_data(leader, cpu, thread, data);
+ return evsel__process_group_data(leader, cpu, thread, data, nr_members);
}
int evsel__read_counter(struct evsel *evsel, int cpu, int thread)
@@ -1561,6 +1634,28 @@ static int get_group_fd(struct evsel *evsel, int cpu, int thread)
*/
BUG_ON(!leader->core.fd);
+ /*
+ * If leader is not hybrid event, it's available on
+ * all cpus (e.g. software event). But hybrid evsel
+ * member is only available on part of cpus. So need
+ * to get the leader's fd from correct cpu.
+ */
+ if (evsel__is_hybrid_event(evsel) &&
+ !evsel__is_hybrid_event(leader)) {
+ cpu = evsel_cpuid_match(evsel, leader, cpu);
+ BUG_ON(cpu == -1);
+ }
+
+ /*
+ * Leader is hybrid event but member is global event.
+ */
+ if (!evsel__is_hybrid_event(evsel) &&
+ evsel__is_hybrid_event(leader)) {
+ cpu = evsel_cpuid_match(evsel, leader, cpu);
+ if (cpu == -1)
+ return -1;
+ }
+
fd = FD(leader, cpu, thread);
BUG_ON(fd == -1);
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 80f6715..b96168c 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -46,6 +46,7 @@ struct perf_stat_evsel {
struct stats res_stats[3];
enum perf_stat_evsel_id id;
u64 *group_data;
+ int group_data_size;
};
enum aggr_mode {
--
2.7.4