Re: [LKP] Re: [perf vendor events] 3f5f0df7bf: perf-sanity-tests.perf_all_metrics_test.fail

From: Ian Rogers
Date: Thu Apr 14 2022 - 12:39:00 EST


On Wed, Apr 13, 2022 at 11:17 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>
>
>
> On 4/13/2022 1:09 PM, Ian Rogers wrote:
> > On Wed, Apr 13, 2022 at 9:37 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 4/13/2022 12:03 PM, Ian Rogers wrote:
> >>> 3) Weak group doesn't fall back to no group:
> >>
> >> That's because the group validation code doesn't take pinned events,
> >> such as the NMI watchdog, into account.
> >>
> >> I proposed a kernel patch to fix it, but it's rejected. It should be
> >> hard to find a generic way to fix it from the kernel side.
> >> https://lore.kernel.org/lkml/1565977750-76693-1-git-send-email-kan.liang@xxxxxxxxxxxxxxx/
> >>
> >> Maybe we can workaround it from the perf tool side?
> >> For example, for each weak group with cycles event and NMI watchdog is
> >> enabled, add an extra cycles event when opening the group. If the open
> >> fails with the extra cycles event, fall back to no group. After the
> >> extra cycles event check, remove the extra cycles.
> >>
> >> What do you think?
> >
> > Thanks Kan, it is a shame the kernel support is lacking here. I'm not
> > sure what you are proposing for the perf tool to do. So:
> >
> >> for each weak group with cycles event and NMI watchdog
> >
> > Okay, let's try Branching_Overhead as mentioned in this report - but
> > the event is CPU_CLK_UNHALTED.THREAD here :-/
> >
> >> add an extra cycles event when opening the group
> >
> > So the perf_event_open doesn't fail here for me:
> > $ perf stat -e '{BR_INST_RETIRED.NEAR_CALL,BR_INST_RETIRED.NEAR_TAKEN,BR_INST_RETIRED.NOT_TAKEN,BR_INST_RETIRED.CONDITIONAL,CPU_CLK_UNHALTED.THREAD},cycles'
> > -a sleep 1
> >
>
> No, I mean modifying the perf tool code and add an extra cycles in the
> weak group.
>
> Here is a very initial POC patch, which should prove the idea.

So I was unaware of this behavior, great find! However, it seems
difficult to exploit. Here is the extra cycles "fixing" a weak group:
```
$ perf stat -e '{BR_INST_RETIRED.NEAR_CALL,BR_INST_RETIRED.NEAR_TAKEN,BR_INST_RETIRED.NOT_TAKEN,BR_INST_RETIRED.CONDITIONAL,cycles,cycles}:W'
-a sleep 1

Performance counter stats for 'system wide':

18,782,301 BR_INST_RETIRED.NEAR_CALL
(66.64%)
153,325,072 BR_INST_RETIRED.NEAR_TAKEN
(66.64%)
75,443,263 BR_INST_RETIRED.NOT_TAKEN
(66.64%)
156,568,769 BR_INST_RETIRED.CONDITIONAL
(66.66%)
1,870,812,571 cycles
(66.72%)
1,890,508,326 cycles
(66.70%)

1.006371081 seconds time elapsed
```

But if the original group has 1 less counter we will fail with the
duplicate cycles:
```
$ perf stat -e '{BR_INST_RETIRED.NEAR_CALL,BR_INST_RETIRED.NEAR_TAKEN,BR_INST_RETIRED.NOT_TAKEN,cycles,cycles}:W'
-a sleep 1
Performance counter stats for 'system wide':

<not counted> BR_INST_RETIRED.NEAR_CALL
(0.00%)
<not counted> BR_INST_RETIRED.NEAR_TAKEN
(0.00%)
<not counted> BR_INST_RETIRED.NOT_TAKEN
(0.00%)
<not counted> cycles
(0.00%)
<not counted> cycles
(0.00%)

1.005599088 seconds time elapsed

Some events weren't counted. Try disabling the NMI watchdog:
echo 0 > /proc/sys/kernel/nmi_watchdog
perf stat ...
echo 1 > /proc/sys/kernel/nmi_watchdog
The events in group usually have to be from the same PMU. Try
reorganizing the group.
```

If we add two extra cycles or the original group is smaller then it is "fixed":
```
$ perf stat -e '{BR_INST_RETIRED.NEAR_CALL,BR_INST_RETIRED.NEAR_TAKEN,BR_INST_RETIRED.NOT_TAKEN,cycles}:W'
-a sleep 1

Performance counter stats for 'system wide':

20,378,789 BR_INST_RETIRED.NEAR_CALL
168,420,963 BR_INST_RETIRED.NEAR_TAKEN
96,330,608 BR_INST_RETIRED.NOT_TAKEN
1,652,230,042 cycles

1.008757590 seconds time elapsed

$ perf stat -e '{BR_INST_RETIRED.NEAR_CALL,BR_INST_RETIRED.NEAR_TAKEN,BR_INST_RETIRED.NOT_TAKEN,cycles,cycles,cycles}:W'
-a sleep 1

Performance counter stats for 'system wide':

37,696,638 BR_INST_RETIRED.NEAR_CALL
(66.62%)
298,535,151 BR_INST_RETIRED.NEAR_TAKEN
(66.63%)
297,011,663 BR_INST_RETIRED.NOT_TAKEN
(66.63%)
3,155,711,474 cycles
(66.65%)
3,194,919,959 cycles
(66.74%)
3,126,664,102 cycles
(66.72%)

1.006237962 seconds time elapsed
```

So the extra cycles is needed to fix weak groups when the nmi watchdog
is enabled and the group is an architecture dependent size.

Thanks,
Ian

> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index b7fe88beb584..782c3d7f1b32 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -71,7 +71,9 @@
> #include "util/bpf_counter.h"
> #include "util/iostat.h"
> #include "util/pmu-hybrid.h"
> +#include "util/util.h"
> #include "asm/bug.h"
> +#include "perf-sys.h"
>
> #include <linux/time64.h>
> #include <linux/zalloc.h>
> @@ -777,6 +779,8 @@ static enum counter_recovery
> stat_handle_error(struct evsel *counter)
> return COUNTER_FATAL;
> }
>
> +#define FD(e, x, y) (*(int *)xyarray__entry(e->core.fd, x, y))
> +
> static int __run_perf_stat(int argc, const char **argv, int run_idx)
> {
> int interval = stat_config.interval;
> @@ -793,6 +797,7 @@ static int __run_perf_stat(int argc, const char
> **argv, int run_idx)
> struct affinity saved_affinity, *affinity = NULL;
> int err;
> bool second_pass = false;
> + bool has_cycles = false;
>
> if (forks) {
> if (evlist__prepare_workload(evsel_list, &target, argv, is_pipe,
> workload_exec_failed_signal) < 0) {
> @@ -821,6 +826,8 @@ static int __run_perf_stat(int argc, const char
> **argv, int run_idx)
> evlist__for_each_cpu(evlist_cpu_itr, evsel_list, affinity) {
> counter = evlist_cpu_itr.evsel;
>
> + if (counter->core.attr.config == 0x3c)
> + has_cycles = true;
> /*
> * bperf calls evsel__open_per_cpu() in bperf__load(), so
> * no need to call it again here.
> @@ -867,6 +874,24 @@ static int __run_perf_stat(int argc, const char
> **argv, int run_idx)
> counter->supported = true;
> }
>
> + //make it model specific. need to move to a better place
> + if (counter->supported && !second_pass && has_cycles &&
> counter->weak_group && sysctl__nmi_watchdog_enabled()) {
> + struct evsel *leader = evsel__leader(counter);
> + int group_fd = FD(leader, 0, 0);
> + struct evsel *evsel;
> + int fd;
> +
> + evsel = evsel__new_cycles(0, PERF_TYPE_HARDWARE,
> PERF_COUNT_HW_CPU_CYCLES);
> + fd = sys_perf_event_open(&evsel->core.attr, -1, 0, group_fd, 0x8);
> +
> + if (fd < 0) {
> + evlist__reset_weak_group(evsel_list, counter, false);
> + second_pass = true;
> + } else {
> + evsel__close(evsel);
> + }
> + }
> +
> if (second_pass) {
> /*
> * Now redo all the weak group after closing them,
>
> With the above patch,
>
> $ sudo ./perf stat -e
> '{BR_INST_RETIRED.NEAR_CALL,BR_INST_RETIRED.NEAR_TAKEN,BR_INST_RETIRED.NOT_TAKEN,BR_INST_RETIRED.CONDITIONAL,CPU_CLK_UNHALTED.THREAD}:W'
> -C0 sleep 1
>
> Performance counter stats for 'CPU(s) 0':
>
> 913,369 BR_INST_RETIRED.NEAR_CALL
> (79.95%)
> 3,648,433 BR_INST_RETIRED.NEAR_TAKEN
> (80.00%)
> 2,481,976 BR_INST_RETIRED.NOT_TAKEN
> (80.05%)
> 3,696,298 BR_INST_RETIRED.CONDITIONAL
> (80.04%)
> 27,792,053 CPU_CLK_UNHALTED.THREAD
> (79.96%)
>
> 1.002224709 seconds time elapsed
>
>
> Thanks,
> Kan