Re: [PATCH v2] perf stat: Introduce skippable evsels
From: Liang, Kan
Date: Mon Apr 17 2023 - 13:31:27 EST
On 2023-04-17 11:59 a.m., Ian Rogers wrote:
> On Mon, Apr 17, 2023 at 6:58 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>
>>
>>
>> On 2023-04-14 7:03 p.m., Ian Rogers wrote:
>>> On Fri, Apr 14, 2023 at 11:07 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> On 2023-04-14 1:19 a.m., Ian Rogers wrote:
>>>>> Perf stat with no arguments will use default events and metrics. These
>>>>> events may fail to open even with kernel and hypervisor disabled. When
>>>>> these fail then the permissions error appears even though they were
>>>>> implicitly selected. This is particularly a problem with the automatic
>>>>> selection of the TopdownL1 metric group on certain architectures like
>>>>> Skylake:
>>>>>
>>>>> ```
>>>>> $ perf stat true
>>>>> Error:
>>>>> Access to performance monitoring and observability operations is limited.
>>>>> Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
>>>>> access to performance monitoring and observability operations for processes
>>>>> without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
>>>>> More information can be found at 'Perf events and tool security' document:
>>>>> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
>>>>> perf_event_paranoid setting is 2:
>>>>> -1: Allow use of (almost) all events by all users
>>>>> Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>>>>>> = 0: Disallow raw and ftrace function tracepoint access
>>>>>> = 1: Disallow CPU event access
>>>>>> = 2: Disallow kernel profiling
>>>>> To make the adjusted perf_event_paranoid setting permanent preserve it
>>>>> in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
>>>>> ```
>>>>>
>>>>> This patch adds skippable evsels that when they fail to open won't
>>>>> fail and won't appear in output. The TopdownL1 events, from the metric
>>>>> group, are marked as skippable. This turns the failure above to:
>>>>>
>>>>> ```
>>>>> $ perf stat true
>>>>>
>>>>> Performance counter stats for 'true':
>>>>>
>>>>> 1.26 msec task-clock:u # 0.328 CPUs utilized
>>>>> 0 context-switches:u # 0.000 /sec
>>>>> 0 cpu-migrations:u # 0.000 /sec
>>>>> 49 page-faults:u # 38.930 K/sec
>>>>> 176,449 cycles:u # 0.140 GHz (48.99%)
>>>>
>>>> Multiplexing?
>>>>
>>>> Thanks,
>>>> Kan
>>>
>>> I may have been running a test in the background otherwise I can't
>>> explain it. Repeating the test yields no multiplexing:
>>
>>
>> The above multiplexing should be on a Skylake (since there is no
>> topdownL1 printed), but the test which you repeat seems on a Tigerlake
>> (has topdownL1). Could you please double check on a Skylake?
>
> In the best circumstances (ie no EBS_Mode, no other events, nmi
> watchdog disabled) Skylake has multiplexing for TopdownL1:
>
No I mean the perf stat default on Skylake.
With this patch, on a Skylake machine, there should be no TopdownL1
event and no multiplexing.
>From your test result in the v2 description, we can see that there is no
TopdownL1, which is good and expected. However, there is a (48.99%) with
cycles:u event, which implies multiplexing. Could you please check
what's the problem here?
Also, if it's because of the backgroud, all the events should be
multiplexing. But it looks like only cycle:u has multiplexing. Other
events, instructions:u, branches:u and branch-misses:u work without
multiplexing. That's very strange.
> ```
> $ sudo perf stat --metric-no-group -M TopdownL1 --metric-no-group -a sleep 1
>
> Performance counter stats for 'system wide':
>
> 500,145,019 INST_RETIRED.ANY # 14.2 %
> tma_retiring (71.07%)
> 2,402,640,337 CPU_CLK_UNHALTED.THREAD_ANY # 41.1 %
> tma_frontend_bound
> # 36.2 %
> tma_backend_bound
> # 8.4 %
> tma_bad_speculation (85.63%)
> 1,976,769,761 IDQ_UOPS_NOT_DELIVERED.CORE
> (85.81%)
> 114,069,133 INT_MISC.RECOVERY_CYCLES_ANY
> (85.83%)
> 684,605,487 UOPS_RETIRED.RETIRE_SLOTS
> (85.83%)
> 49,695,823 UOPS_RETIRED.MACRO_FUSED
> (85.83%)
> 860,603,122 UOPS_ISSUED.ANY
> (56.70%)
>
> 1.014011174 seconds time elapsed
> ```
>
> but this isn't a regression:
> https://lore.kernel.org/lkml/20200520072814.128267-1-irogers@xxxxxxxxxx/
>
> I think this is off-topic for this change.
>
> Thanks,
> Ian
>
>> Thanks,
>> Kan
>>>
>>> ```
>>> $ perf stat true
>>>
>>> Performance counter stats for 'true':
>>>
>>> 0.78 msec task-clock:u # 0.383
>>> CPUs utilized
>>> 0 context-switches:u # 0.000
>>> /sec
>>> 0 cpu-migrations:u # 0.000
>>> /sec
>>> 47 page-faults:u # 60.174
>>> K/sec
>>> 233,420 cycles:u # 0.299 GHz
>>> 133,318 instructions:u # 0.57
>>> insn per cycle
>>> 31,396 branches:u # 40.196
>>> M/sec
>>> 2,334 branch-misses:u # 7.43% of
>>> all branches
>>> 1,167,100 TOPDOWN.SLOTS:u # 12.2 %
>>> tma_retiring
>>> # 28.9 %
>>> tma_backend_bound
>>> # 41.0 %
>>> tma_frontend_bound
>>> # 18.0 %
>>> tma_bad_speculation
>>> 141,882 topdown-retiring:u
>>> 480,570 topdown-fe-bound:u
>>> 320,380 topdown-be-bound:u
>>> 224,266 topdown-bad-spec:u
>>> 2,173 INT_MISC.UOP_DROPPING:u # 2.782
>>> M/sec
>>> 3,323 cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #
>>> 4.254 M/sec
>>>
>>>
>>> 0.002036744 seconds time elapsed
>>>
>>> 0.002252000 seconds user
>>> 0.000000000 seconds sys
>>> ```
>>>
This is apparently on a TigerLake, which has TopdownL1 with the perf
stat default. I don't have a problem with this.
Thanks,
Kan
>>> Thanks,
>>> Ian
>>>
>>>>> 122,905 instructions:u # 0.70 insn per cycle
>>>>> 28,264 branches:u # 22.456 M/sec
>>>>> 2,405 branch-misses:u # 8.51% of all branches
>>>>>
>>>>> 0.003834565 seconds time elapsed
>>>>>
>>>>> 0.000000000 seconds user
>>>>> 0.004130000 seconds sys
>>>>> ```
>>>>>
>>>>> When the events can have kernel/hypervisor disabled, like on
>>>>> Tigerlake, then it continues to succeed as:
>>>>>
>>>>> ```
>>>>> $ perf stat true
>>>>>
>>>>> Performance counter stats for 'true':
>>>>>
>>>>> 0.57 msec task-clock:u # 0.385 CPUs utilized
>>>>> 0 context-switches:u # 0.000 /sec
>>>>> 0 cpu-migrations:u # 0.000 /sec
>>>>> 47 page-faults:u # 82.329 K/sec
>>>>> 287,017 cycles:u # 0.503 GHz
>>>>> 133,318 instructions:u # 0.46 insn per cycle
>>>>> 31,396 branches:u # 54.996 M/sec
>>>>> 2,442 branch-misses:u # 7.78% of all branches
>>>>> 998,790 TOPDOWN.SLOTS:u # 14.5 % tma_retiring
>>>>> # 27.6 % tma_backend_bound
>>>>> # 40.9 % tma_frontend_bound
>>>>> # 17.0 % tma_bad_speculation
>>>>> 144,922 topdown-retiring:u
>>>>> 411,266 topdown-fe-bound:u
>>>>> 258,510 topdown-be-bound:u
>>>>> 184,090 topdown-bad-spec:u
>>>>> 2,585 INT_MISC.UOP_DROPPING:u # 4.528 M/sec
>>>>> 3,434 cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u # 6.015 M/sec
>>>>>
>>>>> 0.001480954 seconds time elapsed
>>>>>
>>>>> 0.000000000 seconds user
>>>>> 0.001686000 seconds sys
>>>>> ```
>>>>>
>>>>> And this likewise works if paranoia allows or running as root.
>>>>>
>>>>> v2. Don't display the skipped events as <not counted> or <not supported>.
>>>>>
>>>>> Signed-off-by: Ian Rogers <irogers@xxxxxxxxxx>
>>>>> ---
>>>>> tools/perf/builtin-stat.c | 39 ++++++++++++++++++++++++++--------
>>>>> tools/perf/util/evsel.c | 15 +++++++++++--
>>>>> tools/perf/util/evsel.h | 1 +
>>>>> tools/perf/util/stat-display.c | 4 ++++
>>>>> 4 files changed, 48 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>>>>> index d3cbee7460fc..7a641a67486d 100644
>>>>> --- a/tools/perf/builtin-stat.c
>>>>> +++ b/tools/perf/builtin-stat.c
>>>>> @@ -667,6 +667,13 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
>>>>> evsel_list->core.threads->err_thread = -1;
>>>>> return COUNTER_RETRY;
>>>>> }
>>>>> + } else if (counter->skippable) {
>>>>> + if (verbose > 0)
>>>>> + ui__warning("skipping event %s that kernel failed to open .\n",
>>>>> + evsel__name(counter));
>>>>> + counter->supported = false;
>>>>> + counter->errored = true;
>>>>> + return COUNTER_SKIP;
>>>>> }
>>>>>
>>>>> evsel__open_strerror(counter, &target, errno, msg, sizeof(msg));
>>>>> @@ -1885,15 +1892,29 @@ static int add_default_attributes(void)
>>>>> * Add TopdownL1 metrics if they exist. To minimize
>>>>> * multiplexing, don't request threshold computation.
>>>>> */
>>>>> - if (metricgroup__has_metric("TopdownL1") &&
>>>>> - metricgroup__parse_groups(evsel_list, "TopdownL1",
>>>>> - /*metric_no_group=*/false,
>>>>> - /*metric_no_merge=*/false,
>>>>> - /*metric_no_threshold=*/true,
>>>>> - stat_config.user_requested_cpu_list,
>>>>> - stat_config.system_wide,
>>>>> - &stat_config.metric_events) < 0)
>>>>> - return -1;
>>>>> + if (metricgroup__has_metric("TopdownL1")) {
>>>>> + struct evlist *metric_evlist = evlist__new();
>>>>> + struct evsel *metric_evsel;
>>>>> +
>>>>> + if (!metric_evlist)
>>>>> + return -1;
>>>>> +
>>>>> + if (metricgroup__parse_groups(metric_evlist, "TopdownL1",
>>>>> + /*metric_no_group=*/false,
>>>>> + /*metric_no_merge=*/false,
>>>>> + /*metric_no_threshold=*/true,
>>>>> + stat_config.user_requested_cpu_list,
>>>>> + stat_config.system_wide,
>>>>> + &stat_config.metric_events) < 0)
>>>>> + return -1;
>>>>> +
>>>>> + evlist__for_each_entry(metric_evlist, metric_evsel) {
>>>>> + metric_evsel->skippable = true;
>>>>> + }
>>>>> + evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
>>>>> + evlist__delete(metric_evlist);
>>>>> + }
>>>>> +
>>>>> /* Platform specific attrs */
>>>>> if (evlist__add_default_attrs(evsel_list, default_null_attrs) < 0)
>>>>> return -1;
>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>> index a85a987128aa..83a65f771666 100644
>>>>> --- a/tools/perf/util/evsel.c
>>>>> +++ b/tools/perf/util/evsel.c
>>>>> @@ -290,6 +290,7 @@ void evsel__init(struct evsel *evsel,
>>>>> evsel->per_pkg_mask = NULL;
>>>>> evsel->collect_stat = false;
>>>>> evsel->pmu_name = NULL;
>>>>> + evsel->skippable = false;
>>>>> }
>>>>>
>>>>> struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
>>>>> @@ -1720,9 +1721,13 @@ static int get_group_fd(struct evsel *evsel, int cpu_map_idx, int thread)
>>>>> return -1;
>>>>>
>>>>> fd = FD(leader, cpu_map_idx, thread);
>>>>> - BUG_ON(fd == -1);
>>>>> + BUG_ON(fd == -1 && !leader->skippable);
>>>>>
>>>>> - return fd;
>>>>> + /*
>>>>> + * When the leader has been skipped, return -2 to distinguish from no
>>>>> + * group leader case.
>>>>> + */
>>>>> + return fd == -1 ? -2 : fd;
>>>>> }
>>>>>
>>>>> static void evsel__remove_fd(struct evsel *pos, int nr_cpus, int nr_threads, int thread_idx)
>>>>> @@ -2104,6 +2109,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
>>>>>
>>>>> group_fd = get_group_fd(evsel, idx, thread);
>>>>>
>>>>> + if (group_fd == -2) {
>>>>> + pr_debug("broken group leader for %s\n", evsel->name);
>>>>> + err = -EINVAL;
>>>>> + goto out_close;
>>>>> + }
>>>>> +
>>>>> test_attr__ready();
>>>>>
>>>>> /* Debug message used by test scripts */
>>>>> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
>>>>> index 68072ec655ce..98afe3351176 100644
>>>>> --- a/tools/perf/util/evsel.h
>>>>> +++ b/tools/perf/util/evsel.h
>>>>> @@ -95,6 +95,7 @@ struct evsel {
>>>>> bool weak_group;
>>>>> bool bpf_counter;
>>>>> bool use_config_name;
>>>>> + bool skippable;
>>>>> int bpf_fd;
>>>>> struct bpf_object *bpf_obj;
>>>>> struct list_head config_terms;
>>>>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
>>>>> index e6035ecbeee8..6b46bbb3d322 100644
>>>>> --- a/tools/perf/util/stat-display.c
>>>>> +++ b/tools/perf/util/stat-display.c
>>>>> @@ -810,6 +810,10 @@ static bool should_skip_zero_counter(struct perf_stat_config *config,
>>>>> struct perf_cpu cpu;
>>>>> int idx;
>>>>>
>>>>> + /* Skip counters that were speculatively/default enabled rather than requested. */
>>>>> + if (counter->skippable)
>>>>> + return true;
>>>>> +
>>>>> /*
>>>>> * Skip value 0 when enabling --per-thread globally,
>>>>> * otherwise it will have too many 0 output.