Re: [PATCH 2/2] perf tools: Improve IBS error handling

From: Kim Phillips
Date: Mon Nov 29 2021 - 17:12:35 EST


On 11/24/21 2:00 AM, kajoljain wrote:
On 11/23/21 8:55 PM, Kim Phillips wrote:
On 11/23/21 2:40 AM, kajoljain wrote:
On 10/8/21 12:47 AM, Kim Phillips wrote:
On 10/7/21 12:28 PM, Jiri Olsa wrote:
On Mon, Oct 04, 2021 at 04:41:14PM -0500, Kim Phillips wrote:
---
   tools/perf/util/evsel.c | 24 ++++++++++++++++++++++++
   1 file changed, 24 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index b915840690d4..f8a9cbd99314 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2743,9 +2743,22 @@ static bool find_process(const char *name)
       return ret ? false : true;
   }
   +static bool is_amd(const char *arch, const char *cpuid)
+{
+    return arch && !strcmp("x86", arch) && cpuid && strstarts(cpuid,
"AuthenticAMD");
+}
+
+static bool is_amd_ibs(struct evsel *evsel)
+{
+    return evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name,
"ibs", 3);
+}
+
   int evsel__open_strerror(struct evsel *evsel, struct target
*target,
                int err, char *msg, size_t size)
   {
+    struct perf_env *env = evsel__env(evsel);
+    const char *arch = perf_env__arch(env);
+    const char *cpuid = perf_env__cpuid(env);
       char sbuf[STRERR_BUFSIZE];
       int printed = 0, enforced = 0;
   @@ -2841,6 +2854,17 @@ int evsel__open_strerror(struct evsel
*evsel, struct target *target,
               return scnprintf(msg, size, "wrong clockid (%d).",
clockid);
           if (perf_missing_features.aux_output)
               return scnprintf(msg, size, "The 'aux_output' feature
is not supported, update the kernel.");
+        if (is_amd(arch, cpuid)) {
+            if (is_amd_ibs(evsel)) {

would single 'is_amd_ibs' call be better? checking on both amd and ibs

Good suggestion. If you look at the later patch in the
BRS series, I have rewritten it to add the new
AMD PMU like so:

  if (is_amd()) {
      if (is_amd_ibs()) {
          if (evsel->this)
              return
          if (evsel->that)
              return
      }
+    if (is_amd_brs()) {
+        if (evsel->this)
+            return
+        if (evsel->that)
+            return
+    }
  }

Hi Kim,
      From my point of view, it won't be a good idea of adding so many
checks in common function definition itself.
Can you just create a check to see if its amd machine and then add a
function call which will handle all four conditions together?

which is basically for:

+        if (is_amd(arch, cpuid)) {
+            if (is_amd_ibs(evsel)) {
+                if (evsel->core.attr.exclude_kernel)
+                    return scnprintf(msg, size,
+    "AMD IBS can't exclude kernel events.  Try running at a higher
privilege level.");
+                if (!evsel->core.system_wide)
+                    return scnprintf(msg, size,
+    "AMD IBS may only be available in system-wide/per-cpu mode.  Try
using
-a, or -C and workload affinity");
+            }

and this:

+            if (is_amd_brs(evsel)) {
+                if (evsel->core.attr.freq)
+                    return scnprintf(msg, size,
+    "AMD Branch Sampling does not support frequency mode sampling, must
pass a fixed sampling period via -c option or
cpu/branch-brs,period=xxxx/.");
+                /* another reason is that the period is too small */
+                return scnprintf(msg, size,
+    "AMD Branch Sampling does not support sampling period smaller than
what is reported in /sys/devices/cpu/caps/branches.");
+            }

IIRC, I tried something like that but carrying the


struct target *target, int err, char *msg, size_t size

parameters made things worse.

So, incase we are in amd machine,  common function evsel__open_strerror
will call function may be something like amd_evesel_open_strerror_check
which will look for both ibs and brs conditions and return corresponding
error statement.

The vast majority of decisions made by evsel__open_strerror are
going to be common across most arch/uarches.  AMD has only these
two pesky exceptions to the rule and therefore IMO it's ok
to have them inline with the common function, since the decisions
are so deeply intertwined.  A new amd_evsel_open_strerror_check
sounds like it'd duplicate too much of the common function code
in order to handle the common error cases.

Hi Kim,
Sorry for the confusion, what I meant by adding new function is just
to handle these corner error cases and not duplicating whole
evsel__open_strerror code.

Maybe something like below code, Its just prototype of code to show you
the flow, you can refine it and check for any build or indentation
issues using checkpatch.pl script.

So basically, in common function we can just have 2 calls, first to
check if we are in amd system and second to return corresponding error
message, rather then adding whole chunk of if's which are specific to amd.

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index ac0127be0459..adefb162ae08 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2852,9 +2852,40 @@ static bool find_process(const char *name)
return ret ? false : true;
}

+static bool is_amd(const char *arch, const char *cpuid)
+{
+ return arch && !strcmp("x86", arch) && cpuid && strstarts(cpuid,
"AuthenticAMD");
+}
+
+static int error_amd_ibs_brs(struct evsel *evsel, char *msg, size_t size)
+{
+ if (evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name,
"ibs", 3)) {
+ if (evsel->core.attr.exclude_kernel)
+ return scnprintf(msg, size,
+ "AMD IBS can't exclude kernel events. Try running at a higher
privilege level.");
+ if (!evsel->core.system_wide)
+ return scnprintf(msg, size,
+ "AMD IBS may only be available in system-wide/per-cpu mode. Try
using -a, or -C and workload affinity");
+ }
+
+ if (((evsel->core.attr.config & 0xff) == 0xc4) &&
(evsel->core.attr.sample_type & PERF_SAMPLE_BRANCH_STACK)) {
+ if (evsel->core.attr.freq) {
+ return scnprintf(msg, size,
+ "AMD Branch Sampling does not support frequency mode sampling,
must pass a fixed sampling
+ period via -c option or cpu/branch-brs,period=xxxx/.");
+ /* another reason is that the period is too small */
+ return scnprintf(msg, size,
+ "AMD Branch Sampling does not support sampling period smaller
than what is reported in /sys/devices/cpu/caps/branches.");
+ }
+ }
+}
+
int evsel__open_strerror(struct evsel *evsel, struct target *target,
int err, char *msg, size_t size)
{
+ struct perf_env *env = evsel__env(evsel);
+ const char *arch = perf_env__arch(env);
+ const char *cpuid = perf_env__cpuid(env);
char sbuf[STRERR_BUFSIZE];
int printed = 0, enforced = 0;

@@ -2950,6 +2981,8 @@ int evsel__open_strerror(struct evsel *evsel,
struct target *target,
return scnprintf(msg, size, "wrong clockid
(%d).", clockid);
if (perf_missing_features.aux_output)
return scnprintf(msg, size, "The 'aux_output'
feature is not supported, update the kernel.");
+ if (is_amd(arch, cpuid))
+ return error_amd_ibs_brs(evsel, msg, size);
break;
case ENODATA:
return scnprintf(msg, size, "Cannot collect data source
with the load latency event alone. "

That change will makes AMD machines fail to fall back to the default
"The sys_perf_event_open() syscall returned with..." error string
in case it's not those AMD IBS and BRS sub-conditions.

Is having the AMD error code checking in the main evsel__open_strerror()
so bad? Other arches and their PMU implementations may find error
conditions that they have in common with AMD's, therefore
opening up the code for opposite types of refactoring and
reuse than what is being requested here. E.g., I've seen
other hardware configurations - not specific to one architecture -
that could also use this message:

{"AMD IBS"->"%s",pmu_name} may only be available in system-wide/per-cpu mode. Try using -a, or -C and workload affinity");

Thanks,

Kim