Re: [PATCH v10 4/5] arm64: perf: Enable PMU counter userspace access for perf event

From: Rob Herring
Date: Thu Oct 14 2021 - 15:25:05 EST


On Thu, Oct 14, 2021 at 11:58 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> Hi Rob,
>
> This looks pretty good!
>
> I have one largish query below, and otherwise only trivialities that I'm
> happy to fix up.
>
> On Tue, Sep 14, 2021 at 03:47:59PM -0500, Rob Herring wrote:
> > Arm PMUs can support direct userspace access of counters which allows for
> > low overhead (i.e. no syscall) self-monitoring of tasks. The same feature
> > exists on x86 called 'rdpmc'. Unlike x86, userspace access will only be
> > enabled for thread-bound events. This could be extended if needed, but
> > simplifies the implementation and reduces the chances for any
> > information leaks (which the x86 implementation suffers from).
> >
> > PMU EL0 access will be enabled when an event with userspace access is
> > part of the thread's context. This includes when the event is not
> > scheduled on the PMU. There's some additional overhead clearing
> > dirty counters when access is enabled in order to prevent leaking
> > disabled counter data from other tasks.
> >
> > Unlike x86, enabling of userspace access must be requested with a new
> > attr bit: config1:1. If the user requests userspace access and 64-bit
> > counters, then chaining will be disabled and the user will get the
> > maximum size counter the underlying h/w can support. The modes for
> > config1 are as follows:
> >
> > config1 = 0 : user access disabled and always 32-bit
> > config1 = 1 : user access disabled and always 64-bit (using chaining if needed)
> > config1 = 2 : user access enabled and always 32-bit
> > config1 = 3 : user access enabled and counter size matches underlying counter.
>
> We probably need to note somewhere (i.e. in the next patch) that we mean
> *logically* 32-bit, and this could be a biased 64-bit counter, so
> userspace needs to treat the upper 32 bits of counters as UNKNOWN.

Okay, though this detail doesn't matter if the user uses the correct
read loop (now in libperf).
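
For reference, the read sequence documented for the mmapped event page
masks the raw value down to pc->pmc_width bits before adding the
offset, so any UNKNOWN upper bits of a logically 32-bit counter get
thrown away. Roughly (a sketch only; read_hw_counter() stands in for
the arch-specific mrs-based read and is not a real libperf symbol):

  #include <stdint.h>
  #include <linux/perf_event.h>

  #define barrier() asm volatile("" ::: "memory")

  /* Hypothetical helper: arch-specific raw counter read (mrs on arm64). */
  extern uint64_t read_hw_counter(uint32_t idx);

  static uint64_t mmap_read_count(struct perf_event_mmap_page *pc)
  {
          uint32_t seq, idx, width;
          uint64_t count, pmc;

          do {
                  seq = pc->lock;
                  barrier();

                  idx = pc->index;        /* 0: no counter, fall back to read() */
                  count = pc->offset;
                  if (pc->cap_user_rdpmc && idx) {
                          width = pc->pmc_width;
                          pmc = read_hw_counter(idx - 1);
                          /* Sign-extend from pmc_width bits; upper bits are UNKNOWN. */
                          pmc <<= 64 - width;
                          count += (int64_t)pmc >> (64 - width);
                  }

                  barrier();
          } while (pc->lock != seq);

          return count;
  }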

> For the `config1 = 3` case (potentially) overriding the usual long
> semantic, I'm struggling to understand why we need that rather than
> forcing the use of a 64-bit counter, because in that case:
>
> * For a CPU_CYCLES event:
>   __armv8_pmuv3_map_event() will always pick 64-bits
>   get_event_idx() may fail to allocate a 64-bit counter.
>
> * For other events:
>   __armv8_pmuv3_map_event() will pick 32/64 based on long counter
>   support
>   get_event_idx() will only fail if there are no counters free.
>
> Whereas if __armv8_pmuv3_map_event() returned an error for the latter
> when long counter support is not implemented, we'd have consistent
> `long` semantics, and the CPU_CYCLES behaviour would be identical.
>
> What's the rationale for `3` leaving the choice to the kernel?

It's the "give me the maximum sized counter the h/w can support" choice.
That's easier for userspace to implement. Bit 1 is more of a hint that
the user wants userspace access than a hard requirement.

> If the problem is discoverability, I'd be happy to add something to
> sysfs to describe whether the PMU has long event support.

Checking sysfs, or trying 64-bit support and then falling back to
32-bit, isn't much of a difference.

Keep in mind that x86 always succeeds here. Every userspace user will
have to add whatever dance we create here. For example, each libperf
test with user access (there are only 2 in my tree, but there's a
series adding more) has to have an '#ifdef __aarch64__' for whatever we
do here. I was seeking to minimize that. Right now, that's just setting
config1 to 0x3. Also, note that libperf will opportunistically use a
userspace read instead of read(). The user just has to mmap the event,
and libperf will use a userspace read when it's enabled, which
ultimately depends on what the mmapped page says.
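
To make that concrete, the userspace setup I have in mind is roughly
the following (a sketch only; error handling is trimmed and
open_user_counter() is just an illustrative name, not libperf API):

  #include <string.h>
  #include <unistd.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <linux/perf_event.h>

  /* Open a self-monitored event with config1 = 0x3 and map its user page. */
  static struct perf_event_mmap_page *open_user_counter(__u64 config)
  {
          struct perf_event_attr attr;
          void *page;
          int fd;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = config;           /* e.g. PERF_COUNT_HW_CPU_CYCLES */
          attr.config1 = 0x3;             /* user access + max counter size */
          attr.exclude_kernel = 1;

          /* Thread-bound event: pid = 0 (current task), cpu = -1 (any CPU). */
          fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
          if (fd < 0)
                  return NULL;

          page = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
          if (page == MAP_FAILED)
                  return NULL;

          /* Direct reads are only usable if cap_user_rdpmc ends up set. */
          return page;
  }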

[...]

> > @@ -995,9 +1050,23 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
> >                                        &armv8_pmuv3_perf_cache_map,
> >                                        ARMV8_PMU_EVTYPE_EVENT);
> >
> > -        if (armv8pmu_event_is_64bit(event))
> > +        /*
> > +         * At this point, the counter is not assigned. If a 64-bit counter is
> > +         * requested, we must make sure the h/w has 64-bit counters if we set
> > +         * the event size to 64-bit because chaining is not supported with
> > +         * userspace access. This may still fail later on if the CPU cycle
> > +         * counter is in use.
> > +         */
> > +        if (armv8pmu_event_is_64bit(event) &&
> > +            (!armv8pmu_event_want_user_access(event) ||
> > +             armv8pmu_has_long_event(armpmu) || (hw_event_id == ARMV8_PMUV3_PERFCTR_CPU_CYCLES)))
> >                  event->hw.flags |= ARMPMU_EVT_64BIT;
>
> If we can follow my suggestion in reply to the cover text, we can make
> this:
>
>         if (armv8pmu_event_is_64bit(event))
>                 event->hw.flags |= ARMPMU_EVT_64BIT;
>
>         /*
>          * User events must be allocated into a single counter, and so
>          * must not be chained.
>          *
>          * Most 64-bit events require long counter support, but 64-bit
>          * CPU_CYCLES events can be placed into the dedicated cycle
>          * counter when this is free.
>          */
>         if (armv8pmu_event_want_user_access(event)) {
>                 if (armv8pmu_event_is_64bit(event) &&
>                     (hw_event_id != ARMV8_PMUV3_PERFCTR_CPU_CYCLES) &&
>                     !armv8pmu_has_long_event(armpmu))
>                         return -EINVAL;
>         }
>
> > +        /* Userspace counter access only enabled if requested and a per task event */
> > +        if (sysctl_perf_user_access && armv8pmu_event_want_user_access(event) &&
> > +            (event->attach_state & PERF_ATTACH_TASK))
> > +                event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
>
> Can we please explicitly reject !PERF_ATTACH_TASK case?
>
> If the user requested something we don't intend to support, I'd rather
> return -EINVAL here, rather than continue on.

This is similar to the 64-bit case, though I'm somewhat less concerned
here given that per-cpu events aren't very useful with userspace access
and the setup is already a bit different.
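
For what it's worth, the explicit rejection you're describing would
look something like this in the map_event path (a sketch against the
hunk quoted above, not what v10 does):

        if (armv8pmu_event_want_user_access(event)) {
                /* Only per-task events support userspace counter access. */
                if (!(event->attach_state & PERF_ATTACH_TASK))
                        return -EINVAL;
                if (sysctl_perf_user_access)
                        event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
        }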

Rob