Re: [PATCH] accel: ethosu: Add performance counter support

From: Rob Herring

Date: Tue May 12 2026 - 11:39:28 EST


On Tue, May 5, 2026 at 1:46 PM Maíra Canal <mcanal@xxxxxxxxxx> wrote:
>
> Hi Rob,

Thanks for the review Maíra!

> On 06/03/26 13:36, Rob Herring (Arm) wrote:
> > The Arm Ethos-U NPUs have a PMU with performance counters. The PMU h/w
> > supports up to 4 (U65) or 8 (U85) counters which can be programmed for
> > different events. There is also a dedicated cycle counter.
> >
> > The ABI and implementation are copied from the V3D driver. The main
>
> I'm planning to send a series fixing some perfmon locking issues in the
> V3D driver later this week. But, to give you a brief summary of the main
> issues:
>
> 1. The active perfmon is not protected against races (run_job() vs.
> IOCTLs).
>
> 2. ethosu_switch_perfmon() only starts/stops the perfmon in run_job().
> This means that if nothing is further queued up, a perfmon will never
> actually be stopped.

For ethosu, other than idle and cycle counts, nothing else counts when
a job is done. Idle is just the time from enabling the counter to
start a job and then from the job stopping to reading the counters.
IOW, the idle count while a job is running is 0 and
idle+active=cycles.


> More comments inline:
>
> > difference in the ABI is there is no query API for the the event list.
> > The events differ between the U65 and U85, so the events lists are
> > maintained in userspace along with other differences between the U65 and
> > U85.
> >
> > The cycle counter is always enabled when the PMU is enabled. When the
> > user requests N events, reading the counters will return the N events
> > plus the cycle counter.
> >
> > Signed-off-by: Rob Herring (Arm) <robh@xxxxxxxxxx>
> > ---
> > MR for mesa with initial support is here:
> > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40269
> >
> > drivers/accel/ethosu/Makefile | 2 +-
> > drivers/accel/ethosu/ethosu_device.h | 21 ++
> > drivers/accel/ethosu/ethosu_drv.c | 17 +-
> > drivers/accel/ethosu/ethosu_drv.h | 62 ++++++
> > drivers/accel/ethosu/ethosu_job.c | 35 +++-
> > drivers/accel/ethosu/ethosu_job.h | 2 +
> > drivers/accel/ethosu/ethosu_perfmon.c | 282 ++++++++++++++++++++++++++
> > include/uapi/drm/ethosu_accel.h | 59 +++++-
> > 8 files changed, 469 insertions(+), 11 deletions(-)
> > create mode 100644 drivers/accel/ethosu/ethosu_perfmon.c
> >
>
> [...]
>
> > diff --git a/drivers/accel/ethosu/ethosu_drv.h b/drivers/accel/ethosu/ethosu_drv.h
> > index 9e21dfe94184..57eee79e4b83 100644
> > --- a/drivers/accel/ethosu/ethosu_drv.h
> > +++ b/drivers/accel/ethosu/ethosu_drv.h
> > @@ -3,13 +3,75 @@
> > #ifndef __ETHOSU_DRV_H__
> > #define __ETHOSU_DRV_H__
> >
> > +#include <linux/idr.h>
> > +#include <linux/mutex.h>
> > #include <drm/gpu_scheduler.h>
> >
> > struct ethosu_device;
> > +struct drm_device;
> > +struct drm_file;
> >
> > struct ethosu_file_priv {
> > struct ethosu_device *edev;
> > struct drm_sched_entity sched_entity;
> > +
> > + struct {
> > + struct idr idr;
>
> You may want to consider moving to XArray as we did in V3D [1].

Okay. I'll have to study that some. I still don't have my head around XArray...

>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a5b0d095bcdb219348ed8ae1c97ee99fc4913b8
>
> > + struct mutex lock;
> > + } perfmon;
> > +};
> > +
> > +/* Performance monitor object. The perform lifetime is controlled by userspace
> > + * using perfmon related ioctls. A perfmon can be attached to a submit_cl
> > + * request, and when this is the case, HW perf counters will be activated just
> > + * before the submit_cl is submitted to the GPU and disabled when the job is
> > + * done. This way, only events related to a specific job will be counted.
> > + */
> > +struct ethosu_perfmon {
> > + /* Tracks the number of users of the perfmon, when this counter reaches
> > + * zero the perfmon is destroyed.
> > + */
> > + refcount_t refcnt;
> > +
> > + /* Protects perfmon stop, as it can be invoked from multiple places. */
> > + struct mutex lock;
>
> I'm not familiar with the Ethos hardware, but *if* it has only a single
> performance monitor active in HW at any given moment (like v3d), this
> lock doesn't enforce this restriction.
>
> This lock only serializes start/stop of *one* perfmon object against
> itself, but the invariant (at least on v3d) that needs protection is
> device-wide: there is exactly one active perfmon at any moment in HW.

You mean V3D has 1 perfmon shared among its 3? h/w queues? We only
have 1 queue. We only touch the perfmon when we touch the h/w to
run/end a job, so the same serialization for the job accesses should
cover the PMU accesses.

Rob