Re: [PATCH v2 2/3] perf: Add driver for Arm NI-700 interconnect PMU

From: Robin Murphy
Date: Fri Aug 16 2024 - 14:05:57 EST


On 15/08/2024 6:04 pm, Mark Rutland wrote:
On Wed, Jul 10, 2024 at 05:09:34PM +0100, Robin Murphy wrote:
The Arm NI-700 Network-on-Chip Interconnect has a relatively
straightforward design with a hierarchy of voltage, power, and clock
domains, where each clock domain then contains a number of interface
units and a PMU which can monitor events thereon. As such, it begets a
relatively straightforward driver to interface those PMUs with perf.

Even more so than with arm-cmn, users will require detailed knowledge of
the wider system topology in order to meaningfully analyse anything,
since the interconnect itself cannot know what lies beyond the boundary
of each inscrutably-numbered interface. Given that, for now they are
also expected to refer to the NI-700 documentation for the relevant
event IDs to provide as well. An identifier is implemented so we can
come back and add jevents if anyone really wants to.

IIUC the relevant documentation would be the NI-700 TRM:

https://developer.arm.com/documentation/101566/0203/?lang=en

... right?

That's the one.

[...]

+====================================
+Arm Network-on Chip Interconnect PMU
+====================================
+
+NI-700 and friends implement a distinct PMU for each clock domain within the
+interconnect. Correspondingly, the driver exposes multiple PMU devices named
+arm_ni_<x>_cd_<y>, where <x> is an (abritrary) instance identifier and <y> is
+the clock domain ID within that particular instance. If multiple NI instances
+exist within a system, the PMU devices can be correlated with the underlying
+hardware instance via sysfs parentage.

I suspect that name suffixing *might* confuse userspace in some cases,
since IIUC the perf tool tries to aggregate some_pmu_<number> instances,
and here it would presumably aggregate all the clock domains as a
arm_ni_<x>_cd PMU.

Indeed that is liable to be particularly funky since different domains may well contain different sets of interface types and thus not even necessarily share the same set of base events. Either way, events are targeted at specific interfaces so aggregation at any level would generally be meaningless.

We should check how that behaves and speak to the userspace folk about
this; taking a step back we probably need a better way for determining
when things can or should be aggregated. >
I assume we don't have HW to test on, but we should be able to fake up
the sysfs hierarchy and see how "perf list" treats that.

I see there is an available FVP for the Total Compute TC2 system[1] which includes an NI-700 instance - it doesn't model the PMU counters, but the rest should work enough for discovery. I might have a go with that next week...

[...]

+static int arm_ni_validate_group(struct perf_event *event)
+{
+ struct perf_event *sibling, *leader;
+ struct arm_ni_val val = { 0 };
+
+ arm_ni_validate_event(event, &val);
+
+ leader = event->group_leader;
+ if (leader != event && !arm_ni_validate_event(leader, &val))
+ return - EINVAL;
+
+ for_each_sibling_event(sibling, leader) {
+ if (!arm_ni_validate_event(sibling, &val))
+ return - EINVAL;
+ }
+ return 0;
+}

As a trivial nit, something has gone wrong with spacing and you have:

return - EINVAL;

... rather than:

return -EINVAL;

As a more substantial thing, this will trigger splats when lockdep is
enabled and an event is opened where event == event->group_leader,
because for_each_sibling_event(..., event) checks event->ctx->mutex is
held, and event->ctx isn't initialised until pmu::event_init() returns.

Oof, I even remember looking at that report, so I don't really have any excuse!

That's a latent bug in many PMU drivers at the moment, and is on my TODO
list to fix elsewhere. For now, can you make the above:

| static int arm_ni_validate_group(struct perf_event *event)
| {
| struct perf_event *sibling, *leader = event->group_leader;
| struct arm_ni_val val = { 0 };
|
| if (event == leader)
| return 0;
|
| arm_ni_validate_event(event, &val);
|
| if (!arm_ni_validate_event(leader, &val))
| return -EINVAL;
|
| for_each_sibling_event(sibling, leader) {
| if (!arm_ni_validate_event(sibling, &val))
| return -EINVAL;
| }
|
| return 0;
| }

... where the early exit for the (event == leader) case will avoid the
bad call.

Took me a moment to see why it's OK to skip "validating" the event as leader itself, but that's arguably just as much down to my naming. I'll polish that up a bit further.

[...]

+static int arm_ni_init_cd(struct arm_ni *ni, struct arm_ni_node *node, u64 res_start)
+{
+ struct arm_ni_cd *cd = ni->cds + node->id;
+ const char *name;
+ int err;

+ cd->cpu = cpumask_local_spread(0, dev_to_node(ni->dev));

Can dev_to_node(ni->dev) return NUMA_NO_NODE, and if so, do we need to
error out here? ...

It can, and cpumask_local_spread() just picks from cpu_online_mask in that case. It's not an error either way, the intent is just that if we do have NUMA, then it makes sense to try to associate with a physically-close CPU.

+static int arm_ni_pmu_online_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
+{
+ struct arm_ni_cd *cd;
+ int node;
+
+ cd = hlist_entry_safe(cpuhp_node, struct arm_ni_cd, cpuhp_node);
+ node = dev_to_node(cd_to_ni(cd)->dev);
+ if (node != NUMA_NO_NODE && cpu_to_node(cd->cpu) != node && cpu_to_node(cpu) == node)
+ arm_ni_pmu_migrate(cd, cpu);
+ return 0;
+}

... since we expect to handle similar here ...

...and conversely, without NUMA there's no point even asking the question of whether a new CPU might be more local than our current one.

+
+static int arm_ni_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
+{
+ struct arm_ni_cd *cd;
+ unsigned int target;
+ int node;
+
+ cd = hlist_entry_safe(cpuhp_node, struct arm_ni_cd, cpuhp_node);
+ if (cpu != cd->cpu)
+ return 0;
+
+ node = dev_to_node(cd_to_ni(cd)->dev);
+ target = cpumask_any_and_but(cpumask_of_node(node), cpu_online_mask, cpu);
+ if (target >= nr_cpu_ids)
+ target = cpumask_any_but(cpu_online_mask, cpu);
+
+ if (target < nr_cpu_ids)
+ arm_ni_pmu_migrate(cd, target);
+ return 0;
+}

... though not here, and overall that seems inconsistent.

I did start down that rabbit hole, but gave up in despair around the point of the "#ifdef CONFIG_NUMA" inside the "#ifndef CONFIG_NUMA". My conclusion was that the majority of cpumask_of_node() implementations do handle NUMA_NO_NODE, making it appear expected, and there is already other code assuming so, so the ones which don't were not my problem.

Thanks,
Robin.