Re: [PATCH RFC] KVM: arm64: PMU: Use multiple host PMUs

From: Akihiko Odaki
Date: Wed Mar 19 2025 - 07:31:24 EST


On 2025/03/19 20:07, Marc Zyngier wrote:
On Wed, 19 Mar 2025 10:26:57 +0000,
Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

This should also be why the perf tool creates an event for each
PMU; tools/perf/Documentation/intel-hybrid.txt describes it in more
detail.

But perf on non-Intel hardware behaves pretty differently. ARM PMUs
behave differently because there is no guarantee of homogeneous
events.

It works in the same way in this particular respect (i.e., "perf
stat -e cycles -a" creates events for all PMUs).

But it then becomes a system-wide counter, and that's not what KVM
needs to do.

There is also an example of profiling a single program:
"perf stat -e cycles -- taskset -c 16 ./triad_loop"

This also creates events for all PMUs.
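
For illustration, a minimal userspace sketch of that behaviour: one
cycles event per host PMU, each opened through the PMU's dynamic type
number from sysfs. The PMU instance names and the raw 0x11 (ARMv8
CPU_CYCLES) event number are assumptions for the example; this is not
the actual tools/perf code.

    /* One cycles event per host PMU, as perf-stat does on
     * heterogeneous systems. */
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <stdio.h>
    #include <unistd.h>

    static int open_cycles_on_pmu(const char *pmu, pid_t pid, int cpu)
    {
            struct perf_event_attr attr = {0};
            char path[128];
            unsigned int type = 0;
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/bus/event_source/devices/%s/type", pmu);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            if (fscanf(f, "%u", &type) != 1)
                    type = 0;
            fclose(f);
            if (!type)
                    return -1;

            attr.size = sizeof(attr);
            attr.type = type;   /* dynamic PMU id, not PERF_TYPE_HARDWARE */
            attr.config = 0x11; /* ARMv8 CPU_CYCLES architectural event */
            attr.disabled = 1;

            return syscall(SYS_perf_event_open, &attr, pid, cpu, -1, 0);
    }

    int main(void)
    {
            /* PMU instance names are made up for the example. */
            static const char *const pmus[] = {
                    "armv8_pmuv3_0", "armv8_pmuv3_1",
            };

            for (unsigned int i = 0; i < sizeof(pmus) / sizeof(pmus[0]); i++) {
                    int fd = open_cycles_on_pmu(pmus[i], 0, -1);

                    if (fd < 0)
                            perror(pmus[i]);
                    else
                            close(fd);
            }
            return 0;
    }

Each event only counts while the task runs on CPUs backing its PMU,
which is why one event per PMU covers the whole system.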


Allowing the guest to enable more than one counter, or an event type
other than the cycle counter, is not the goal. Enabling another event
type may result in a garbage value, but I don't think that is worse
than the current situation, where the count stays at zero; please tell
me if I am missing something.

There is still room for improvement. Returning a garbage value may not
be worse than returning zero, but counters and event types not
supported by some cores shouldn't be advertised as available in the
first place. More concretely:

- The vCPU should be limited to run only on cores covered by PMUs when
KVM_ARM_VCPU_PMU_V3 is set.

That's userspace's job. Bind to the desired PMU, and run. KVM will
actively prevent you from running on the wrong CPU.

- The PMCR_EL0.N value advertised to the guest should be the minimum
across the host PMUs.

How do you find out? CPUs can be hotplugged long after a VM has
started, bringing in a new PMU with a different number of counters.

- The PMCEID0_EL0 and PMCEID1_EL0 values advertised to the guest
should be the AND of the corresponding values of the host PMUs (see
the sketch after this exchange).

Same problem.
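
The computation proposed above is just a min/AND across the host PMUs;
a minimal sketch, with hypothetical struct and field names:

    /* Intersect the capabilities of all host PMUs. */
    struct host_pmu_caps {
            unsigned int nr_counters;       /* PMCR_EL0.N */
            unsigned long long pmceid0;     /* PMCEID0_EL0 */
            unsigned long long pmceid1;     /* PMCEID1_EL0 */
    };

    static void guest_pmu_caps(const struct host_pmu_caps *pmu, int n,
                               struct host_pmu_caps *out)
    {
            out->nr_counters = ~0U;
            out->pmceid0 = ~0ULL;
            out->pmceid1 = ~0ULL;

            for (int i = 0; i < n; i++) {
                    /* Advertise an event only if every host PMU counts it. */
                    if (pmu[i].nr_counters < out->nr_counters)
                            out->nr_counters = pmu[i].nr_counters;
                    out->pmceid0 &= pmu[i].pmceid0;
                    out->pmceid1 &= pmu[i].pmceid1;
            }
    }

The hotplug objection is exactly that this set cannot be computed once
and for all.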

I guess special-casing the cycle counter is the only option if the
kernel is going to deal with this.

Indeed. I think Oliver's idea is the least bad of them all, but man,
this is really ugly.

Special-casing the cycle counter may make sense if we are going to fix
the advertised values of PMCR_EL0.N, PMCEID0_EL0, and PMCEID1_EL0, as
we can simply return zero for these registers. We can also prevent
enabling a counter that would return zero or a garbage value.
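
Concretely, the register handling could look something like the sketch
below; the function names are made up, and the only architectural fact
used is that PMCR_EL0.N lives in bits [15:11]:

    /* Hide all programmable counters and common events from the
     * guest, leaving only the fixed cycle counter usable. */
    static unsigned long long emulated_pmcr(unsigned long long host_pmcr)
    {
            /* Clear PMCR_EL0.N (bits [15:11]): zero counters. */
            return host_pmcr & ~(0x1fULL << 11);
    }

    static unsigned long long emulated_pmceid(void)
    {
            /* Advertise no common events. */
            return 0;
    }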

Do you think it's worth fixing these registers? If so, I'll do that by
special-casing the cycle counter.

I think this is really going in the wrong direction.

The whole design of the PMU emulation is that we expose a single,
architecturally correct PMU implementation. This is clearly
documented.

Furthermore, userspace is being given all the relevant information to
place vcpus on the correct physical CPUs. Why should we add this sort
of hack in the kernel, creating a new userspace ABI that we will have
to support forever, when userspace can do the correct thing right now?

Worst case, this is just a 'taskset' away, and everything will work.
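
In a VMM, the equivalent of that 'taskset' is one sched_setaffinity()
call per vCPU thread before KVM_RUN, as in this sketch. The CPU range
is an assumption for the example; a real VMM would read the PMU's cpus
mask from /sys/bus/event_source/devices/<pmu>/cpus.

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling (vCPU) thread to the CPUs backing one PMU. */
    static int pin_to_pmu_cpus(void)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            /* Assume the chosen PMU covers CPUs 4-7 (the big cluster). */
            for (int cpu = 4; cpu <= 7; cpu++)
                    CPU_SET(cpu, &set);

            /* pid 0 means the calling thread. */
            return sched_setaffinity(0, sizeof(set), &set);
    }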

It's surprisingly difficult to do that with libvirt, though of course
that is a userspace problem.

Sorry, I must admit I'm completely ignorant of libvirt. I tried it
years ago, and concluded that 95% of what I needed was adequately done
with a shell script...

Frankly, I'm not prepared to add more hacks to KVM for the sake of the
combination of broken userspace and broken guest.

The only counterargument I have in this regard is that some change is
still needed to expose all CPUs to a Windows guest even when userspace
does its best. That may result in odd scheduling, but it still gives
the best throughput.

But that'd be a new ABI, which again would require buy-in from
userspace. Maybe there is scope for an all-CPU, cycle-counter-only
PMUv3 exposed to the guest, but that cannot be set automatically, as
we would otherwise regress existing setups.
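
The shape of such an opt-in might be a new vCPU feature bit that
userspace has to set explicitly, as in this sketch.
KVM_ARM_VCPU_PMU_V3_CYCLE_ONLY is NOT a real UAPI flag; the name and
bit number are invented to stand in for whatever it would be called.

    #include <linux/kvm.h>      /* arm64 only */
    #include <sys/ioctl.h>

    #define KVM_ARM_VCPU_PMU_V3_CYCLE_ONLY  31  /* hypothetical */

    static int init_vcpu_cycle_only_pmu(int vcpu_fd)
    {
            struct kvm_vcpu_init init = {
                    .target = KVM_ARM_TARGET_GENERIC_V8,
            };

            init.features[0] |= 1U << KVM_ARM_VCPU_PMU_V3;  /* existing */
            init.features[0] |= 1U << KVM_ARM_VCPU_PMU_V3_CYCLE_ONLY;

            return ioctl(vcpu_fd, KVM_ARM_VCPU_INIT, &init);
    }

Existing setups that never set the new bit would keep today's
single-PMU behaviour.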

At this stage, and given that you need to change userspace, I'm not
sure what the best course of action is.

Having an explicit flag for userspace is fine for QEMU, which is what
I care about. It can flip the flag if and only if the vCPU threads are
not pinned to one PMU and the machine is a new setup.

I also wonder what regression you think setting it automatically would
cause.

Regards,
Akihiko Odaki


Thanks,

M.