On Wed, 19 Mar 2025 10:26:57 +0000,
Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:
It should also be the reason why the perf program creates an event for
each PMU. tools/perf/Documentation/intel-hybrid.txt has more
descriptions.
But perf on non-Intel behaves pretty differently. ARM PMUs behaves
pretty differently, because there is no guarantee of homogeneous
events.
It works in the same manner in this particular aspect (i.e., "perf
stat -e cycles -a" creates events for all PMUs).
But it then becomes a system-wide counter, and that's not what KVM
needs to do.
Allowing to enable more than one counter and/or an event type other
than the cycle counter is not the goal. Enabling another event type
may result in a garbage value, but I don't think it's worse than the
current situation where the count stays zero; please tell me if I miss
something.
There is still room for improvement. Returning a garbage value may not
be worse than returning zero, but counters and event types not
supported by some cores shouldn't be advertised as available in the
first place. More concretely:
- The vCPU should be limited to run only on cores covered by PMUs when
KVM_ARM_VCPU_PMU_V3 is set.
That's userspace's job. Bind to the desired PMU, and run. KVM will
actively prevent you from running on the wrong CPU.
- PMCR_EL0.N advertised to the guest should be the minimum of ones of
host PMUs.
How do you find out? CPUs can be hot-plugged on long after a VM has
started, bringing in a new PMU, with a different number of counters.
- PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the
result of the AND operations of ones of host PMUs.
Same problem.
I guess special-casing the cycle counter is the only option if the
kernel is going to deal with this.
Indeed. I think Oliver's idea is the least bad of them all, but man,
this is really ugly.
Special-casing the cycle counter may make sense if we are going to fix
the advertised values of PMCR_EL0.N, PMCEID0_EL0, and
PMCEID1_EL0. PMCR_EL0.N as we can simply return zero for these
registers. We can also prevent enabling a counter that returns zero or
a garbage value.
Do you think it's worth fixing these registers? If so, I'll do that by
special-casing the cycle counter.
I think this is really going in the wrong direction.
The whole design of the PMU emulation is that we expose a single,
architecturally correct PMU implementation. This is clearly
documented.
Furthermore, userspace is being given all the relevant information to
place vcpus on the correct physical CPUs. Why should we add this sort
of hack in the kernel, creating a new userspace ABI that we will have
to support forever, when usespace can do the correct thing right now?
Worse case, this is just a 'taskset' away, and everything will work.
It's surprisingly difficult to do that with libvirt; of course it is a
userspace problem though.
Sorry, I must admit I'm completely ignorant of libvirt. I tried it
years ago, and concluded that 95% of what I needed was adequately done
with a shell script...
Frankly, I'm not prepared to add more hacks to KVM for the sake of the
combination of broken userspace and broken guest.
The only counter argument I have in this regard is that some change is
also needed to expose all CPUs to Windows guest even when the
userspace does its best. It may result in odd scheduling, but still
gives the best throughput.
But that'd be a new ABI, which again would require buy-in from
userspace. Maybe there is scope for an all CPUs, cycle-counter only
PMUv3 exposed to the guest, but that cannot be set automatically, as
we would otherwise regress existing setups.
At this stage, and given that you need to change userspace, I'm not
sure what the best course of action is.
Thanks,
M.