Re: [PATCH RFC] KVM: arm64: PMU: Use multiple host PMUs

From: Marc Zyngier
Date: Wed Mar 19 2025 - 07:22:50 EST


On Wed, 19 Mar 2025 10:26:57 +0000,
Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:
>
> >> It should also be the reason why the perf program creates an event for
> >> each PMU. tools/perf/Documentation/intel-hybrid.txt has more
> >> descriptions.
> >
> > But perf on non-Intel behaves pretty differently. ARM PMUs behaves
> > pretty differently, because there is no guarantee of homogeneous
> > events.
>
> It works in the same manner in this particular aspect (i.e., "perf
> stat -e cycles -a" creates events for all PMUs).

But it then becomes a system-wide counter, and that's not what KVM
needs to do.

> >> Allowing to enable more than one counter and/or an event type other
> >> than the cycle counter is not the goal. Enabling another event type
> >> may result in a garbage value, but I don't think it's worse than the
> >> current situation where the count stays zero; please tell me if I miss
> >> something.
> >>
> >> There is still room for improvement. Returning a garbage value may not
> >> be worse than returning zero, but counters and event types not
> >> supported by some cores shouldn't be advertised as available in the
> >> first place. More concretely:
> >>
> >> - The vCPU should be limited to run only on cores covered by PMUs when
> >> KVM_ARM_VCPU_PMU_V3 is set.
> >
> > That's userspace's job. Bind to the desired PMU, and run. KVM will
> > actively prevent you from running on the wrong CPU.
> >
> >> - PMCR_EL0.N advertised to the guest should be the minimum of ones of
> >> host PMUs.
> >
> > How do you find out? CPUs can be hot-plugged on long after a VM has
> > started, bringing in a new PMU, with a different number of counters.
> >
> >> - PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the
> >> result of the AND operations of ones of host PMUs.
> >
> > Same problem.
>
> I guess special-casing the cycle counter is the only option if the
> kernel is going to deal with this.

Indeed. I think Oliver's idea is the least bad of them all, but man,
this is really ugly.

> >> Special-casing the cycle counter may make sense if we are going to fix
> >> the advertised values of PMCR_EL0.N, PMCEID0_EL0, and
> >> PMCEID1_EL0. PMCR_EL0.N as we can simply return zero for these
> >> registers. We can also prevent enabling a counter that returns zero or
> >> a garbage value.
> >>
> >> Do you think it's worth fixing these registers? If so, I'll do that by
> >> special-casing the cycle counter.
> >
> > I think this is really going in the wrong direction.
> >
> > The whole design of the PMU emulation is that we expose a single,
> > architecturally correct PMU implementation. This is clearly
> > documented.
> >
> > Furthermore, userspace is being given all the relevant information to
> > place vcpus on the correct physical CPUs. Why should we add this sort
> > of hack in the kernel, creating a new userspace ABI that we will have
> > to support forever, when usespace can do the correct thing right now?
> >
> > Worse case, this is just a 'taskset' away, and everything will work.
>
> It's surprisingly difficult to do that with libvirt; of course it is a
> userspace problem though.

Sorry, I must admit I'm completely ignorant of libvirt. I tried it
years ago, and concluded that 95% of what I needed was adequately done
with a shell script...

> > Frankly, I'm not prepared to add more hacks to KVM for the sake of the
> > combination of broken userspace and broken guest.
>
> The only counter argument I have in this regard is that some change is
> also needed to expose all CPUs to Windows guest even when the
> userspace does its best. It may result in odd scheduling, but still
> gives the best throughput.

But that'd be a new ABI, which again would require buy-in from
userspace. Maybe there is scope for an all CPUs, cycle-counter only
PMUv3 exposed to the guest, but that cannot be set automatically, as
we would otherwise regress existing setups.

At this stage, and given that you need to change userspace, I'm not
sure what the best course of action is.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.