Re: [PATCH RFC] KVM: arm64: PMU: Use multiple host PMUs

From: Akihiko Odaki
Date: Fri Mar 21 2025 - 02:21:02 EST


On 2025/03/21 2:14, Marc Zyngier wrote:
On Thu, 20 Mar 2025 09:52:59 +0000,
Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2025/03/20 18:10, Marc Zyngier wrote:
On Thu, 20 Mar 2025 06:03:35 +0000,
Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2025/03/20 3:51, Oliver Upton wrote:
On Wed, Mar 19, 2025 at 06:38:38PM +0000, Marc Zyngier wrote:
On Wed, 19 Mar 2025 11:51:21 +0000, Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:
What about setting the flag automatically when a user fails to pin
vCPUs to CPUs covered by one PMU? There would be no change for users
who already pin vCPUs correctly. Otherwise, they will see a correct
feature set advertised to the guest and a working cycle counter.

How do you know that the affinity is "correct"? VCPU affinity can be
changed at any time. I, for one, do not want my VMs to change
behaviour because I let the vcpus bounce around as the scheduler sees
fit.

By checking the affinity when picking the default PMU; even now, the
vCPU affinity is the only thing that determines the choice of the
default PMU.

Perhaps we may model the API as follows: introduce another "composite"
PMU that works on any core but only exposes the cycle counter. Robust
VMMs will choose it or one of hardware PMUs with
KVM_ARM_VCPU_PMU_V3_SET_PMU. KVM will choose the default PMU according
to the vCPU affinity at the point of KVM_ARM_VCPU_INIT otherwise. If
the affinity is covered by one hardware PMU, that PMU will be chosen
as the default. The "composite" PMU will be the default otherwise.

This makes no sense to me. A VCPU is always affine to a PMU, because
we do not support configurations where only some CPUs have a PMU. This
is an all-or-nothing situation.

At least, isn't it fine to have the composite PMU selectable via a new
value for KVM_ARM_VCPU_PMU_V3_SET_PMU?

Not sure KVM_ARM_VCPU_PMU_V3_SET_PMU is the right hook (it takes a PMU
'type', which is under control of the perf subsystem). But if we can
find a value that is guaranteed to be unique, why not.

More importantly, you keep suggesting the same "new default", and I
keep saying NO.

My position is clear: if you want a *new* behaviour, you *must* add a
new flag that the VMM explicitly provides to enable this CC-only PMU.
No change in default behaviour at all.

I'm not going to move from that.

Why not? It will not break anything guaranteed to work in the past.

It doesn't have to be guaranteed. It just has to *exist*. That's the
Linux ABI for you.

Currently KVM guarantees that the emulated PMU counts correctly only
when
1) the vCPU affinity is contained within one PMU and
2) that affinity will not expand

Breaking these conditions will make the behavior of the emulated PMU
undefined. Now I'm proposing to remove 1).

And I'm saying no to that. I'm also getting tired of arguing the same
point on and on.

We currently have two different behaviours:

- either you explicitly set a PMU, and the affinity of this PMU
constraints the affinity of the vcpus. The vcpus are not allowed to
run outside of this affinity. Everything counts all the time.

- or you don't explicitly set a PMU, and a default PMU will be picked
from the current affinity of the first vcpu. Your vcpus will be able
to run anywhere, but the virtual PMU will *only* count when the
vcpus are affine to the default PMU. When the vcpus are not affine
to the default PMU, *nothing* counts.

These two behaviours are ABI. They are not changing. They don't get
relaxed, they don't get tightened, they stay just as they are,
forever.

Is the latter one really ABI? I see it as undefined behavior that is not part of the ABI, for the following reasons:

1) It depends on the scheduler behavior, which cannot be ABI.

2) It provides a broken PMU, so the proposed behavioral change is akin to a bug fix, though I call it undefined behavior rather than a bug since it is explicitly stated that there is no assurance the PMU works in such a scenario.

3) Userspace could not have relied on it, so the "no regressions" rule cannot apply here; how could anyone write userspace that relies on a kernel behavior that depends on scheduling?

As for 3), Oliver raised a concern about guest compatibility, so I'd like to hear the reasoning behind that concern.


You want a *third* behaviour, go ahead. Define it the way you want.
But the behaviours described above will stay unchanged.

I'm looking forward to your patches implementing it, but I am also
done arguing on it.

I understand the discussion is tiring, but I want to know the reasoning behind this design decision before sending an RFC patch to a VMM (QEMU), so that I can in turn explain why it is necessary.

Regards,
Akihiko Odaki


M.