Re: [PATCH v7 3/4] KVM: arm64: PMU: Introduce FIXED_COUNTERS_ONLY

From: Akihiko Odaki

Date: Mon Apr 20 2026 - 08:13:18 EST

On 2026/04/20 18:51, Marc Zyngier wrote:

On Mon, 20 Apr 2026 09:36:16 +0100,
Akihiko Odaki <odaki@xxxxxxxxxxxxxxxxxxxxxx> wrote:

On 2026/04/20 2:19, Marc Zyngier wrote:

On Sat, 18 Apr 2026 09:14:25 +0100,
Akihiko Odaki <odaki@xxxxxxxxxxxxxxxxxxxxxx> wrote:

On a heterogeneous arm64 system, KVM's PMU emulation is based on the
features of a single host PMU instance. When a vCPU is migrated to a
pCPU with an incompatible PMU, counters such as PMCCNTR_EL0 stop
incrementing.

Although this behavior is permitted by the architecture, Windows does
not handle it gracefully and may crash with a division-by-zero error.

The current workaround requires VMMs to pin vCPUs to a set of pCPUs
that share a compatible PMU. This is difficult to implement correctly in
QEMU/libvirt, where pinning occurs after vCPU initialization, and it
also restricts the guest to a subset of available pCPUs.

Introduce the KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY attribute to
create a "fixed-counters-only" PMU. When set, KVM exposes a PMU that is
compatible with all pCPUs but that does not support programmable
event counters which may have different feature sets on different PMUs.

This allows Windows guests to run reliably on heterogeneous systems
without crashing, even without vCPU pinning, and enables VMMs to
schedule vCPUs across all available pCPUs, making full use of the host
hardware.

Much like KVM_ARM_VCPU_PMU_V3_IRQ and other read-write attributes, this
attribute provides a getter that facilitates kernel and userspace
debugging/testing.

OK, so that's the sales pitch. But how is it implemented? I would like
to be able to read a high-level description of the implementation
trade-offs.

Implementation-wise it is very trivial. Essentially the following
addition in kvm_arm_pmu_v3_get_attr() is the entire implementation:
+	case KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY:
+		if (test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY,
+			     &vcpu->kvm->arch.flags))
+			return 0;

Both its functionality and code complexity are trivial. So we can argue that:
- the functionality is too trivial to be useful or
- the interface/implementation complexity is so trivial that it does not
incur maintenance burden

In this case the selftest uses the getter so I was more inclined to
have it, but adding one just for the selftest sounds too ad-hoc, so
here I looked into other attributes to ensure that it was not
introducing inconsistency with existing interfaces.

As a result, I found there are other read-write attributes; in fact
there are more read-write attributes than write-only ones.

You're completely missing the point. I'm referring to the whole of the
commit message, which is more of a marketing slide than a technical
description.

In terms of implementation, the obvious tradeoff is that it adds more code to implement the feature. One thing to note is that kvm_vcpu_load_pmu() is added and is called each time a vCPU migrates across pCPUs. The heavy part, making the KVM_REQ_RELOAD_PMU request, only happens when the feature is enabled.

I really don't care about the getter at this stage, which while
pointless, does not make things more awful than they already are.

Signed-off-by: Akihiko Odaki <odaki@xxxxxxxxxxxxxxxxxxxxxx>
---
Documentation/virt/kvm/devices/vcpu.rst | 29 ++++++
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/include/uapi/asm/kvm.h | 1 +
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/pmu-emul.c | 155 +++++++++++++++++++++++---------
include/kvm/arm_pmu.h | 2 +
6 files changed, 147 insertions(+), 43 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 60bf205cb373..e0aeb1897d77 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -161,6 +161,35 @@ explicitly selected, or the number of counters is out of range for the
selected PMU. Selecting a new PMU cancels the effect of setting this
attribute.
+1.6 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY
+------------------------------------------------------
+
+:Parameters: no additional parameter in kvm_device_attr.addr
+
+:Returns:
+
+  =======  =====================================================
+  -EBUSY   Attempted to set after initializing PMUv3 or running
+           VCPU, or attempted to set for the first time after
+           setting an event filter
+  -ENXIO   Attempted to get before setting
+  -ENODEV  Attempted to set while PMUv3 not supported
+  =======  =====================================================
+
+If set, PMUv3 will be emulated without programmable event counters. The VCPU
+will use any compatible hardware PMU. This attribute is particularly useful on

Not quite "any PMU". It will use *the* PMU of the physical CPU,
irrespective of the implementation.

I think:

- this comment
- one on the KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED note
- one on kvm_pmu_create_perf_event()
- and one on kvm_arm_pmu_v3_set_pmu_fixed_counters_only()

all boil down to one question: will it support all possible CPUs, or
will it support a subset? Let me answer here:

This patch is written to support a subset instead of all possible
CPUs. If a pCPU does not have a compatible PMU, the pCPU will not be
supported and cause KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED.

This is not a thing. Either *all* the CPUs have a PMU that can be used
for KVM, or PMU support is not offered to guests. That's a hard line
in the sand. And the code already upholds this by checking the
sanitised PMUVer field.

This patch does not enforce that all possible CPUs are covered by the
compatible PMUs. Theoretically speaking,
kvm_arm_pmu_get_pmuver_limit() enables the PMU emulation when real
PMUv3 hardware covers all possible CPUs *or* the relevant registers
can be trapped with IMPDEF, so some pCPU may not have a compatible PMU
and only provide the IMPDEF trapping.

How is that possible? Please describe the case where that can happen,
and I will make sure that such a system stops booting. The intent is
definitely that:

- for early CPUs, we take the minimal capability of all CPUs

- for late CPUs, either they match at least the capability recorded by
early CPUs, or they don't boot.
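
The early/late rule described above can be modelled in a few lines. This is only a userspace sketch under stated assumptions: the struct, the function names and the use of a plain integer "capability level" (larger means more capable) are all hypothetical, and it is not the kernel's cpufeature code.

```c
#include <stdbool.h>

/* Hypothetical model: a larger level means a more capable PMU. */
struct pmu_cap_state {
	bool have_early;	/* at least one early CPU was recorded */
	int system_level;	/* minimum level seen across early CPUs */
};

/* Early CPUs: the system-wide level is the minimum of all of them. */
static void record_early_cpu(struct pmu_cap_state *s, int level)
{
	if (!s->have_early || level < s->system_level)
		s->system_level = level;
	s->have_early = true;
}

/* Late CPUs: must at least match the recorded level, or be refused. */
static bool late_cpu_may_boot(const struct pmu_cap_state *s, int level)
{
	return !s->have_early || level >= s->system_level;
}
```

In this model a late CPU that falls below the recorded minimum is simply rejected, which is the "or they don't boot" half of the rule.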

All CPUs may trap the relevant registers with IMPDEF but some of them may not have compatible PMUs. As I wrote in the previous email, I don't think it will happen in practice.

Practically, I don't think any sane configuration will ever have such
subset support, so we can explicitly enforce that all possible CPUs
are covered by the compatible PMUs if desired.

That's not just desired. This is a requirement. And it is already
enforced AFAICS.

+heterogeneous systems where different hardware PMUs cover different physical
+CPUs. The compatibility of hardware PMUs can be checked with
+KVM_ARM_VCPU_PMU_V3_SET_PMU. All VCPUs in a VM share this attribute. It isn't
+possible to set it for the first time if a PMU event filter is already present.

"for the first time" gives the impression that it will work if you try
again. I'd rather we say that "This feature is incompatible with the
existence of a PMU event filter".

The following sequence will work:
1. Set KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY
2. Set KVM_ARM_VCPU_PMU_V3_FILTER
3. Set KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY

This is to make the behavior consistent with KVM_ARM_VCPU_PMU_V3_SET_PMU.

I don't think this is correct. Filtering is completely at odds with
this patch, and I don't want to have to reason about the combination.

kvm_arm_pmu_v3_set_pmu() has the following condition:

	if (kvm_vm_has_ran_once(kvm) ||
	    (kvm->arch.pmu_filter && kvm->arch.arm_pmu != arm_pmu)) {
		ret = -EBUSY;
		break;
	}

kvm_arm_pmu_v3_set_pmu_fixed_counters_only() has the corresponding condition for consistency:

	if (kvm_vm_has_ran_once(kvm) ||
	    (kvm->arch.pmu_filter &&
	     !test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY,
		       &kvm->arch.flags)))
		return -EBUSY;

We can of course kill the PMU event filter for FIXED_COUNTERS_ONLY. The filter is effectively no-op with FIXED_COUNTERS_ONLY and I don't think that consistency matters much.
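
To make the two options concrete, here is a small userspace model of the check. It is a sketch only: `struct vm_model` and both function names are hypothetical stand-ins, with the three booleans modelling kvm_vm_has_ran_once(), a non-NULL kvm->arch.pmu_filter and the FIXED_COUNTERS_ONLY flag. The first function mirrors the condition quoted above; the second is one possible reading of the stricter "incompatible with a filter" rule.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the pieces of struct kvm used by the check. */
struct vm_model {
	bool has_ran_once;		/* models kvm_vm_has_ran_once() */
	bool has_pmu_filter;		/* models kvm->arch.pmu_filter != NULL */
	bool fixed_counters_only;	/* models the FIXED_COUNTERS_ONLY flag */
};

/* Mirrors the quoted condition: a filter blocks only the first set. */
static int set_fixed_counters_only(struct vm_model *vm)
{
	if (vm->has_ran_once ||
	    (vm->has_pmu_filter && !vm->fixed_counters_only))
		return -EBUSY;

	vm->fixed_counters_only = true;
	return 0;
}

/* Stricter alternative: any existing filter makes the attribute fail. */
static int set_fixed_counters_only_strict(struct vm_model *vm)
{
	if (vm->has_ran_once || vm->has_pmu_filter)
		return -EBUSY;

	vm->fixed_counters_only = true;
	return 0;
}
```

With the first variant the sequence "set attribute, set filter, set attribute" succeeds, while "set filter, set attribute" returns -EBUSY. The model does not cover the other half of a strict rule, namely rejecting a filter that is set after the attribute.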

[...]

+	int i;
+
+	for_each_set_bit(i, &mask, 32) {
+		pmc = kvm_vcpu_idx_to_pmc(vcpu, i);
+		if (!pmc->perf_event)
+			continue;
+
+		cpu_pmu = to_arm_pmu(pmc->perf_event->pmu);
+		if (!cpumask_test_cpu(vcpu->cpu, &cpu_pmu->supported_cpus)) {
+			kvm_make_request(KVM_REQ_RELOAD_PMU, vcpu);
+			break;
+		}
+	}
+}
+

Why do we need to inflict this on VMs that do not have the fixed
counter restriction?

This function is to re-create the perf_event in case the current
perf_event does not support the pCPU because e.g., the pCPU is a
an E-core while the perf_event only covers the P-cores.

That's not what I meant. This code is only here to support the
fixed-function feature. It makes no sense outside of it, because *we
don't support counter migration across implementations*.

So what's the purpose of this stuff for the normal KVM setup?

None. It's only for this feature. We can add a check of the feature flag at the beginning of the function to avoid that loop.

And even then, all you have to reconfigure is the cycle counter. So
why the loop? All we want to find out is whether the cycle counter is
instantiated on the PMU that matches the current CPU.
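
A possible shape for the reduced check, combining the early feature-flag test with a cycle-counter-only lookup, is sketched below as a userspace model. Every type and name here is a hypothetical stand-in rather than kernel code: `supported_cpus` is a plain 64-bit mask (so the model assumes fewer than 64 CPUs), and `reload_requested` stands in for making the KVM_REQ_RELOAD_PMU request.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these are the kernel's types. */
struct pmu_model {
	uint64_t supported_cpus;	/* bit n set: this PMU covers CPU n */
};

struct vcpu_model {
	bool fixed_counters_only;		/* feature flag of the VM */
	int cpu;				/* pCPU the vCPU was just loaded on */
	const struct pmu_model *cycle_pmu;	/* PMU behind the cycle counter, or NULL */
	bool reload_requested;			/* models KVM_REQ_RELOAD_PMU */
};

static void vcpu_load_pmu(struct vcpu_model *vcpu)
{
	/* VMs without the feature pay only for this one test. */
	if (!vcpu->fixed_counters_only)
		return;

	/* No event created yet: nothing to re-create. */
	if (!vcpu->cycle_pmu)
		return;

	/* Request a reload only if the backing PMU does not cover this pCPU. */
	if (!(vcpu->cycle_pmu->supported_cpus & (UINT64_C(1) << vcpu->cpu)))
		vcpu->reload_requested = true;
}
```

The point of the sketch is that there is no loop: a single lookup answers whether the cycle counter is instantiated on a PMU that covers the current CPU.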

I just wanted to avoid hardcoding assumptions on the fixed
counter(s). FEAT_PMUv3_ICNTR will be naturally handled with a loop, for
example.

Well, not that loop, since ICNTR is counter 32. So please let's stop
the nonsense and only add what is required?
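
The objection is easy to check in isolation: a scan bounded at 32 bits can never reach index 32. The helper below is a hypothetical userspace stand-in for a bounded bit scan, not the kernel's for_each_set_bit(). It also assumes the cycle counter sits at index 31, which is not stated in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if scanning the first nbits bits of mask visits bit idx. */
static bool scan_visits(uint64_t mask, unsigned int nbits, unsigned int idx)
{
	for (unsigned int i = 0; i < nbits && i < 64; i++) {
		if ((mask & (UINT64_C(1) << i)) && i == idx)
			return true;
	}
	return false;
}
```

With a mask that has bits 31 and 32 set, a 32-bit scan visits 31 but not 32, and only a scan of at least 33 bits reaches counter 32.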

[...]

+
clear_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY,
&kvm->arch.flags);

Why does this need to be cleared? I'd rather we make sure it is never
set in the first place.

KVM_ARM_VCPU_PMU_V3_SET_PMU and
KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY can be set on the same
VCPU. The last KVM_ARM_VCPU_PMU_V3_SET_PMU or
KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY setting will be effective.

A VMM may try to set these attributes to check if the setting is
supported. For example, the RFC QEMU patch first uses
KVM_ARM_VCPU_PMU_V3_SET_PMU to find a compatible PMU that covers all
pCPUs, and then falls back to
KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY. The order of such probing is
up to the VMM.

KVM_ARM_VCPU_PMU_V3_SET_PMU is not a probing mechanism. You must probe
the PMUs by looking in /sys/bus/event_source/devices/, like kvmtool
does.

So there is no reason to support this stuff, and the two flags should
be made mutually exclusive.
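
For sysfs-based probing, the core step is deciding whether a given PMU covers a given CPU. The sketch below assumes that each PMU directory under /sys/bus/event_source/devices/ exposes a `cpus` file in the usual cpulist format (for example "0-3,8"); that file name and format are assumptions here, not something stated in this thread, and `cpulist_contains()` is a hypothetical helper rather than kvmtool's code.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Return true if cpu appears in a cpulist string such as "0-3,8".
 * Parsing stops at the end of the string or at a newline; a malformed
 * list yields false.
 */
static bool cpulist_contains(const char *list, long cpu)
{
	const char *p = list;

	while (*p && *p != '\n') {
		char *end;
		long lo = strtol(p, &end, 10);
		long hi = lo;

		if (end == p)
			return false;	/* no digits where a number was expected */
		if (*end == '-') {
			p = end + 1;
			hi = strtol(p, &end, 10);
			if (end == p)
				return false;
		}
		if (cpu >= lo && cpu <= hi)
			return true;
		p = (*end == ',') ? end + 1 : end;
		if (p == end && *end && *end != '\n')
			return false;	/* unexpected separator */
	}
	return false;
}
```

A VMM could read each candidate PMU's CPU list once and run this test for every pCPU it intends to schedule vCPUs on, instead of probing by setting attributes.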

Thanks for the pointer. I'll make a change to make the flags mutually exclusive and test it with an amended QEMU patch that follows what kvmtool does.

[...]

In conclusion, I find this patch to be rather messy. For a start, it
needs to be split in at least 5 patches:

- at least two for the refactoring
- one for the PMU core changes
- one for the UAPI
- one for documentation

That clarifies the expected granularity of patches. The next version
will be in that layout, perhaps with more patches if additional
changes are needed. Thanks for the guidance.

I'd also like some clarification on how this is intended to work if we
enable FEAT_PMUv3_ICNTR, because the definition seems to be designed
to encompass all fixed-function counters, and I expect this to grow
over time.

Indeed the UAPI was designed to encompass all fixed-function counters
as suggested by Oliver.

To support the UAPI, the implementation avoids hardcoding the
assumption on the fixed counter(s). FEAT_PMUv3_ICNTR will be naturally
supported once the common code is properly updated (i.e., the size of
the event counter bitmask is grown and the corresponding registers are
wired up with a proper check of the feature).

I expect migration will be handled with the conventional register
getters and setters, but please share if you have a concern.

At the very least I want to see some documentation explaining that.

What kind of documentation do you expect? If we change kvm_vcpu_load_pmu() to avoid for_each_set_bit(), there is a good chance that someone will forget to update it when mechanically updating the existing for_each_set_bit() instances, so it is a candidate for documentation. But I don't have a good idea of where to place it either.

Regards,
Akihiko Odaki