Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
From: Jim Mattson
Date: Tue Dec 03 2024 - 20:13:36 EST

On Tue, Dec 3, 2024 at 3:19 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> > (250 Hz by default) to measure their effective CPU frequency. To avoid
> > the overhead of intercepting these frequent MSR reads, allow the guest
> > to read them directly by loading guest values into the hardware MSRs.
> >
> > These MSRs are continuously running counters whose values must be
> > carefully tracked during all vCPU state transitions:
> > - Guest IA32_APERF advances only during guest execution
>
> That's not what this series does though. Guest APERF advances while the vCPU is
> loaded by KVM_RUN, which is *very* different than letting APERF run freely only
> while the vCPU is actively executing in the guest.
>
> E.g. a vCPU that is memory oversubscribed via zswap will account a significant
> amount of CPU time in APERF when faulting in swapped memory, whereas traditional
> file-backed swap will not due to the task being scheduled out while waiting on I/O.

Are you saying that APERF should stop completely outside of VMX
non-root operation / guest mode?

While that is possible, the overhead would be significantly
higher...probably high enough to make it impractical.

> In general, the "why" of this series is missing. What are the use cases you are
> targeting? What are the exact semantics you want to define? *Why* are you
> proposing those exact semantics?

I get the impression that the questions above are largely rhetorical,
and that you would not be happy with the answers anyway, but if you
really are inviting a version 2, I will gladly expound upon the why.

> E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that
> requires userspace exits will not. It's not necessarily wrong for heavy userspace
> I/O to cause observed frequency to drop, but it's not obviously correct either.
>
> The use cases matter a lot for APERF/MPERF, because trying to reason about what's
> desirable for an oversubscribed setup requires a lot more work than defining
> semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or
> less just partitioned. Not to mention the complexity for trying to support all
> potential use cases is likely quite a bit higher.
>
> And if the use case is specifically for slice-of-hardware, hard pinned/partitioned
> VMs, does it matter if the host's view of APERF/MPERF is not accurately captured
> at all times? Outside of maybe a few CPUs running bookkeeping tasks, the only
> workloads running on CPUs should be vCPUs. It's not clear to me that observing
> the guest utilization is outright wrong in that case.

My understanding is that Google Cloud customers have been asking for
this feature for all manner of VM families for years, and most of
those VM families are not slice-of-hardware, since we just launched
our first such offering a few months ago.

> One idea for supporting APERF/MPERF in KVM would be to add a kernel param to
> disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough
> APERF/MPERF if and only if the feature is supported in hardware, but hidden from
> the kernel. I.e. let the system admin gift APERF/MPERF to KVM.

Part of our goal has been to enable guest APERF/MPERF without
impacting the use of host APERF/MPERF, since one of the first things
our support teams look at in response to a performance complaint is
the effective frequencies of the CPUs as reported on the host.

I can explain all of this in excruciating detail, but I'm not really
motivated by your initial response, which honestly seems a bit
hostile. At least you looked at the code, which is a far warmer
reception than I usually get.