Re: [RFC 0/3] Export APICv-related state via binary stats interface

From: Alejandro Jimenez
Date: Tue Apr 23 2024 - 16:44:19 EST




On 4/17/24 05:48, Paolo Bonzini wrote:
On Wed, Apr 17, 2024 at 12:35 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
The hiccup with stats are that they are ABI, e.g. we can't (easily) ditch stats
once they're added, and KVM needs to maintain the exact behavior.

Stats are not ABI---why would they be?

Because they exposed through an ioctl(), and userspace can and will use stats for
functional purposes? Maybe I just had the wrong takeaway from an old thread about
adding a big pile of stats[1], where unfortunately (for me) you weighed in on
whether or not tracepoints are ABI, but not stats.

I think we can agree that:
- you don't want hundreds of stats (Marc's point)
- a large part of the stats are very stable, but just as many (while
useful) depend on details which are very much implementation dependent
- a subset of stats is pretty close to being ABI (e.g. guest_mode),
but others can change meaning depending on processor model, guest
configuration and/or guest type (e.g. all of them affect
interrupt-related stats due to APICv inhibits).

While there are exceptions, the main consumer of stats (but indeed not
the only one) is intended to be the user, not a program. This is the
same as tracepoints, and it's why the introspection bits exist.
(User-friendliness also means that bitmask stats are "ouch"; I guess
we could add support for bit-sized boolean stats is things get out of
control).

For many stats, using them for functional purposes would be
wrong/dumb. You have to be ready for the exact behavior to change even
if the stats remain the same. If userspace doesn't, it's being dumb.
KVM can't be blocked from supporting new features just because they
"break" stats, and shouldn't be blocked from adding useful debugging
stats just because userspace could be dumb.

[ trim ]


That said, I'm definitely not opposed to stats _not_ being ABI, because that would
give us a ton of flexibility. E.g. we have a non-trivial number of internal stats
that are super useful _for us_, but are rather heavy and might not be desirable
for most environments. If stats aren't considered ABI, then I'm pretty sure we
could land some of the more generally useful ones upstream, but off-by-default
and guarded by a Kconfig. E.g. we have a pile of stats related to mmu_lock that
are very helpful in identifying performance issues, but they aren't things I would
want enabled by default.

That would be great indeed.

Not everything makes a good stat but, if in doubt and it's cheap
enough to collect it, go ahead and add it.

Marc raised the (IMO) valid concern that "if it's cheap, add it" will lead to
death by a thousand cuts. E.g. add a few hundred vCPU stats and suddenly vCPUs
consumes an extra KiB or three of memory.

A few APIC stats obviously aren't going to move the needle much, I'm just pointing
out that not everyone agrees that KVM should be hyper permissive when it comes to
adding new stats.

Yeah, that's why I made it conditional to "if in doubt". "Stats are
not ABI" is not a free pass to add anything, also because the truth is
closer than "Stats are generally not ABI but keeping them stable is a
good idea". Many more stats are obviously bad to have upstream, than
there are good ones; and when adding stats it makes sense to consider
their stability but without making it an absolute criterion for
inclusion.

So for this patch, I would weigh advantages to be substantial:
+ APICv inhibits at this point are relatively stable
+ the performance impact is large enough that APICv/AVIC stats _can_
be useful, both boolean and cumulative ones; so for example I'd add an
interrupt_injections stat for unaccelerated injections causing a
vmexit or otherwise hitting lapic.c.

So far it seems there is support for using the stats interface to expose:

- APICv status: (apicv_enabled, boolean, per-vCPU)
- Windows guest using SynIC's AutoEOI: (synic_auto_eoi_used, boolean, per-VM)
- KVM PIT in reinject mode inhibits AVIC: (pit_reinject_mode, boolean, per-VM)
- APICv unaccelerated injections causing a vmexit (i.e. AVIC_INCOMPLETE_IPI,
AVIC_UNACCELERATED_ACCESS, APIC_WRITE): (apicv_unaccelerated_inj, counter,
per-vCPU)

You'll noticed that I framed the AutoEOI usage and PIT reinject mode as their
own stats, as opposed to linking them to APICv inhibits, so it requires a
certain level of knowledge about the APICv limitations to make the connection.
Is this the preferred approach or would we want to associate them more
explicitly to APICv inhibitions?

Alternatively...


But absolutely would not go with a raw bitmask because:
- the exact set of inhibits is subject to change
- super high detail into niche APICv inhibits is unlikely to be useful
- many if not most inhibits are trivially derived from VM configuration


I wasn't able to fully make my case during the last PUCK meeting, so I'll try
here again. I'd like to insist on the opportunity to provide visibility into
APICv state with minimal change by exposing the current inhibit reasons. I am
aware that Sean and Paolo have expressed opposition to it, so I'll only insist
one more time:

Putting aside valid concerns about setting a precedent for exposing a bitmask
via stats, I'd argue in this case the apicv_inhibit_reasons is essentially a
tri-state variable. i.e. we can have a per-vCPU stat with possible values:

0 --> APICv is enabled and active
1 --> APICv is disabled (either by module parameter or lack of HW support)
1 --> APICv is disabled due to other reason(s)

The >1 values must be decoded using the kvm_host.h header, same as we are forced
to do for tracepoints until (shameless plug):
https://lore.kernel.org/all/20240214223554.1033154-1-alejandro.j.jimenez@xxxxxxxxxx/
is hopefully merged.

So as long as the method for decoding is documented in KVM docs (which would I
add as a patch in the series), a single stat can fully expose the APICv state
and be useful for debugging too. It could replace three of the stats proposed:
apicv_enabled, synic_auto_eoi_used, pit_reinject_mode.

Cons:
- The exact set of inhibits is subject ot change / detail into niche APICv
inhibits is unlikely to be useful.

The state described by 0 and 1 values of the stat are unlikely to change, and it
costs nothing to provide additional meaning in the remaining bits. The fact that
inhibits are likely to change/grow makes this approach better, since we avoid
creating stats to describe scenarios that become obsolete e.g. Synic AutoEOI is
fully deprecated, or split irqchip becomes the default..

- many if not most inhibits are trivially derived from VM configuration

This assumes non-trivial and implementation specific knowledge about VMM/guest
defaults, and limitations of the AVIC/APICv host hardware, which have also been
known to change e.g. Milan (unofficially) supports AVIC, but not x2AVIC.

Some inhibits will be set by default currently e.g. QEMU uses in-kernel irqchip
by default since pc-q35-4.0.1, and KVM defaults to reinject mode when creating a
PIT.

ExtINT can be inferred via IRQ window exits which are rare, but on the reverse,
noticing that AVIC is inhibited due to an IRQ window request can draw attention
to problems in that unlikely code path, as I recently found while debugging an
issue on OVMF + Secure Boot. (anecdotal)

VMMs other than QEMU and/or guests other than Linux can be inhibited due to
"less common" reasons than the ones we see today. (hypothetical)

--

I understand concerns about overloading the stats interface, but seems a lost
opportunity and a handicap to usability if we provide a boolean for APICv,
but no information about the reasons why it was not be enabled, specially
since in some real scenarios (e.g. hosts enabling kernel lockdown in
confidentiality mode, normally paired with Secure Boot) that information will
not be accessible via other mechanisms like tracepoints or BPF.

If the idea of exposing the inhibit reasons is still a NACK, then please let me
know if you agree with the 4 proposed stats mentioned at the beginning and I'll
send a v2 that implements them.

Thank you,
Alejandro



Paolo