Re: [RFC PATCH v3 2/2] x86: kvm: svm: advertise ERAPS (larger RSB) support to guests
From: Shah, Amit
Date: Thu Mar 27 2025 - 07:11:30 EST
Thanks for the review, Sean. Apologies for not replying sooner. I
hope you still have some context in mind. I have a question for you
below, but other than that, I'll send out a new rev next week.
On Mon, 2024-12-02 at 10:30 -0800, Sean Christopherson wrote:
> KVM: SVM:
>
> Please read through the relevant maintainer handbooks, there are warts all
> over.
>
> Documentation/process/maintainer-kvm-x86.rst
> Documentation/process/maintainer-tip.rst
>
> And the shortlog is wrong. The patch itself is also broken. KVM
> should (a) add
> support for virtualizing ERAPS and (b) advertise support to
> *userspace*. The
> userspace VMM ultimately decides what to expose/enable for the guest.
Point taken. I interpreted it the way I wanted to, but I can see
what you mean. I'll reword.
[...]
> > While the new default RSB size is used on the host without any
> > software
> > modification necessary, the RSB size for guests is limited to the
> > older
>
> Please describe hardware behavior, and make it abundantly clear that
> the changelog
> is talking about hardware behavior. One of my pet peeves
> (understatement) with
> the APM is that it does a shit job of explaining the actual
> architectural behavior.
OK, more rewording.
Hardware behaviour is to use all 64 entries on the host, and to limit
any guest to 32 entries. If the right VMCB bit is set, as in this
patch, the hardware uses all 64 entries for the guest as well.
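In KVM terms, roughly something like this (just a sketch of the idea;
vmcb_enable_larger_rap() is a made-up helper name, and the real
control is the ALLOW_LARGER_RAP VMCB bit from the APM):

	/*
	 * Sketch: allow this guest to use the full 64-entry RSB by
	 * setting the ALLOW_LARGER_RAP control in its VMCB.  The
	 * helper name and field placement are illustrative only.
	 */
	if (boot_cpu_has(X86_FEATURE_ERAPS))
		vmcb_enable_larger_rap(svm->vmcb);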
> > value (32 entries) for backwards compatibility.
>
> Backwards compatibility with what? And how far back? E.g. have CPUs
> with a RAP
> always had 32 entries?
Yes, all CPUs prior to Zen5 had 32 entries -- that number is even
hard-coded in the definition of RSB_CLEAR_LOOPS, which is used for
RSB stuffing.
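For reference, from arch/x86/include/asm/nospec-branch.h:

	#define RSB_CLEAR_LOOPS		32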
The wording is borrowed a bit from the APM -- especially the "default"
value. With Zen5, the default hardware RSB size is 64. So to expose
the hardware-default size to the guest, this VMCB update is necessary;
otherwise guests use the non-default, trimmed-down 32-entry RSB.
>
> > to also use the default number of entries by setting
>
> "default" is clearly wrong, since the *default* behavior is to use
Yea, I get this is confusing. The hardware default is 64, but the
guest default is 32. My intention with that statement was to contrast
the native default with the guest default. Will reword.
> > the new ALLOW_LARGER_RAP bit in the VMCB.
>
> I detest the "ALLOW_LARGER" name. "Allow" implies the guest somehow
> has a choice.
> And "Larger" implies there's an even larger size
I agree, but this is the name used in the APM. What's the preference:
deviate from the APM naming, or stick with it despite the confusion
it causes?
> > The two cases for backward compatibility that need special handling
> > are
> > nested guests, and guests using shadow paging
>
> Guests don't use shadow paging, *KVM* uses
>
> > (or when NPT is disabled):
>
> "i.e", not "or". "Or" makes it sound like "NPT is disabled" is
> separate case
> from shadow paging.
>
> > For nested guests: the ERAPS feature adds host/guest tagging to
> > entries
> > in the RSB, but does not distinguish between ASIDs. On a nested
> > exit,
> > the L0 hypervisor instructs the hardware (via another new VMCB bit,
>
> I strongly suspect this was copied from the APM. Please don't do
> that. State
> what change is being for *KVM*, not for "the L0 hypervisor". This
> verbiage mixes
> hardware behavior with software behavior, which again is why I hate
> much of the
> APM's wording.
This isn't necessarily from the APM; this is me describing what the hw
does, to set up why KVM needs to do a few other things.
What I want to say is that the hardware is in control of host/guest
tagging for the RSB entries, but it does not tag ASIDs. So KVM
running as the L0 hypervisor needs to take certain steps; KVM running
as an L1 hypervisor does not.
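So the L0-side handling would be roughly this (a sketch only;
vmcb_set_flush_rap() is a made-up helper for whatever the
FLUSH_RAP_ON_VMRUN control ends up looking like):

	/*
	 * On an L2 -> L1 exit, ask hardware to flush the RSB on the
	 * next VMRUN so stale L2 return targets can't poison L1.
	 */
	if (boot_cpu_has(X86_FEATURE_ERAPS))
		vmcb_set_flush_rap(svm->vmcb01.ptr);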
> > FLUSH_RAP_ON_VMRUN) to flush the RSB on the next VMRUN to prevent
> > RSB
> > poisoning attacks from an L2 guest to an L1 guest. With that in
> > place,
> > this feature can be exposed to guests.
>
> ERAPS can also be advertised if nested virtualization is disabled,
> no? I think
> it makes sense to first add support for ERAPS if "!nested", and then
> in a separate
> patch, add support for ERAPS when nested virtualization is enabled.
> Partly so that
> it's easier for readers to understand why nested VMs are special, but
> mainly because
> the nested virtualization support is sorely lacking.
Yea, it makes sense to split these up; I'll try doing that.
Nested virt disabled is the common case, so it makes sense to make that
more obvious as well.
> > For shadow paging guests: do not expose this feature to guests;
> > only
> > expose if nested paging is enabled, to ensure a context switch
> > within
> > a guest triggers a context switch on the CPU -- thereby ensuring
> > guest
> > context switches flush guest RSB entries.
>
> Huh?
OK, this is where I'm slightly confused as well -- so tell me if I'm
wrong:
When EPT/NPT is disabled and KVM uses the shadow MMU, the CR3
register on the CPU holds the PGD of the QEMU process. So if a task
switch happens within the guest, the CR3 on the CPU is not updated;
instead, KVM's shadow MMU routines change the page tables pointed to
by that CR3. Contrast this with the NPT case, where the CPU's CR3
holds the guest PGD directly, and task switches within the guest
update the CPU's CR3.
Am I misremembering and misreading the code?
> > For shadow paging, the CPU's CR3 is not used for guest processes,
> > and hence
> > cannot benefit from this feature.
>
> What does that have to do with anything?
If what I wrote above is true, then since CR3 does not change when
guest tasks switch, the RSB will not be flushed by hardware on guest
task switches, and so exposing ERAPS to such a guest is wrong -- the
guest will relax its own FILL_RETURN_BUFFER routines (via patch 1)
even though the hardware isn't actually doing the flushing for it.
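So the gating would be roughly this in svm_set_cpu_caps() (a sketch of
the intent, not the exact patch; nested is handled separately via
FLUSH_RAP_ON_VMRUN):

	/*
	 * Only advertise ERAPS with NPT, where guest task switches
	 * update the CPU's CR3 and thus trigger the hardware RSB
	 * flush.
	 */
	if (boot_cpu_has(X86_FEATURE_ERAPS) && npt_enabled)
		kvm_cpu_cap_set(X86_FEATURE_ERAPS);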
[...]
> > @@ -1362,10 +1364,22 @@ static inline int __do_cpuid_func(struct
> > kvm_cpuid_array *array, u32 function)
> >  	case 0x80000020:
> >  		entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
> >  		break;
> > -	case 0x80000021:
> > -		entry->ebx = entry->ecx = entry->edx = 0;
> > +	case 0x80000021: {
> > +		unsigned int ebx_mask = 0;
> > +
> > +		entry->ecx = entry->edx = 0;
> >  		cpuid_entry_override(entry, CPUID_8000_0021_EAX);
> > +
> > +		/*
> > +		 * Bits 23:16 in EBX indicate the size of the RSB.
>
> Is this enumeration explicitly tied to ERAPS?
Not tied to ERAPS, but it happens to be defined at the same time as
ERAPS, and there's no other CPUID bit that says 'the RSB size is now
available in these bits'. So yea, the ERAPS CPUID bit is being
overloaded here: its presence confirms that the default RSB size is
now different, and that the hardware flushes the RSB (as in patch 1).
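So a guest that sees ERAPS would read the size along these lines
(sketch, using the bit placement from this patch):

	/* RSB size is enumerated in CPUID 0x80000021 EBX[23:16]. */
	unsigned int rsb_size = (cpuid_ebx(0x80000021) >> 16) & 0xff;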
> > +		 * Expose the value in the hardware to the guest.
>
> __do_cpuid_func() is used to advertise KVM's supported CPUID to host
> userspace,
> not to the guest.
>
> Side topic, what happens when Zen6 adds EVEN_LARGER_RAP? Enumerating
> the size of
> the RAP suggests it's likely to change in the future.
It's informational only for now, and there aren't any plans to extend
the RAP further. But you're right - it could change in the future.
> > +		 */
> > +		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
> > +			ebx_mask |= GENMASK(23, 16);
> > +
> > +		entry->ebx &= ebx_mask;
>
> This is a waste of code and makes it unnecessarily difficult to
> read. Just do:
>
> 	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
> 		entry->ebx &= GENMASK(23, 16);
> 	else
> 		entry->ebx = 0;
Well, I left it this way to make it easier in the future to add more
ebx bits here, but sure - I'll just shorten it for now.
[...]
Amit