Re: [PATCH v2 03/49] KVM: x86: Account for KVM-reserved CR4 bits when passing through CR4 on VMX

From: Maxim Levitsky
Date: Wed Jul 24 2024 - 13:28:41 EST


On Tue, 2024-07-09 at 12:58 -0700, Sean Christopherson wrote:
> On Thu, Jul 04, 2024, Maxim Levitsky wrote:
> > On Fri, 2024-05-17 at 10:38 -0700, Sean Christopherson wrote:
> > > Drop x86.c's local pre-computed cr4_reserved bits and instead fold KVM's
> > > reserved bits into the guest's reserved bits. This fixes a bug where VMX's
> > > set_cr4_guest_host_mask() fails to account for KVM-reserved bits when
> > > deciding which bits can be passed through to the guest. In most cases,
> > > letting the guest directly write reserved CR4 bits is ok, i.e. attempting
> > > to set the bit(s) will still #GP, but not if a feature is available in
> > > hardware but explicitly disabled by the host, e.g. if FSGSBASE support is
> > > disabled via "nofsgsbase".
> > >
> > > Note, the extra overhead of computing host reserved bits every time
> > > userspace sets guest CPUID is negligible. The feature bits that are
> > > queried are packed nicely into a handful of words, and so checking and
> > > setting each reserved bit costs in the neighborhood of ~5 cycles, i.e. the
> > > total cost will be in the noise even if the number of checked CR4 bits
> > > doubles over the next few years. In other words, x86 will run out of CR4
> > > bits long before the overhead becomes problematic.
> >
> > It might be just me, but IMHO this justification is confusing, leading me to
> > belive that maybe the code is on the hot-path instead.
> >
> > The right justification should be just that this code is in
> > kvm_vcpu_after_set_cpuid is usually (*) only called once per vCPU (twice
> > after your patch #1)
>
> Ya. I was trying to capture that even if that weren't true, i.e. even if userspace
> was doing something odd, that the extra cost is irrelevant. I'll expand and reword
> the paragraph to make it clear this isn't a hot path for any sane userspace.
Thank you!

>
> > (*) Qemu also calls it, each time vCPU is hotplugged but this doesn't change
> > anything performance wise.
>
> ...
>
> > > @@ -9831,10 +9826,6 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
> > > if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
> > > kvm_caps.supported_xss = 0;
> > >
> > > -#define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
> > > - cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
> > > -#undef __kvm_cpu_cap_has
> > > -
> > > if (kvm_caps.has_tsc_control) {
> > > /*
> > > * Make sure the user can only configure tsc_khz values that
> >
> > I mostly agree with this patch - caching always carries risks and when it doesn't
> > value performance wise, it should always be removed.
> >
> >
> > However I don't think that this patch fixes a bug as it claims:
> >
> > This is the code prior to this patch:
> >
> > kvm_x86_vendor_init ->
> >
> > r = ops->hardware_setup();
> > svm_hardware_setup
> > svm_set_cpu_caps + kvm_set_cpu_caps
> >
> > -- or --
> >
> > vmx_hardware_setup ->
> > vmx_set_cpu_caps + + kvm_set_cpu_caps
> >
> >
> > # read from 'kvm_cpu_caps'
> > cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
> >
> >
> > AFAIK kvm cpu caps are never touched outside of svm_set_cpu_caps/vmx_hardware_setup
> > (they don't depend on some later post-processing, cpuid, etc).
> >
> > In fact a good refactoring would to make kvm_cpu_caps const after this point,
> > using cast, assert or something like that.
> >
> > This leads me to believe that cr4_reserved_bits is computed correctly.
>
> cr4_reserved_bits is computed correctly. The bug is that cr4_reserved_bits isn't
> consulted by set_cr4_guest_host_mask(), which is what I meant by "KVM-reserved
> bits" in the changelog.

Ah, I see it now.

I also see that set_cr4_guest_host_mask, limits the guest owned bits to a small
whitelist, and none of these bits looks scary, so it all make sense that this
is mostly a theoretical bug, but for sure worth fixing.


>
> > I could be wrong, but then IMHO it is a very good idea to provide an explanation
> > on how this bug can happen.
>
> The first paragraph of the changelog tries to do that, and I'm struggling to come
> up with different wording that makes it more clear what's wrong. Any ideas/suggestions?
>

I also re-read it, and now it all makes sense. I guess I just somehow got fixed on
thinking that cr4_reserved_bits was not computed incorrectly rather than just
not used.
The comment indeed now makes sense to me, so let it be as it is.


Reviewed-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>

Best regards,
Maxim Levitsky