Re: [PATCH] KVM: VMX: Make setup_vmcs_config() preemption disabled

From: Sean Christopherson
Date: Wed Mar 08 2023 - 16:03:33 EST


On Wed, Mar 08, 2023, Huang, Kai wrote:
> On Tue, 2023-03-07 at 09:17 -0800, Sean Christopherson wrote:
> > On Thu, Mar 02, 2023, Huang, Kai wrote:
> > > On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote:
> > > > On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote:
> > > > > Make setup_vmcs_config() preemption disabled so it always performs on
> > > > > the same local cpu.
> > > > >
> > > > > During module loading time, KVM intends to call setup_vmcs_config() to
> > > > > set up the global VMCS configurations on _one_ cpu in hardware_setup(),
> >
> > That may have been the very original intention, but I don't think it has been the
> > true intention for a very long time.
>
> Wondering what's the current intention?

I don't think there's a deliberate "intention" beyond "does it work?". Like many
of the historical bits of KVM x86, I think this is a case of the original authors
_wanting_ to provide certain behavior, but not actually ensuring that behavior in
code.

> > > > > Change the existing setup_vmcs_config() to __setup_vmcs_config() and
> > > > > call the latter directly in the compatibility check code path. Change
> > > > > setup_vmcs_config() to call __setup_vmcs_config() with preemption
> > > > > disabled so __setup_vmcs_config() is always done on the same cpu.
> > > >
> > > > Maybe you can simply disable preemption in hardware_setup() although I
> > > > don't have a strong preference.
> > > >
> > > > nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of
> > > > vmcs_conf, should it be called on the same CPU as setup_vmcs_config()?
> > >
> > > Yes I think so. I missed this :)
> > >
> > > Not sure whether there are other similar places too even outside of
> > > hardware_setup().
> > >
> > > But compatibility check only checks things calculated via setup_vmcs_config()
> > > and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put
> > > hardware_setup() inside preemption disabled.
> >
> > Disabling preemption across hardware_setup() isn't feasible as there are a number
> > of allocations that might sleep. But disabling preemption isn't necessary to
> > ensure setup runs on one CPU, that only requires disabling _migration_. So _if_
> > we want to handle this in the kernel, we could simply do:
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 541982de5762..9126fdf02649 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9470,7 +9470,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
> > int r;
> >
> > mutex_lock(&vendor_module_lock);
> > + migrate_disable();
> > r = __kvm_x86_vendor_init(ops);
> > + migrate_enable();
> > mutex_unlock(&vendor_module_lock);
> >
> > return r;
> >
> >
> > But I'm not convinced we should handle this in the kernel. Many of the checks,
> > especially in SVM, query boot_cpu_has(), not this_cpu_has(), i.e. to truly perform
> > setup on a single CPU, all of those would need to be converted to this_cpu_has().
> >
> > Some of those boot_cpu_has() calls should be changed regardless of whether or not
> > migration is disabled, e.g. kvm_is_svm_supported() is arguably straight up buggy
> > due to cpu_has_svm() checking the boot CPU (I'll fix that by adding a patch after
> > open coding cpu_has_svm() into kvm_is_svm_supported()[*]).
> >
> > But things like kvm_timer_init() should NOT be blindlgly converted to this_cpu_has(),
> > because the teardown path needs to mirror the setup path, e.g. if KVM ended up
> > running on frankenstein hardware where not all CPUs have a constant TSC, KVM could
> > leave a callback dangling and hose the kernel. Obviously such hardware wouldn't
> > correctly run VMs, but crashing the kernel is a touch worse than KVM not working
> > correctly.
> >
> > I'm not totally against converting to this_cpu_has() for the setup, as it would be
> > more intuitive in a lot of ways. But, I don't think pinning the task actually
> > hardens KVM in a meaningful way. If there are any divergences between CPUs, then
> > either KVM will notice before running VMs, e.g. the VMCS sanity checks, or KVM will
> > never notice, e.g. the myriad runtime paths that check boot_cpu_has() (or variants
> > thereof) without sanity checking across CPUs. And if userspace _really_ wants to
> > have guarantees about how setup is performed, e.g. for repeatable, deterministic
> > behavior, then userspace should force loading of KVM to be done on CPU0.
>
> My intention is never for userspace, but simply/purely from compatibility
> check's point of view (see below). Also, I don't think userspace wants to
> guarantee anything -- it just wants to load the KVM module.

That very much depends on the use case. For personal usage of KVM, it's extremely
unlikely that userspace is doing anything remotely sophisticated. But for a more
"formal" deployment, userspace absolutely has its hands all over the system, e.g.
scheduling VMs across systems, monitoring the health of the system, etc. Whether
or not userspaces actually do tightly control loading KVM is another matter...

> It's even arguable that it may be an acceptable behaviour to fail to run any
> VM even loading module was successful.
>
> >
> > So my vote is to leave things as-is (modulo the cpu_has_svm() mess). But maybe add
> > documentation to explain the caveats about loading KVM, and how userspace can
> > mitigate those caveats?
>
> I made this patch because I have some other patches to move VMXON support out of
> KVM in order to support TDX, but so far those patches are not included in that
> series (and I'd like to leave it out if we really don't need it).

Me too. :-)

> In the patch to move VMXON out of KVM, I changed to use per-cpu variable to
> cache the MSR_IA32_VMX_BASIC value and setup the VMXON region when one CPU is
> becoming online. And setup_vmcs_config() is changed to use __this_cpu_read() to
> read the per-cpu MSR value instead of reading from hardware. Obviously w/o
> preempt_disable() or similar __this_cpu_read() can report kernel bug:
>
> printk(KERN_ERR "BUG: using %s%s() in preemptible [%08x] code: %s/%d\n",
> what1, what2, preempt_count() - 1, current->comm, current->pid);
>
> That being said, I am fine to keep existing code, even w/o documenting. We can
> discuss more how to handle when we really want to move VMXON out of KVM (i.e.
> supporting TDX IO?).
>
> Or we can just fix compatibility check part? For instance, move
> setup_vmcs_config() and nested_vmx_setup_ctls_msrs() together in
> hardware_setup() and call preempt_disable() around them?

Eh, the compatibility checks we really care about run in IRQ context, i.e. they're
guaranteed to have a stable CPU. Splitting the _setup_ for the compatibility
checks across multiple CPUs isn't a problem because KVM will still get the right
"answer", i.e. any divergence will be detected (barring _very_ flaky hardware that
might get false negatives anyways).

Don't get me wrong, I agree it's ugly, but I don't want to go halfway. I either
want to guard the whole thing, or nothing, and I can't convince myself that
guarding everything is worthwhile since userspace can (and IMO should) do a better
job.