Re: [PATCH] KVM: VMX: Make setup_vmcs_config() preemption disabled

From: Huang, Kai
Date: Tue Mar 07 2023 - 20:20:48 EST


On Tue, 2023-03-07 at 09:17 -0800, Sean Christopherson wrote:
> On Thu, Mar 02, 2023, Huang, Kai wrote:
> > On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote:
> > > On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote:
> > > > Make setup_vmcs_config() preemption disabled so it always performs on
> > > > the same local cpu.
> > > >
> > > > During module loading time, KVM intends to call setup_vmcs_config() to
> > > > set up the global VMCS configurations on _one_ cpu in hardware_setup(),
>
> That may have been the very original intention, but I don't think it has been the
> true intention for a very long time.

I'm wondering what the current intention is, then?

>
> > > > Change the existing setup_vmcs_config() to __setup_vmcs_config() and
> > > > call the latter directly in the compatibility check code path. Change
> > > > setup_vmcs_config() to call __setup_vmcs_config() with preemption
> > > > disabled so __setup_vmcs_config() is always done on the same cpu.
> > >
> > > Maybe you can simply disable preemption in hardware_setup() although I
> > > don't have a strong preference.
> > >
> > > nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of
> > > vmcs_conf, should it be called on the same CPU as setup_vmcs_config()?
> >
> > Yes I think so. I missed this :)
> >
> > Not sure whether there are other similar places too even outside of
> > hardware_setup().
> >
> > But compatibility check only checks things calculated via setup_vmcs_config()
> > and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put
> > hardware_setup() inside preemption disabled.
>
> Disabling preemption across hardware_setup() isn't feasible as there are a number
> of allocations that might sleep. But disabling preemption isn't necessary to
> ensure setup runs on one CPU, that only requires disabling _migration_. So _if_
> we want to handle this in the kernel, we could simply do:
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 541982de5762..9126fdf02649 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9470,7 +9470,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
>  	int r;
>
>  	mutex_lock(&vendor_module_lock);
> +	migrate_disable();
>  	r = __kvm_x86_vendor_init(ops);
> +	migrate_enable();
>  	mutex_unlock(&vendor_module_lock);
>
>  	return r;
>
>
> But I'm not convinced we should handle this in the kernel. Many of the checks,
> especially in SVM, query boot_cpu_has(), not this_cpu_has(), i.e. to truly perform
> setup on a single CPU, all of those would need to be converted to this_cpu_has().
>
> Some of those boot_cpu_has() calls should be changed regardless of whether or not
> migration is disabled, e.g. kvm_is_svm_supported() is arguably straight up buggy
> due to cpu_has_svm() checking the boot CPU (I'll fix that by adding a patch after
> open coding cpu_has_svm() into kvm_is_svm_supported()[*]).
>
> But things like kvm_timer_init() should NOT be blindly converted to this_cpu_has(),
> because the teardown path needs to mirror the setup path, e.g. if KVM ended up
> running on frankenstein hardware where not all CPUs have a constant TSC, KVM could
> leave a callback dangling and hose the kernel. Obviously such hardware wouldn't
> correctly run VMs, but crashing the kernel is a touch worse than KVM not working
> correctly.
>
> I'm not totally against converting to this_cpu_has() for the setup, as it would be
> more intuitive in a lot of ways. But, I don't think pinning the task actually
> hardens KVM in a meaningful way. If there are any divergences between CPUs, then
> either KVM will notice before running VMs, e.g. the VMCS sanity checks, or KVM will
> never notice, e.g. the myriad runtime paths that check boot_cpu_has() (or variants
> thereof) without sanity checking across CPUs. And if userspace _really_ wants to
> have guarantees about how setup is performed, e.g. for repeatable, deterministic
> behavior, then userspace should force loading of KVM to be done on CPU0.

My intention was never about userspace, but purely about the compatibility
check (see below). Also, I don't think userspace wants to guarantee anything;
it just wants to load the KVM module. It's even arguable that failing to run
any VM is acceptable behaviour even though loading the module succeeded.
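
For reference, the userspace-side pinning mentioned above (forcing the load
onto one CPU) could look roughly like the sketch below. It is only an
illustration: pin_to_cpu() and modprobe_on_cpu() are hypothetical helper names,
and it assumes Linux with glibc's sched_setaffinity() and that the module init
runs in the context of the task doing the load.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for CPU_SET() and sched_setaffinity() */
#endif
#include <sched.h>
#include <unistd.h>

/* Hypothetical helper: restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 on error (errno set by sched_setaffinity()). */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}

/* Hypothetical helper: pin to @cpu, then exec modprobe for @module.
 * The affinity mask is preserved across execve(), so modprobe inherits it.
 * Only returns (with -1) if pinning or the exec itself failed. */
static int modprobe_on_cpu(int cpu, const char *module)
{
	if (pin_to_cpu(cpu))
		return -1;
	execlp("modprobe", "modprobe", module, (char *)NULL);
	return -1;
}
```

E.g. modprobe_on_cpu(0, "kvm_intel") would be the C counterpart of running
modprobe under taskset restricted to CPU0.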

>
> So my vote is to leave things as-is (modulo the cpu_has_svm() mess). But maybe add
> documentation to explain the caveats about loading KVM, and how userspace can
> mitigate those caveats?

I made this patch because I have some other patches that move VMXON support out
of KVM in order to support TDX, but so far those patches are not included in that
series (and I'd like to leave them out if we really don't need them).

In the patch that moves VMXON out of KVM, I switched to a per-cpu variable to
cache the MSR_IA32_VMX_BASIC value and to set up the VMXON region when a CPU
comes online. setup_vmcs_config() is changed to use __this_cpu_read() to read
the cached per-cpu MSR value instead of reading it from hardware. Obviously, w/o
preempt_disable() or similar, __this_cpu_read() can trigger this kernel bug report:

	printk(KERN_ERR "BUG: using %s%s() in preemptible [%08x] code: %s/%d\n",
	       what1, what2, preempt_count() - 1, current->comm, current->pid);

That being said, I am fine with keeping the existing code, even w/o documenting
it. We can discuss how to handle this further once we really want to move VMXON
out of KVM (i.e. supporting TDX IO?).

Or could we fix just the compatibility check part? For instance, put
setup_vmcs_config() and nested_vmx_setup_ctls_msrs() next to each other in
hardware_setup() and wrap them in preempt_disable()/preempt_enable()?
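
To make the ordering in that last suggestion concrete, here is a userspace mock
of the idea. It is not kernel code: every mock_* function and both counters are
stand-ins invented for illustration, and the real setup_vmcs_config() and
nested_vmx_setup_ctls_msrs() take arguments that are omitted here.

```c
/* Userspace mock only; nothing below is the kernel's implementation. */

static int preempt_depth;		/* stand-in for the preempt count */
static int steps_with_preempt_off;	/* setup steps run "non-preemptible" */

static void mock_preempt_disable(void) { preempt_depth++; }
static void mock_preempt_enable(void)  { preempt_depth--; }

/* Stand-in for setup_vmcs_config(): records whether preemption was off. */
static int mock_setup_vmcs_config(void)
{
	if (preempt_depth > 0)
		steps_with_preempt_off++;
	return 0;
}

/* Stand-in for nested_vmx_setup_ctls_msrs(). */
static void mock_nested_vmx_setup_ctls_msrs(void)
{
	if (preempt_depth > 0)
		steps_with_preempt_off++;
}

/* Only the two MSR-reading steps sit inside the non-preemptible region;
 * anything that might sleep (e.g. allocations) would stay outside it. */
static int mock_hardware_setup(int nested)
{
	int r;

	mock_preempt_disable();
	r = mock_setup_vmcs_config();
	if (!r && nested)
		mock_nested_vmx_setup_ctls_msrs();
	mock_preempt_enable();

	return r;
}
```

The point is just that both steps feeding the compatibility check run back to
back on one CPU, while the rest of hardware_setup() stays preemptible.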