On Wed, Apr 17, 2024, Kai Huang wrote:
On Tue, 2024-04-16 at 13:58 -0700, Sean Christopherson wrote:
On Fri, Apr 12, 2024, Kai Huang wrote:
On 12/04/2024 2:03 am, Sean Christopherson wrote:
On Thu, Apr 11, 2024, Kai Huang wrote:
I can certainly follow up with this and generate a reviewable patchset if I
can confirm with you that this is what you want?
Yes, I think it's the right direction. I still have minor concerns about VMX
being enabled while kvm.ko is loaded, which means that VMXON will _always_ be
enabled if KVM is built-in. But after seeing the complexity that is needed to
safely initialize TDX, and after seeing just how much complexity KVM already
has because it enables VMX on-demand (I hadn't actually tried removing that code
before), I think the cost of that complexity far outweighs the risk of "always"
being post-VMXON.
Does always leaving VMXON have any actual damage, given we have emergency
virtualization shutdown?
Being post-VMXON increases the risk of kexec() into the kdump kernel failing.
The tradeoffs that we're trying to balance are: is the risk of kexec() failing
due to the complexity of the emergency VMX code higher than the risk of us breaking
things in general due to taking on a ton of complexity to juggle VMXON for TDX?
After seeing the latest round of TDX code, my opinion is that being post-VMXON
is less risky overall, in no small part because we need that to work anyways for
hosts that are actively running VMs.
How about we only keep VMX always on when TDX is enabled?
Paolo also suggested that forcing VMXON only if TDX is enabled, mostly because
kvm-intel.ko and kvm-amd.ko may be auto-loaded based on MODULE_DEVICE_TABLE(),
which in turn causes problems for out-of-tree hypervisors that want control over
VMX and SVM.
I'm not opposed to the idea, it's the complexity and messiness I dislike. E.g.
the TDX code shouldn't have to deal with CPU hotplug locks, core KVM shouldn't
need to expose nolock helpers, etc. And if we're going to make non-trivial
changes to the core KVM hardware enabling code anyways...
What about this? Same basic idea as before, but instead of unconditionally doing
hardware enabling during module initialization, let TDX do hardware enabling in
a late_hardware_setup(), and then have KVM x86 ensure virtualization is enabled
when creating VMs.
This way, architectures that aren't saddled with out-of-tree hypervisors can do
the dead simple thing of enabling hardware during their initialization sequence,
and the TDX code is much more sane, e.g. invoke kvm_x86_enable_virtualization()
during late_hardware_setup(), and kvm_x86_disable_virtualization() during module
exit (presumably).
+int kvm_x86_enable_virtualization(void)
+{
+ int r;
+
+ guard(mutex)(&vendor_module_lock);
+
+ if (kvm_usage_count++)
+ return 0;
+
+ r = kvm_enable_virtualization();
+ if (r)
+ --kvm_usage_count;
+
+ return r;
+}
+EXPORT_SYMBOL_GPL(kvm_x86_enable_virtualization);
+
+int kvm_enable_virtualization(void)
{
+ int r;
+
+ r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
+ kvm_online_cpu, kvm_offline_cpu);
+ if (r)
+ return r;
+
+ register_syscore_ops(&kvm_syscore_ops);
+
+ /*
+ * Manually undo virtualization enabling if the system is going down.
+ * If userspace initiated a forced reboot, e.g. reboot -f, then it's
+ * possible for an in-flight module load to enable virtualization
+ * after syscore_shutdown() is called, i.e. without kvm_shutdown()
+ * being invoked. Note, this relies on system_state being set _before_
+ * kvm_shutdown(), e.g. to ensure either kvm_shutdown() is invoked
+ * or this CPU observes the impedning shutdown. Which is why KVM uses
+ * a syscore ops hook instead of registering a dedicated reboot
+ * notifier (the latter runs before system_state is updated).
+ */
+ if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
+ system_state == SYSTEM_RESTART) {
+ unregister_syscore_ops(&kvm_syscore_ops);
+ cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
+ return -EBUSY;
+ }
+
return 0;
}