Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module
From: Huang, Kai
Date: Mon Apr 22 2024 - 21:46:01 EST
On Tue, 2024-04-23 at 13:34 +1200, Kai Huang wrote:
> >
> > > > And the intent isn't to catch every possible problem. As with many sanity checks,
> > > > the intent is to detect the most likely failure mode to make triaging and debugging
> > > > issues a bit easier.
> > >
> > > The SEAMCALL will literally return a unique error code to indicate
> > > that the CPU isn't in post-VMXON, or that tdx_cpu_enable() hasn't been
> > > done. I think the error code is already clear enough to pinpoint the
> > > problem (due to these pre-SEAMCALL conditions not being met).
> >
> > No, SEAMCALL #UDs if the CPU isn't post-VMXON. I.e. the CPU doesn't make it to
> > the TDX Module to provide a unique error code, all KVM will see is a #UD.
>
> The #UD is handled by the SEAMCALL assembly code. Please see the
> TDX_MODULE_CALL assembly macro:
>
> .Lseamcall_trap\@:
> /*
> * SEAMCALL caused #GP or #UD. By reaching here RAX contains
> * the trap number. Convert the trap number to the TDX error
> * code by setting TDX_SW_ERROR to the high 32-bits of RAX.
> *
> * Note cannot OR TDX_SW_ERROR directly to RAX as OR instruction
> * only accepts 32-bit immediate at most.
> */
> movq $TDX_SW_ERROR, %rdi
> orq %rdi, %rax
>
> ...
>
> _ASM_EXTABLE_FAULT(.Lseamcall\@, .Lseamcall_trap\@)
> .endif /* \host */
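
(To be clear: after that fixup the caller never sees the #UD itself, only
a software-defined error code synthesized by the assembly above. A
minimal sketch of checking it -- not code from this patch, assuming the
TDX_SEAMCALL_UD/TDX_SEAMCALL_GP encodings in <asm/tdx.h> and the
__seamcall() helper:

	struct tdx_module_args args = {};
	u64 ret;

	ret = __seamcall(TDH_SYS_LP_INIT, &args);
	if (ret == TDX_SEAMCALL_UD)
		/* #UD: this CPU is not in post-VMXON state */
		pr_err("SEAMCALL failed: CPU not in VMX operation\n");
	else if (ret == TDX_SEAMCALL_GP)
		/* #GP: e.g., TDX not enabled by the BIOS */
		pr_err("SEAMCALL failed: TDX not enabled by BIOS?\n");

So the error code does pinpoint the #UD case; it's just synthesized by
the exception fixup rather than returned by the TDX module.)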
>
> >
> > > > > Btw, I noticed there's another problem: currently tdx_cpu_enable()
> > > > > actually requires IRQs to be disabled. Again, it was implemented
> > > > > based on the assumption that it would be invoked via both
> > > > > on_each_cpu() and kvm_online_cpu().
> > > > >
> > > > > It was also implemented with the consideration that it could be
> > > > > called by multiple in-kernel TDX users in parallel, via both SMP
> > > > > call and normal context, so it simply requires the caller to make
> > > > > sure it is called with IRQ disabled so that it can be IRQ safe (it
> > > > > uses a percpu variable to track whether TDH.SYS.LP.INIT has been
> > > > > done for the local cpu, similar to the hardware_enabled percpu
> > > > > variable).
> > > >
> > > > Is this an actual problem, or is it just something that would need to be
> > > > updated in the TDX code to handle the change in direction?
> > >
> > > For now it isn't, because KVM is the sole user, and in KVM
> > > hardware_enable_all() and kvm_online_cpu() use the kvm_lock mutex to
> > > make hardware_enable_nolock() IPI safe.
> > >
> > > I am not sure how TDX/SEAMCALL will be used in TDX Connect.
> > >
> > > However, I needed to consider KVM as a user, so I decided to simply
> > > require it to be called with IRQ disabled so I could know it is IRQ
> > > safe.
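
(For reference, the idea is roughly the sketch below -- simplified, not
the actual patch, with tdx_init_cpu() as a hypothetical helper that does
the actual TDH.SYS.LP.INIT SEAMCALL:

	static DEFINE_PER_CPU(bool, tdx_lp_initialized);

	int tdx_cpu_enable(void)
	{
		int ret;

		/*
		 * The caller must disable IRQs so the per-CPU check
		 * below cannot race with an IPI doing the same thing
		 * on this CPU.
		 */
		lockdep_assert_irqs_disabled();

		if (__this_cpu_read(tdx_lp_initialized))
			return 0;

		ret = tdx_init_cpu();	/* hypothetical: TDH.SYS.LP.INIT */
		if (ret)
			return ret;

		__this_cpu_write(tdx_lp_initialized, true);
		return 0;
	}
)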
> > >
> > > Back to the current tdx_enable() and tdx_cpu_enable(), my personal
> > > preference is, of course, to keep the existing way, that is:
> > >
> > > During module load:
> > >
> > > cpus_read_lock();
> > > tdx_enable();
> > > cpus_read_unlock();
> > >
> > > and in kvm_online_cpu():
> > >
> > > local_irq_save();
> > > tdx_cpu_enable();
> > > local_irq_restore();
> > >
> > > But given KVM is the sole user now, I am also fine with changing this
> > > if you believe it is not acceptable.
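
(With the details filled in, i.e. the flags argument to local_irq_save()
and the error handling, the two call sites would look like the sketch
below -- calling convention only:

	/* during module load */
	cpus_read_lock();
	ret = tdx_enable();
	cpus_read_unlock();

	/* in kvm_online_cpu() */
	unsigned long flags;

	local_irq_save(flags);
	ret = tdx_cpu_enable();
	local_irq_restore(flags);
)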
> >
> > Looking more closely at the code, tdx_enable() needs to be called under
> > cpu_hotplug_lock to prevent *unplug*, i.e. to prevent the last CPU on a package
> > from being offlined. I.e. that part's not optional.
>
> Yeah. We can say that. I almost forgot this :-)
>
> >
> > And the root of the problem/confusion is that the APIs provided by the core kernel
> > are weird, which is really just a polite way of saying they are awful :-)
>
> Well, apologies for that :-)
>
> >
> > There is no reason to rely on the caller to take cpu_hotplug_lock, and definitely
> > no reason to rely on the caller to invoke tdx_cpu_enable() separately from invoking
> > tdx_enable(). I suspect they got that way because of KVM's unnecessarily complex
> > code, e.g. if KVM is already doing on_each_cpu() to do VMXON, then it's easy enough
> > to also do TDH_SYS_LP_INIT, so why do two IPIs?
>
> The main reason is that we relaxed TDH.SYS.LP.INIT so it can be called
> _after_ TDX module initialization.
>
> Previously, TDH.SYS.LP.INIT had to be done on *ALL* CPUs that the
> platform has (i.e., cpu_present_mask) right after TDH.SYS.INIT and
> before any other SEAMCALL. This didn't quite work with (kernel software)
> CPU hotplug, and it had problems dealing with things like the SMT
> disable mitigation:
>
> https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@xxxxxxxxx/T/#mf42fa2d68d6b98edcc2aae11dba3c2487caf3b8f
>
> So the x86 maintainers asked for this to be changed. The original
> proposal was to eliminate TDH.SYS.INIT and TDH.SYS.LP.INIT entirely:
>
> https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@xxxxxxxxx/T/#m78c0c48078f231e92ea1b87a69bac38564d46469
>
> But that turned out not to be feasible, and the result was that we
> relaxed the requirement to allow TDH.SYS.LP.INIT to be called after
> module initialization.
>
> So we need a separate tdx_cpu_enable() for that.
Btw, the ideal (or probably the final) plan is to handle tdx_cpu_enable()
in TDX's own CPU hotplug callback in the core-kernel and hide it from all
other in-kernel TDX users.
Specifically:
1) That callback, e.g., tdx_online_cpu(), will be placed _before_ any
in-kernel TDX user's callback, such as KVM's.
2) In tdx_online_cpu(), we do VMXON + tdx_cpu_enable() + VMXOFF, and
return an error in case of any failure to prevent that CPU from going
online.
That makes sure that, if TDX is supported by the platform, all online
CPUs are basically guaranteed to be ready to issue SEAMCALL (of course,
the in-kernel TDX user still needs to do VMXON for that, but that's the
TDX user's responsibility).
But that obviously requires moving VMXON into the core-kernel.
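
For illustration only, such a callback could look roughly like the sketch
below; tdx_online_cpu(), this_cpu_vmxon_pa() and cpu_vmxon()/cpu_vmxoff()
living in the core-kernel are all assumptions here, not existing code:

	static int tdx_online_cpu(unsigned int cpu)
	{
		unsigned long flags;
		int ret;

		if (!platform_tdx_enabled())
			return 0;

		local_irq_save(flags);
		/* assumes VMXON support has moved to the core-kernel */
		ret = cpu_vmxon(this_cpu_vmxon_pa());
		if (!ret) {
			ret = tdx_cpu_enable();
			cpu_vmxoff();
		}
		local_irq_restore(flags);

		/* a non-zero return prevents this CPU from going online */
		return ret;
	}

registered via cpuhp_setup_state() at a hotplug state that runs before
KVM's CPUHP_AP_KVM_ONLINE callback.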
Currently, exporting tdx_cpu_enable() as a separate API and requiring KVM
to call it explicitly is a temporary solution.
That being said, we could do tdx_cpu_enable() inside tdx_enable(), but I
don't see that as a better idea.
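
(For completeness, that alternative would look something like the sketch
below -- folding the per-CPU part into tdx_enable() via on_each_cpu(),
i.e. the extra IPI round mentioned earlier, with __tdx_enable() as a
hypothetical helper doing the module initialization proper:

	static void tdx_cpu_enable_smp(void *info)
	{
		atomic_t *err = info;

		/* on_each_cpu() runs this with IRQs disabled */
		if (tdx_cpu_enable())
			atomic_set(err, -EIO);
	}

	int tdx_enable(void)
	{
		atomic_t err = ATOMIC_INIT(0);
		int ret;

		cpus_read_lock();
		on_each_cpu(tdx_cpu_enable_smp, &err, 1);
		ret = atomic_read(&err);
		if (!ret)
			ret = __tdx_enable();	/* hypothetical */
		cpus_read_unlock();

		return ret;
	}

It also wouldn't cover CPUs that come online later, so KVM would still
need to call tdx_cpu_enable() from its online callback anyway.)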