Re: [PATCH] x86/tdx, KVM: fix HKID leak when kexec is initiated with active TDs

From: Edgecombe, Rick P

Date: Wed Apr 22 2026 - 10:29:33 EST

On Wed, 2026-04-22 at 06:14 -0700, Sean Christopherson wrote:
> On Wed, Apr 22, 2026, Robert Nowicki wrote:
> > When kexec is initiated while TDs are running, vCPU threads can be
> > mid-TDH.VP.ENTER on other CPUs when tdx_shutdown() fires. The TDX
> > module rejects TDH.MNG.VPFLUSHDONE for a VP in RUNNING state, leaving
> > the HKID in a leaked state:
> >
> >    kvm_intel: tdh_mng_vpflushdone() failed. HKID 33 is leaked.
> >
> > Fix this by introducing a quiescing flag set at the start of
> > tdx_shutdown(). KVM's tdx_vcpu_run() checks the flag and returns
> > EXIT_FASTPATH_NONE before attempting TDH.VP.ENTER. After setting the
> > flag, tdx_shutdown() calls on_each_cpu(tdx_seam_sync) with wait=1 to
> > ensure any CPU currently inside TDH.VP.ENTER has exited SEAM before
> > tdx_sys_disable() is called.
> >
> > Fixes: 58171ae22e11 ("x86/tdx: Disable the TDX module during kexec and kdump")
>
> Please don't post seemingly standalone patches for code that hasn't yet been
> merged, it's quite confusing.

+1. Robert, we try to coordinate public Linux TDX work internally before posting
because there is so much of it that it gets confusing for the community and
maintainers. Please check in with the Linux TDX developers before posting TDX
patches so we can have a cohesive effort.

>
> > u64 tdh_vp_enter(struct tdx_vp *vp, struct tdx_module_args *args);
> > u64 tdh_mng_addcx(struct tdx_td *td, struct page *tdcs_page);
> > @@ -206,6 +207,7 @@ static inline u32 tdx_get_nr_guest_keyids(void) { return 0; }
> > static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
> > static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
> > static inline void tdx_sys_disable(void) { }
> > +static inline bool tdx_kexec_quiescing(void) { return false; }
> > #endif /* CONFIG_INTEL_TDX_HOST */
> >
> > #endif /* !__ASSEMBLER__ */
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 50a5cfdbd33e..2d658db7700d 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1053,6 +1053,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> >    struct vcpu_tdx *tdx = to_tdx(vcpu);
> >    struct vcpu_vt *vt = to_vt(vcpu);
> >
> > + if (unlikely(tdx_kexec_quiescing()))

There is essentially an existing kexec race, where vmxoff happens while SEAMCALLs
can still be issued. It goes back to the first TDX kexec support (i.e. it was not
introduced by the vmxon refactor). VMX KVM has some spurious logic to handle
something similar for normal VMs, but TDX doesn't.

I don't see why this TDH.MNG.VPFLUSHDONE case is special. If the TDX module is
shut down and the old kernel is going away, how is anything leaked other than the
normal type of leakage that happens during kexec? So I think maybe this is just
the known vmxoff SEAMCALL race, with the specific case observed generating a
message about leaking.

Also, I'm not sure how handling VP.ENTER would prevent the VPFLUSHDONE call from
hitting an error and emitting the same message, if the TDX module is shut down...

>
> Requiring KVM to check a global on every entry is pretty ugly, especially
> since this is for a very rare scenario (in terms of number of entries). And
> forcing KVM to do a CALL+RET to check an almost-never-set flag is especially
> ugly.
>
> Why not handle this entirely in tdx_shutdown_cpu()? E.g. have the last CPU
> through disable TDX, and hold all the CPUs hostage until that's done. It's not
> the prettiest thing in the world, but it's entirely self-contained.
>
> static void tdx_shutdown_cpu(void *__nr_cpus_remaining)
> {
> 	atomic_t *nr_cpus_remaining = __nr_cpus_remaining;
>
> 	/* The last CPU through disables TDX, then releases the others. */
> 	if (!atomic_add_unless(nr_cpus_remaining, -1, 1)) {
> 		tdx_sys_disable();
> 		atomic_set(nr_cpus_remaining, 0);
> 	}
>
> 	x86_virt_put_ref(X86_FEATURE_VMX);
>
> 	/* Hold every CPU here until the count reaches zero. */
> 	while (atomic_read(nr_cpus_remaining))
> 		cpu_relax();
> }
>
> static void tdx_shutdown(void *ign)
> {
> 	atomic_t nr_cpus_remaining = ATOMIC_INIT(num_online_cpus());
>
> 	on_each_cpu(tdx_shutdown_cpu, &nr_cpus_remaining, 1);
> }
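
For anyone who wants to poke at the rendezvous pattern quoted above outside the
kernel, here is a minimal userspace sketch using C11 atomics and pthreads. It is
only a model, not kernel code: struct rendezvous, add_unless(), cpu_thread() and
run_rendezvous() are hypothetical names, threads stand in for CPUs, and a counter
increment stands in for tdx_sys_disable(). It assumes the wait loop is meant to
spin until the counter reaches zero.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model state; none of these names exist in the kernel. */
struct rendezvous {
	atomic_int nr_remaining;	/* counts down from nr_cpus, 0 == released */
	atomic_int nr_disables;		/* times the one-shot step ran */
	atomic_int nr_released;		/* threads that got past the wait loop */
};

/* Userspace analogue of atomic_add_unless(): add @a to *v unless *v == @u. */
static bool add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	do {
		if (old == u)
			return false;
	} while (!atomic_compare_exchange_weak(v, &old, old + a));

	return true;
}

static void *cpu_thread(void *arg)
{
	struct rendezvous *r = arg;

	/* Only the last arrival sees the counter at 1 and does the work. */
	if (!add_unless(&r->nr_remaining, -1, 1)) {
		atomic_fetch_add(&r->nr_disables, 1);	/* stand-in for tdx_sys_disable() */
		atomic_store(&r->nr_remaining, 0);
	}

	/* Everyone is held until the last arrival has finished. */
	while (atomic_load(&r->nr_remaining))
		sched_yield();

	atomic_fetch_add(&r->nr_released, 1);
	return NULL;
}

/* Run nr_cpus threads through the rendezvous; returns 0 on success. */
static int run_rendezvous(struct rendezvous *r, int nr_cpus)
{
	pthread_t *threads = calloc(nr_cpus, sizeof(*threads));
	int i, started = 0, ret = 0;

	if (!threads)
		return -1;

	atomic_init(&r->nr_remaining, nr_cpus);
	atomic_init(&r->nr_disables, 0);
	atomic_init(&r->nr_released, 0);

	for (i = 0; i < nr_cpus; i++) {
		if (pthread_create(&threads[i], NULL, cpu_thread, r)) {
			/* Release the threads already spinning, then bail. */
			atomic_store(&r->nr_remaining, 0);
			ret = -1;
			break;
		}
		started++;
	}

	for (i = 0; i < started; i++)
		pthread_join(threads[i], NULL);

	free(threads);
	return ret;
}
```

Calling run_rendezvous() with, say, 8 threads should leave nr_disables at 1 and
nr_released at 8. The model only covers the counting; it says nothing about what
a SEAMCALL sees once another CPU has already done vmxoff.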

After vmxoff happens, the SEAMCALLs will just meet other errors. The wrappers
will morph the vmxoff condition into a SW error that much of the TDX code can't
handle either. So it doesn't help the problem I'm afraid.

It would be my preference to fix the existing issue separately from this series.
This series makes kexec way more functional for TDX, and the worst case AFAICT
is a splat in an otherwise successful kexec. So it's a non-critical and existing
problem.

Kai and I were previously kicking around some ideas about the general case
problem. It somehow missed our cleanup list, but I just added it.