Re: [PATCH v2 2/5] x86/virt/tdx: Pull kexec cache flush logic into arch/x86
From: Sean Christopherson
Date: Tue Mar 31 2026 - 15:22:42 EST
On Mon, Mar 23, 2026, Vishal Verma wrote:
> From: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
>
> KVM performs some required cache flushing early in the kexec path in
> order to avoid some long-standing races that can occur later in the
> operation. Until recently, VMXOFF was handled within KVM. Since VMX
> must be enabled to make a SEAMCALL, KVM had the best per-CPU scoped
> operation to plug the flushing into, so the flush was kicked off from
> there.
>
> This early kexec cache flushing in KVM happens via a syscore shutdown
> callback. Now that VMX enablement control has moved to arch/x86, which has
> grown its own syscore shutdown callback, it no longer makes sense for the
> flush to live in KVM. It fits better with the code that manages TDX
> enablement.
>
> In addition, future changes will add a SEAMCALL that happens immediately
> before VMXOFF, which means the cache flush in KVM will be too late to
> flush the cache before the last SEAMCALL. So move it to the newly added TDX
> arch/x86 syscore shutdown handler.
>
> Since tdx_cpu_flush_cache_for_kexec() is no longer needed by KVM, make it
> static and remove the export. Since it is also not part of an operation
> spread across disparate components, remove the redundant comments and
> verbose naming.
>
> In the existing KVM-based code, CPU offline also funnels through
> tdx_cpu_flush_cache_for_kexec(). So the centralization to the arch/x86
> syscore shutdown callback elides this CPU-offline-time behavior. However,
> WBINVD is already generally done at CPU offline as a matter of course. So
> don't bother adding TDX-specific logic for this, and rely on the normal
> WBINVD to handle it.
>
> Acked-by: Kai Huang <kai.huang@xxxxxxxxx>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
> Signed-off-by: Vishal Verma <vishal.l.verma@xxxxxxxxx>
Ignoring the potential fixup needed for the existing bug...
Acked-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
> arch/x86/include/asm/tdx.h | 6 ------
> arch/x86/kvm/vmx/tdx.c | 10 ----------
> arch/x86/virt/vmx/tdx/tdx.c | 39 ++++++++++++++++++++-------------------
> 3 files changed, 20 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 2917b3451491..7674fc530090 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -205,11 +205,5 @@ static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
> static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
> #endif /* CONFIG_INTEL_TDX_HOST */
>
> -#ifdef CONFIG_KEXEC_CORE
> -void tdx_cpu_flush_cache_for_kexec(void);
> -#else
> -static inline void tdx_cpu_flush_cache_for_kexec(void) { }
> -#endif
> -
> #endif /* !__ASSEMBLER__ */
> #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b7264b533feb..50a5cfdbd33e 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -440,16 +440,6 @@ void tdx_disable_virtualization_cpu(void)
> tdx_flush_vp(&arg);
> }
> local_irq_restore(flags);
> -
> - /*
> - * Flush cache now if kexec is possible: this is necessary to avoid
> - * having dirty private memory cachelines when the new kernel boots,
> - * but WBINVD is a relatively expensive operation and doing it during
> - * kexec can exacerbate races in native_stop_other_cpus(). Do it
> - * now, since this is a safe moment and there is going to be no more
> - * TDX activity on this CPU from this point on.
> - */
> - tdx_cpu_flush_cache_for_kexec();
> }
>
> #define TDX_SEAMCALL_RETRIES 10000
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index cb9b3210ab71..0802d0fd18a4 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -224,8 +224,28 @@ static int tdx_offline_cpu(unsigned int cpu)
> return 0;
> }
>
> +static void tdx_cpu_flush_cache(void)
> +{
> + lockdep_assert_preemption_disabled();
> +
> + if (!this_cpu_read(cache_state_incoherent))
> + return;
> +
> + wbinvd();
> + this_cpu_write(cache_state_incoherent, false);
> +}
> +
> static void tdx_shutdown_cpu(void *ign)
> {
> + /*
> + * Flush cache now if kexec is possible: this is necessary to avoid
> + * having dirty private memory cachelines when the new kernel boots,
> + * but WBINVD is a relatively expensive operation and doing it during
> + * kexec can exacerbate races in native_stop_other_cpus(). Do it
> + * now, since this is a safe moment and there is going to be no more
> + * TDX activity on this CPU from this point on.
> + */
> + tdx_cpu_flush_cache();
> x86_virt_put_ref(X86_FEATURE_VMX);
> }
>
> @@ -1920,22 +1940,3 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
> }
> EXPORT_SYMBOL_FOR_KVM(tdh_phymem_page_wbinvd_hkid);
> -
> -#ifdef CONFIG_KEXEC_CORE
> -void tdx_cpu_flush_cache_for_kexec(void)
> -{
> - lockdep_assert_preemption_disabled();
Is there a pre-existing bug here that gets propagated to tdx_shutdown_cpu()? When
called from kvm_offline_cpu(), preemption won't be fully disabled, but per-CPU
accesses are fine because the task is pinned to the target CPU.
See https://lore.kernel.org/all/aUVx20ZRjOzKgKqy@xxxxxxxxxx
> -
> - if (!this_cpu_read(cache_state_incoherent))
> - return;
> -
> - /*
> - * Private memory cachelines need to be clean at the time of
> - * kexec. Write them back now, as the caller promises that
> - * there should be no more SEAMCALLs on this CPU.
> - */
> - wbinvd();
> - this_cpu_write(cache_state_incoherent, false);
> -}
> -EXPORT_SYMBOL_FOR_KVM(tdx_cpu_flush_cache_for_kexec);
> -#endif
>
> --
> 2.53.0
>