Re: [PATCH v5 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs

From: Chao Gao
Date: Fri Aug 01 2025 - 04:31:35 EST


On Tue, Jul 29, 2025 at 12:28:41AM +1200, Kai Huang wrote:
>On TDX platforms, during kexec, the kernel needs to make sure there are
>no dirty cachelines of TDX private memory before booting to the new
>kernel to avoid silent memory corruption to the new kernel.
>
>During kexec, the kexec-ing CPU first invokes native_stop_other_cpus()
>to stop all remote CPUs before booting to the new kernel. The remote
>CPUs will then execute stop_this_cpu() to stop themselves.
>
>The kernel has a percpu boolean to indicate whether the cache of a CPU
>may be in an incoherent state. In stop_this_cpu(), the kernel does WBINVD
>if that percpu boolean is true.
>
>TDX turns on that percpu boolean on a CPU when the kernel does SEAMCALL.
>This makes sure the caches will be flushed during kexec.
>
>However, native_stop_other_cpus() and stop_this_cpu() have a "race"
>that is extremely rare but could cause the system to hang.
>
>Specifically, native_stop_other_cpus() first sends a normal reboot IPI
>to remote CPUs and waits one second for them to stop. If that times
>out, native_stop_other_cpus() then sends NMIs to remote CPUs to stop
>them.
>
>The aforementioned race happens when NMIs are sent. Doing WBINVD in
>stop_this_cpu() makes each CPU take longer to stop and increases
>the chance of the race happening.
>
>Explicitly flush the cache in tdx_disable_virtualization_cpu(), after
>which no more TDX activity can happen on this CPU. This moves the
>WBINVD to an earlier stage than stop_this_cpu(), avoiding a possibly
>lengthy operation at a time where it could cause this race.
>
>Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx>
>Acked-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>Tested-by: Farrah Chen <farrah.chen@xxxxxxxxx>
>Reviewed-by: Binbin Wu <binbin.wu@xxxxxxxxxxxxxxx>

Flushing cache after disabling virtualization looks clean. So,

Reviewed-by: Chao Gao <chao.gao@xxxxxxxxx>
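
For anyone following the thread, the ordering argument in the commit
message can be modeled in userspace roughly as below. This is only a
sketch, not the patch itself: every model_* name, NR_CPUS, and the
counters are hypothetical, the real WBINVD is replaced by a counter,
and it assumes the early flush also clears the percpu boolean so that
the later stop path has nothing left to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4 /* hypothetical CPU count for this model */

/* Models the percpu "cache may be incoherent" boolean. */
static bool cache_incoherent[NR_CPUS];

/* Count modeled WBINVDs by the path that issued them. */
static int wbinvd_early[NR_CPUS]; /* disable-virtualization path */
static int wbinvd_late[NR_CPUS];  /* stop_this_cpu()-like path */

/* A modeled SEAMCALL marks the CPU's cache as possibly incoherent. */
static void model_seamcall(size_t cpu)
{
	assert(cpu < NR_CPUS);
	cache_incoherent[cpu] = true;
}

/*
 * Flush only if the flag is set, then clear it (assumption: the
 * flush helper clears the flag so a later path skips the flush).
 */
static void model_flush_if_needed(size_t cpu, int *counter)
{
	assert(cpu < NR_CPUS);
	if (cache_incoherent[cpu]) {
		counter[cpu]++; /* stands in for WBINVD */
		cache_incoherent[cpu] = false;
	}
}

/* Early flush: no more modeled SEAMCALLs are expected after this. */
static void model_disable_virtualization_cpu(size_t cpu)
{
	model_flush_if_needed(cpu, wbinvd_early);
}

/* Late path: flushes only if the flag is still set. */
static void model_stop_this_cpu(size_t cpu)
{
	model_flush_if_needed(cpu, wbinvd_late);
}
```

In this model, a CPU that goes through
model_disable_virtualization_cpu() before model_stop_this_cpu() does
its flush in the early path and none in the stop path, while a CPU
that skips the early step still gets flushed at stop time.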