Re: [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier

From: Binbin Wu
Date: Tue Jul 01 2025 - 02:10:06 EST




On 6/26/2025 6:48 PM, Kai Huang wrote:
On TDX platforms, during kexec, the kernel needs to make sure there are
no dirty cachelines of TDX private memory before booting to the new
kernel, to avoid silent memory corruption in the new kernel.

During kexec, the kexec-ing CPU first invokes native_stop_other_cpus()
to stop all remote CPUs before booting to the new kernel. The remote
CPUs will then execute stop_this_cpu() to stop themselves.

The kernel has a percpu boolean to indicate whether the cache of a CPU
may be in an incoherent state. In stop_this_cpu(), the kernel does
WBINVD if that percpu boolean is true.

TDX turns on that percpu boolean on a CPU when the kernel does SEAMCALL.
This makes sure the caches will be flushed during kexec.
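
As a rough userspace illustration of the flag-gated flush described above, the sketch below models the percpu boolean as an array and WBINVD as a counter. All names (NR_CPUS value, cache_incoherent, model_seamcall, model_stop_this_cpu) are hypothetical stand-ins, not the kernel's actual symbols, and the model says nothing about whether the real code clears the flag afterwards.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical CPU count for this model */

/* Models the percpu "cache may be incoherent" boolean. */
static bool cache_incoherent[NR_CPUS];
/* Counts how many times each modeled CPU "executed WBINVD". */
static int wbinvd_count[NR_CPUS];

/* Models a SEAMCALL on @cpu: the cache may now hold dirty private lines. */
static void model_seamcall(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cache_incoherent[cpu] = true;
}

/* Models the flush step of stop_this_cpu(): flush only if flagged. */
static void model_stop_this_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (cache_incoherent[cpu])
		wbinvd_count[cpu]++;	/* stands in for WBINVD */
}
```

In this model, a CPU that never issued a SEAMCALL skips the flush entirely, which is the point of keeping the state per CPU.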

However, native_stop_other_cpus() and stop_this_cpu() have a "race"
which is extremely rare but could cause the system to hang.

Specifically, native_stop_other_cpus() first sends a normal reboot IPI
to remote CPUs and waits one second for them to stop. If that times
out, native_stop_other_cpus() then sends NMIs to remote CPUs to stop
them.
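
The two-stage stop described above can be sketched in userspace as follows. This is only a model under stated assumptions: the per-CPU stop latencies are invented inputs, and model_stop_other_cpus() and enum stop_method are hypothetical names, not kernel APIs. It shows why any extra time spent in the stop path (such as a WBINVD) pushes more CPUs past the timeout and into the NMI stage.

```c
#include <assert.h>
#include <stddef.h>

/* How a modeled remote CPU ended up being stopped. */
enum stop_method { STOPPED_BY_IPI, STOPPED_BY_NMI };

/*
 * Hypothetical model of the two-stage stop. stop_latency_ms[i] is an
 * assumed time for CPU i to stop after the reboot IPI. CPUs that have
 * not stopped within timeout_ms are recorded as needing an NMI.
 * Returns the number of CPUs that fell through to the NMI stage.
 */
static size_t model_stop_other_cpus(const unsigned int *stop_latency_ms,
				    size_t ncpus, unsigned int timeout_ms,
				    enum stop_method *method)
{
	size_t nmi_count = 0;

	assert(stop_latency_ms != NULL && method != NULL);

	for (size_t i = 0; i < ncpus; i++) {
		if (stop_latency_ms[i] <= timeout_ms) {
			method[i] = STOPPED_BY_IPI;
		} else {
			method[i] = STOPPED_BY_NMI;
			nmi_count++;
		}
	}
	return nmi_count;
}
```

With a 1000 ms timeout, a modeled CPU that needs 1500 ms to stop is the only one that reaches the NMI stage; shortening its stop path would keep it in the IPI stage.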

The aforementioned race happens when NMIs are sent. Doing WBINVD in
stop_this_cpu() makes each CPU take longer to stop and increases the
chance of hitting the race.

Register a reboot notifier in KVM to explicitly flush caches upon
receiving the reboot notification (e.g., during kexec) for TDX. This
moves the WBINVD to an earlier stage than stop_this_cpu(), avoiding a
possibly lengthy operation at a time where it could cause this race.

Two nits below.

Reviewed-by: Binbin Wu <binbin.wu@xxxxxxxxxxxxxxx>


Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx>
Acked-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
Tested-by: Farrah Chen <farrah.chen@xxxxxxxxx>
---

v2 -> v3:
- Update changelog to address Paolo's comments and add Paolo's Ack:
https://lore.kernel.org/lkml/3a7c0856-6e7b-4d3d-b966-6f17f1aca42e@xxxxxxxxxx/

---
arch/x86/include/asm/tdx.h | 3 +++
arch/x86/kvm/vmx/tdx.c | 45 +++++++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.c | 9 ++++++++
3 files changed, 57 insertions(+)

[...]
+
+static int tdx_reboot_notify(struct notifier_block *nb, unsigned long code,
+ void *unused)
+{
+ /*
+ * Flush cache for all CPUs upon the reboot notifier. This
+ * avoids having to do WBINVD in stop_this_cpu() during kexec.
+ *
+ * Kexec calls native_stop_other_cpus() to stop remote CPUs
+ * before booting to new kernel, but that code has a "race"
+ * when the normal REBOOT IPI timesout and NMIs are sent to

"timesout" should be "times out".

+ * remote CPUs to stop them. Doing WBINVD in stop_this_cpu()
+ * could potentially increase the posibility of the "race".

s/posibility/possibility/

+ */
+ if (code == SYS_RESTART)
+ on_each_cpu(smp_func_cpu_flush_cache, NULL, 1);
+ return NOTIFY_DONE;
+}
+
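
For readers unfamiliar with notifier callbacks, here is a minimal userspace sketch of the dispatch pattern the quoted function relies on: a chain of callbacks is invoked with an event code, and this callback acts only on the restart code. Every name prefixed with model_ or MODEL_ is hypothetical, and the counter merely stands in for the on_each_cpu() flush; this is not the kernel's notifier implementation.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NOTIFY_DONE 0	/* "not interested / nothing more to say" */

/* Hypothetical event codes passed down the modeled chain. */
enum model_reboot_code {
	MODEL_SYS_RESTART = 1,
	MODEL_SYS_HALT,
	MODEL_SYS_POWER_OFF,
};

/* One entry in a singly linked chain of callbacks. */
struct model_notifier {
	int (*call)(struct model_notifier *nb, unsigned long code, void *data);
	struct model_notifier *next;
};

/* Counts how often the modeled cache flush was triggered. */
static int model_flush_calls;

/* Mirrors the shape of the quoted callback: flush only on restart. */
static int model_tdx_reboot_notify(struct model_notifier *nb,
				   unsigned long code, void *unused)
{
	(void)nb;
	(void)unused;
	if (code == MODEL_SYS_RESTART)
		model_flush_calls++;	/* stands in for the all-CPU flush */
	return MODEL_NOTIFY_DONE;
}

/* Walks the chain and invokes every registered callback with @code. */
static void model_call_chain(struct model_notifier *head, unsigned long code)
{
	for (struct model_notifier *nb = head; nb != NULL; nb = nb->next) {
		assert(nb->call != NULL);
		nb->call(nb, code, NULL);
	}
}
```

In this model a halt event leaves the flush counter untouched, while a restart event increments it once, matching the `code == SYS_RESTART` check in the quoted hunk.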

[...]