Re: [PATCH 3/4] x86/kexec(): Reset TDX private memory on platforms with TDX erratum

From: Huang, Kai
Date: Thu Feb 01 2024 - 09:35:44 EST




On 1/02/2024 5:21 am, Dave Hansen wrote:
On 1/31/24 03:31, Huang, Kai wrote:
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -28,6 +28,7 @@
#include <asm/setup.h>
#include <asm/set_memory.h>
#include <asm/cpu.h>
+#include <asm/tdx.h>
#ifdef CONFIG_ACPI
/*
@@ -298,9 +299,24 @@ void machine_kexec(struct kimage *image)
void *control_page;
int save_ftrace_enabled;
+ /*
+ * For platforms with TDX "partial write machine check" erratum,
+ * all TDX private pages need to be converted back to normal
+ * before booting to the new kernel, otherwise the new kernel
+ * may get unexpected machine check.
+ *
+ * But skip this when preserve_context is on. The second kernel
+ * shouldn't write to the first kernel's memory anyway. Skipping
+ * this also avoids killing TDX in the first kernel, which would
+ * require more complicated handling.
+ */

This is waaaaaaaaaaaaaaay too much information to stick in a generic
function. Just call tdx_reset_memory() and make the argument about

Hi Dave,

Thanks for review.

Agreed too much info here. But I don't quite understand your second sentence here. Can I have your suggestion again?


#ifdef CONFIG_KEXEC_JUMP
if (image->preserve_context)
save_processor_state();
+ else
+ tdx_reset_memory();
+#else
+ tdx_reset_memory();
#endif

Wow, that's awfully hard to read. I really wish folks' gag reflex would
kick in when they see stuff like this to get them to spend an additional
15 seconds to turn this into:

if (IS_ENABLED(CONFIG_KEXEC_JUMP) && image->preserve_context)
save_processor_state();
else
tdx_reset_memory();


Yeah this is way better, but as Kirill pointed out we will have build error when both SUSPEND and HIBERNATION is off.

Maybe below?

#ifdef CONFIG_KEXEC_JUMP
if (image->preserve_context)
save_processor_state();
else
#endif
tdx_reset_memory();


save_ftrace_enabled = __ftrace_enabled_save();
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 9f1fed458a32..0537b1b76c2b 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -28,6 +28,7 @@
#include <linux/acpi.h>
#include <linux/suspend.h>
#include <linux/acpi.h>
+#include <linux/reboot.h>
#include <asm/page.h>
#include <asm/special_insns.h>
#include <asm/msr-index.h>
@@ -54,6 +55,8 @@ static DEFINE_MUTEX(tdx_module_lock);
/* All TDX-usable memory regions. Protected by mem_hotplug_lock. */
static LIST_HEAD(tdx_memlist);
+static bool tdx_rebooting;
+
typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
@@ -1187,6 +1190,9 @@ static int __tdx_enable(void)
{
int ret;
+ if (tdx_rebooting)
+ return -EAGAIN;

This whole 'tdx_rebooting' deserves to at least be in its own patch.
Also -EAGAIN seems to be a really odd return code for a permanent failure.

Will move to a separate patch.

How about just -EINVAL?


ret = init_tdx_module();
if (ret) {
pr_err("module initialization failed (%d)\n", ret);
@@ -1420,6 +1426,90 @@ static struct notifier_block tdx_memory_nb = {
.notifier_call = tdx_memory_notifier,
};
+/*
+ * Convert TDX private pages back to normal on platforms with
+ * "partial write machine check" erratum.
+ *
+ * Called from machine_kexec() before booting to the new kernel.
+ */
+void tdx_reset_memory(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
+ return;
+
+ /*
+ * Kernel read/write to TDX private memory doesn't
+ * cause machine check on hardware w/o this erratum.
+ */
+ if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+ return;
+
+ /* Called from kexec() when only rebooting cpu is alive */
+ WARN_ON_ONCE(num_online_cpus() != 1);
+
+ /*
+ * tdx_reboot_notifier() waits until ongoing TDX module
+ * initialization to finish, and module initialization is
+ * rejected after that. Therefore @tdx_module_status is
+ * stable here and can be read w/o holding lock.
+ */
+ if (tdx_module_status != TDX_MODULE_INITIALIZED)
+ return;

kexec() can't happen until after reboot notifiers are called.
tdx_reboot_notifier() makes 'tdx_module_status' stable, so no lock is
needed.

Right?

Yes. I can replace the comment with your words.


+ /*
+ * Flush cache of all TDX private memory _before_ converting
+ * them back to avoid silent memory corruption.
+ */
+ native_wbinvd();

Since this is single-threaded, it also needs to mention where all the
other CPU caches got flushed.

OK will point out.


+ /*
+ * Convert PAMTs back to normal. All other cpus are already
+ * dead and TDMRs/PAMTs are stable.
+ *
+ * Ideally it's better to cover all types of TDX private pages
+ * here, but it's impractical:
+ *
+ * - There's no existing infrastructure to tell whether a page
+ * is TDX private memory or not.
+ *
+ * - Using SEAMCALL to query TDX module isn't feasible either:
+ * - VMX has been turned off by reaching here so SEAMCALL
+ * cannot be made;
+ * - Even SEAMCALL can be made the result from TDX module may

^ if ^ a ^,


Thanks. Will add.

+ * not be accurate (e.g., remote CPU can be stopped while
+ * the kernel is in the middle of reclaiming TDX private
+ * page and doing MOVDIR64B).
+ *
+ * One temporary solution could be just converting all memory
+ * pages, but it's problematic too, because not all pages are
+ * mapped as writable in direct mapping. It can be done by
+ * switching to the identical mapping for kexec() or a new page
+ * table which maps all pages as writable, but the complexity is
+ * overkill.
+ *
+ * Thus instead of doing something dramatic to convert all pages,
+ * only convert PAMTs here. Other kernel components which use
+ * TDX need to do the conversion on their own by intercepting the
+ * rebooting/shutdown notifier (KVM already does that).
+ */

I'd leave the extended alternatives discussion in the changelog, not here.

Focus on what _this_ is doing and why it is imperfect:

1. Just reset the PAMDs
2. This leaves A, B, and C undealt with
3. The risk of leaving those is ...


Agreed. Will do.


+ tdmrs_reset_pamt_all(&tdx_tdmr_list);
+}
+
+static int tdx_reboot_notifier(struct notifier_block *nb, unsigned long mode,
+ void *unused)
+{
+ /* Wait ongoing TDX initialization to finish */
+ mutex_lock(&tdx_module_lock);
+ tdx_rebooting = true;
+ mutex_unlock(&tdx_module_lock);
+
+ return NOTIFY_OK;
+}

Why is 'tdx_rebooting' a new variable instead of a new state in
tdx_module_status?


I think we can but I believe using tdx_rebooting is more clear.

For instance, if we want to add a new state TDX_MODULE_ABORT for this case, when tdx_reboot_notifier() is called after module initialization, then the tdx_module_status (which is already TDX_MODULE_INITIALIZED) will be overwritten and we will lose the information that module initialization has been done successfully.

We may be able to avoid these using more complicated logic but I think using a separate tdx_rebooting is straightforward.