Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
From: Chao Gao
Date: Sun Jul 06 2025 - 23:16:59 EST
On Thu, Jul 03, 2025 at 06:37:12PM +0300, Adrian Hunter wrote:
>Avoid clearing reclaimed TDX private pages unless the platform is affected
>by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
>time on unaffected systems.
>
>Background
>
>KVM currently clears reclaimed TDX private pages using MOVDIR64B, which:
>
> - Clears the TD Owner bit (which identifies TDX private memory) and
> integrity metadata without triggering integrity violations.
> - Clears poison from cache lines without consuming it, avoiding MCEs on
> access (refer TDX Module Base spec. 16.5. Handling Machine Check
> Events during Guest TD Operation).
>
>The TDX module also uses MOVDIR64B to initialize private pages before use.
>If cache flushing is needed, it sets TDX_FEATURES.CLFLUSH_BEFORE_ALLOC.
>However, KVM currently flushes unconditionally, refer commit 94c477a751c7b
>("x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages")
>
>In contrast, when private pages are reclaimed, the TDX Module handles
>flushing via the TDH.PHYMEM.CACHE.WB SEAMCALL.
>
>Problem
>
>Clearing all private pages during VM shutdown is costly. For guests
>with a large amount of memory it can take minutes.
>
>Solution
>
>TDX Module Base Architecture spec. documents that private pages reclaimed
>from a TD should be initialized using MOVDIR64B, in order to avoid
>integrity violation or TD bit mismatch detection when later being read
>using a shared HKID, refer April 2025 spec. "Page Initialization" in
>section "8.6.2. Platforms not Using ACT: Required Cache Flush and
>Initialization by the Host VMM"
>
>That is an overstatement and will be clarified in coming versions of the
>spec. In fact, as outlined in "Table 16.2: Non-ACT Platforms Checks on
>Memory" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
>Mode" in the same spec, there is no issue accessing such reclaimed pages
>using a shared key that does not have integrity enabled. Linux always uses
>KeyID 0 which never has integrity enabled. KeyID 0 is also the TME KeyID
>which disallows integrity, refer "TME Policy/Encryption Algorithm" bit
>description in "Intel Architecture Memory Encryption Technologies" spec
>version 1.6 April 2025. So there is no need to clear pages to avoid
>integrity violations.
>
>There remains a risk of poison consumption. However, in the context of
>TDX, it is expected that there would be a machine check associated with the
>original poisoning. On some platforms that results in a panic. However
>platforms may support "SEAM_NR" Machine Check capability, in which case
>Linux machine check handler marks the page as poisoned, which prevents it
>from being allocated anymore, refer commit 7911f145de5fe ("x86/mce:
>Implement recovery for errors in TDX/SEAM non-root mode")

Even on a CPU with SEAM_NR and without X86_BUG_TDX_PW_MCE, is there still a
risk of poisoned memory being returned to the host kernel? Only poison
consumption raises a machine check (#MC), so if a poisoned page is never
consumed in SEAM non-root mode, no #MC is raised and the commit mentioned
above never marks the page as poisoned.

A reclaimed poisoned page could then be reused by the host and potentially
cause a kernel panic. While WBINVD could help, we believe it's not worth it
because it would slow down the vast majority of cases. Is my understanding
correct?
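
To make the scenario I am asking about concrete, here is a minimal user-space
sketch. It is only a model of my understanding, not kernel code: struct
page_model and the model_*() helpers are hypothetical names, and the
assumption that clearing with MOVDIR64B removes poison is taken from the
changelog above.

```c
#include <stdbool.h>

/* Hypothetical model of one TDX private page; not a kernel structure. */
struct page_model {
	bool hw_poisoned;	/* page holds a poisoned cache line */
	bool consumed_in_td;	/* guest read the poison, so #MC was raised */
	bool marked_hwpoison;	/* MCE handler flagged the page as poisoned */
};

/* Models guest execution: the page is only marked if poison was consumed. */
static void model_guest_run(struct page_model *p)
{
	if (p->hw_poisoned && p->consumed_in_td)
		p->marked_hwpoison = true;
}

/*
 * Models reclaim with this patch applied: the page is cleared only when
 * the erratum is present. Clearing is assumed to remove the poison
 * without consuming it.
 */
static void model_reclaim(struct page_model *p, bool bug_tdx_pw_mce)
{
	if (bug_tdx_pw_mce)
		p->hw_poisoned = false;
}

/* True if the host could later allocate a page that still holds poison. */
static bool model_host_at_risk(const struct page_model *p)
{
	return p->hw_poisoned && !p->marked_hwpoison;
}
```

In this model, a page that is poisoned but never consumed by the guest stays
at risk after model_reclaim(&p, false), while the same page is safe if the
guest consumed the poison or if the clearing path ran.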