Re: [RFC PATCH v5 30/45] x86/virt/tdx: Add API to demote a 2MB mapping to 512 4KB mappings
From: Edgecombe, Rick P
Date: Mon Jun 29 2026 - 17:07:34 EST
On Thu, 2026-06-11 at 16:44 +0800, Yan Zhao wrote:
Nice find.
> Since this contention should occur rarely (e.g., when there's a second TD
> invoking PAMT.ADD concurrently while the first TD is invoking DEMOTE, and the
> DPAMT page pair to add for DEMOTE must reside in the 2MB target range as
> PAMT.ADD), a possible optimization is to avoid holding the global pamt_lock in
> the first invocation of tdh_mem_page_demote() (e.g., by indicating try or fast
> mode); only acquire the global pamt_lock if the first try returns busy, ensuring
> the second invocation must succeed.
Yes, let's not do it for the initial implementation.
>
> [1] https://lore.kernel.org/kvm/aNX6V6OSIwly1hu4@xxxxxxxxxxxxxxxxxxxxxxxxx
> [2] The contention is verified with an internal POC.
> Error logs:
> a.1) DEMOTE adds PAMT pages pfn1=0x19c0a0, pfn2=0x1b572f for guest pfn=0x519800
> 2) __tdx_pamt_get() adds PAMT pages for pfn=0x19c0a1.
> 3) DEMOTE returns error 0x800002000000000c.
> b.1) DEMOTE adds PAMT pages pfn1=0x1b090c, pfn2=0x119b4f for guest pfn=0x511a00
> 2) __tdx_pamt_get() adds PAMT pages for pfn=0x1b090d
> 3) PAMT.ADD returns error 0x8000020000000001.
>
> [3] New implementation:
> u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, enum pg_level level, u64 pfn,
> struct page *new_sp, struct tdx_pamt_cache *pamt_cache,
> u64 *ext_err1, u64 *ext_err2)
> {
> bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == PG_LEVEL_2M;
> struct page *pamt_pages[TDX_DPAMT_ENTRY_PAGE_CNT];
> struct tdx_module_args args = {
> .rcx = gpa | pg_level_to_tdx_sept_level(level),
> .rdx = tdx_tdr_pa(td),
> .r8 = page_to_phys(new_sp),
> };
> atomic_t *pamt_refcount;
> u64 ret;
>
> if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
> return TDX_SW_ERROR;
>
> /* Flush the new S-EPT page to be added */
> tdx_clflush_page(new_sp);
>
> if (dpamt) {
>
Can we just not do huge pages if we don't have DPAMT? Actually, weren't we
already going to do it like that?
>
> if (alloc_pamt_array(pamt_pages, pamt_cache))
> return TDX_SW_ERROR;
>
> args.r12 = page_to_phys(pamt_pages[0]);
> args.r13 = page_to_phys(pamt_pages[1]);
>
> /*
> * Before demotion, the 2MB guest memory range is not managed
> * by DPAMT, so its pamt_refcount should be 0.
> * Set it to 512 after demotion succeeds, since removing of each
> * 4KB mapping will reduce the refcount by 1.
> */
> pamt_refcount = tdx_find_pamt_refcount(pfn);
It feels kind of hacky. The refcount stuff is already a bit hard to follow.
>
> spin_lock(&pamt_lock);
> }
> ret = seamcall_saved_ret(TDH_MEM_PAGE_DEMOTE, &args);
>
> if (dpamt) {
> if (!ret)
> WARN_ON_ONCE(atomic_cmpxchg_release(pamt_refcount, 0, PTRS_PER_PMD));
>
> spin_unlock(&pamt_lock);
>
> if (ret)
> free_pamt_array(pamt_pages);
> }
>
> *ext_err1 = args.rcx;
> *ext_err2 = args.rdx;
>
> return ret;
> }