Re: [RFC PATCH v5 30/45] x86/virt/tdx: Add API to demote a 2MB mapping to 512 4KB mappings
From: Yan Zhao
Date: Mon Jun 29 2026 - 22:38:13 EST
On Tue, Jun 30, 2026 at 05:07:13AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2026-06-11 at 16:44 +0800, Yan Zhao wrote:
>
> Nice find.
>
> > Since this contention should occur rarely (e.g., when there's a second TD
> > invoking PAMT.ADD concurrently while the first TD is invoking DEMOTE, and the
> > DPAMT page pair to add for DEMOTE must reside in the 2MB target range as
> > PAMT.ADD), a possible optimization is to avoid holding the global pamt_lock in
> > the first invocation of tdh_mem_page_demote() (e.g., by indicating try or fast
> > mode); only acquire the global pamt_lock if the first try returns busy, ensuring
> > the second invocation must succeed.
>
> Yes, let's not do it for the initial implementation.
Ok. I can put the optimization in the TODO list.
> >
> > [1] https://lore.kernel.org/kvm/aNX6V6OSIwly1hu4@xxxxxxxxxxxxxxxxxxxxxxxxx
> > [2] The contention is verified with an internal POC.
> > Error logs:
> > a.1) DEMOTE adds PAMT pages pfn1=0x19c0a0, pfn2=0x1b572f for guest pfn=0x519800
> > 2) __tdx_pamt_get() adds PAMT pages for pfn=0x19c0a1.
> > 3) DEMOTE returns error 0x800002000000000c.
> > b.1) DEMOTE adds PAMT pages pfn1=0x1b090c, pfn2=0x119b4f for guest pfn=0x511a00
> > 2) __tdx_pamt_get() adds PAMT pages for pfn=0x1b090d
> > 3) PAMT.ADD returns error 0x8000020000000001.
> >
> > [3] New implementation:
> > u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, enum pg_level level, u64 pfn,
> > struct page *new_sp, struct tdx_pamt_cache *pamt_cache,
> > u64 *ext_err1, u64 *ext_err2)
> > {
> > bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == PG_LEVEL_2M;
> > struct page *pamt_pages[TDX_DPAMT_ENTRY_PAGE_CNT];
> > struct tdx_module_args args = {
> > .rcx = gpa | pg_level_to_tdx_sept_level(level),
> > .rdx = tdx_tdr_pa(td),
> > .r8 = page_to_phys(new_sp),
> > };
> > atomic_t *pamt_refcount;
> > u64 ret;
> >
> > if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
> > return TDX_SW_ERROR;
> >
> > /* Flush the new S-EPT page to be added */
> > tdx_clflush_page(new_sp);
> >
> > if (dpamt) {
> >
>
> Can we just not do huge pages if we don't have DPAMT? Actually, weren't we
Yes, we can.
I wasn't aware we're going to disallow huge page if we don't have DPAMT.
I thought we could have huge page with and without DPAMT.
> already going to do it like that?
Is this to avoid the "if (dpamt)" check here?
We may still need to check "level == PG_LEVEL_2M" here, since DPAMT related
code is not necessary if the demotion is from 1GB to 2MB.
(This tdh_mem_page_demote() implementation could support 1GB --> 2MB demotion
theoretically).
Or maybe for now just warn and return TDX_SW_ERROR if level != PG_LEVEL_2M ?
> >
> > if (alloc_pamt_array(pamt_pages, pamt_cache))
> > return TDX_SW_ERROR;
> >
> > args.r12 = page_to_phys(pamt_pages[0]);
> > args.r13 = page_to_phys(pamt_pages[1]);
> >
> > /*
> > * Before demotion, the 2MB guest memory range is not managed
> > * by DPAMT, so its pamt_refcount should be 0.
> > * Set it to 512 after demotion succeeds, since removing of each
> > * 4KB mapping will reduce the refcount by 1.
> > */
> > pamt_refcount = tdx_find_pamt_refcount(pfn);
>
> It feels kind of hacky. The refcount stuff is already a bit hard to follow.
After demotion, there would be 512 4KB mappings for this 2MB range.
Then, in later tdx_sept_remove_leaf_spte(), tdx_pamt_put() would be invoked
for 512 times. That's why we need to set the refcount to 512 after a successful
demotion here.
>
> >
> > spin_lock(&pamt_lock);
> > }
> > ret = seamcall_saved_ret(TDH_MEM_PAGE_DEMOTE, &args);
> >
> > if (dpamt) {
> > if (!ret)
> > WARN_ON_ONCE(atomic_cmpxchg_release(pamt_refcount, 0, PTRS_PER_PMD));
> >
> > spin_unlock(&pamt_lock);
> >
> > if (ret)
> > free_pamt_array(pamt_pages);
> > }
> >
> > *ext_err1 = args.rcx;
> > *ext_err2 = args.rdx;
> >
> > return ret;
> > }
>