Re: [RFC PATCH v5 30/45] x86/virt/tdx: Add API to demote a 2MB mapping to 512 4KB mappings

From: Edgecombe, Rick P

Date: Mon Jun 29 2026 - 17:07:34 EST


On Thu, 2026-06-11 at 16:44 +0800, Yan Zhao wrote:

Nice find.

> Since this contention should occur rarely (e.g., when there's a second TD
> invoking PAMT.ADD concurrently while the first TD is invoking DEMOTE, and the
> DPAMT page pair to add for DEMOTE must reside in the 2MB target range as
> PAMT.ADD),  a possible optimization is to avoid holding the global pamt_lock in
> the first invocation of tdh_mem_page_demote() (e.g., by indicating try or fast
> mode); only acquire the global pamt_lock if the first try returns busy, ensuring
> the second invocation must succeed.

Yes, let's not do it for the initial implementation.

>
> [1] https://lore.kernel.org/kvm/aNX6V6OSIwly1hu4@xxxxxxxxxxxxxxxxxxxxxxxxx
> [2] The contention is verified with an internal POC.
>     Error logs:
>     a.1) DEMOTE adds PAMT pages pfn1=0x19c0a0, pfn2=0x1b572f for guest pfn=0x519800
>       2) __tdx_pamt_get() adds PAMT pages for pfn=0x19c0a1.
>       3) DEMOTE returns error 0x800002000000000c.
>     b.1) DEMOTE adds PAMT pages pfn1=0x1b090c, pfn2=0x119b4f for guest pfn=0x511a00
>       2) __tdx_pamt_get() adds PAMT pages for pfn=0x1b090d
>       3) PAMT.ADD returns error 0x8000020000000001.
>
> [3] New implementation:
> u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, enum pg_level level, u64 pfn,
>                         struct page *new_sp, struct tdx_pamt_cache *pamt_cache, 
>                         u64 *ext_err1, u64 *ext_err2)                           
> {                                                                               
>         bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == PG_LEVEL_2M;
>         struct page *pamt_pages[TDX_DPAMT_ENTRY_PAGE_CNT];                      
>         struct tdx_module_args args = {                                         
>                 .rcx = gpa | pg_level_to_tdx_sept_level(level),                 
>                 .rdx = tdx_tdr_pa(td),                                          
>                 .r8 = page_to_phys(new_sp),                                     
>         };                                                                      
>         atomic_t *pamt_refcount;                                                
>         u64 ret;                                                                
>                                                                                 
>         if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))                     
>                 return TDX_SW_ERROR;                                            
>                                                                                 
>         /* Flush the new S-EPT page to be added */                              
>         tdx_clflush_page(new_sp);                                               
>                                                                                 
>         if (dpamt) {                             
>

Can we just not do huge pages if we don't have DPAMT? Actually, weren't we
already going to do it like that?

>                                
>                 if (alloc_pamt_array(pamt_pages, pamt_cache))                   
>                         return TDX_SW_ERROR;                                    
>                                                                                 
>                 args.r12 = page_to_phys(pamt_pages[0]);                         
>                 args.r13 = page_to_phys(pamt_pages[1]);                         
>                                                                                 
>                 /*                                                              
>                  * Before demotion, the 2MB guest memory range is not managed   
>                  * by DPAMT, so its pamt_refcount should be 0.                  
>                  * Set it to 512 after demotion succeeds, since removing of each
>                  * 4KB mapping will reduce the refcount by 1.                   
>                  */                                                             
>                 pamt_refcount = tdx_find_pamt_refcount(pfn);                    

It feels kind of hacky. The refcount stuff is already a bit hard to follow.

>                                                                                 
>                 spin_lock(&pamt_lock);                                          
>         }
> ret = seamcall_saved_ret(TDH_MEM_PAGE_DEMOTE, &args);
>
>         if (dpamt) {
>                 if (!ret)
>                         WARN_ON_ONCE(atomic_cmpxchg_release(pamt_refcount, 0, PTRS_PER_PMD));
>
>                 spin_unlock(&pamt_lock);
>
>                 if (ret)
>                         free_pamt_array(pamt_pages);
>         }
>
>         *ext_err1 = args.rcx;
>         *ext_err2 = args.rdx;
>
>         return ret;
> }