Re: [RFC PATCH v5 30/45] x86/virt/tdx: Add API to demote a 2MB mapping to 512 4KB mappings

From: Yan Zhao

Date: Mon Jun 29 2026 - 22:38:13 EST


On Tue, Jun 30, 2026 at 05:07:13AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2026-06-11 at 16:44 +0800, Yan Zhao wrote:
>
> Nice find.
>
> > Since this contention should occur rarely (e.g., when there's a second TD
> > invoking PAMT.ADD concurrently while the first TD is invoking DEMOTE, and the
> > DPAMT page pair to add for DEMOTE must reside in the 2MB target range as
> > PAMT.ADD),  a possible optimization is to avoid holding the global pamt_lock in
> > the first invocation of tdh_mem_page_demote() (e.g., by indicating try or fast
> > mode); only acquire the global pamt_lock if the first try returns busy, ensuring
> > the second invocation must succeed.
>
> Yes, let's not do it for the initial implementation.
Ok. I can put the optimization in the TODO list.

> >
> > [1] https://lore.kernel.org/kvm/aNX6V6OSIwly1hu4@xxxxxxxxxxxxxxxxxxxxxxxxx
> > [2] The contention is verified with an internal POC.
> >     Error logs:
> >     a.1) DEMOTE adds PAMT pages pfn1=0x19c0a0, pfn2=0x1b572f for guest pfn=0x519800
> >       2) __tdx_pamt_get() adds PAMT pages for pfn=0x19c0a1.
> >       3) DEMOTE returns error 0x800002000000000c.
> >     b.1) DEMOTE adds PAMT pages pfn1=0x1b090c, pfn2=0x119b4f for guest pfn=0x511a00
> >       2) __tdx_pamt_get() adds PAMT pages for pfn=0x1b090d
> >       3) PAMT.ADD returns error 0x8000020000000001.
> >
> > [3] New implementation:
> > u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, enum pg_level level, u64 pfn,
> >                         struct page *new_sp, struct tdx_pamt_cache *pamt_cache, 
> >                         u64 *ext_err1, u64 *ext_err2)                           
> > {                                                                               
> >         bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == PG_LEVEL_2M;
> >         struct page *pamt_pages[TDX_DPAMT_ENTRY_PAGE_CNT];                      
> >         struct tdx_module_args args = {                                         
> >                 .rcx = gpa | pg_level_to_tdx_sept_level(level),                 
> >                 .rdx = tdx_tdr_pa(td),                                          
> >                 .r8 = page_to_phys(new_sp),                                     
> >         };                                                                      
> >         atomic_t *pamt_refcount;                                                
> >         u64 ret;                                                                
> >                                                                                 
> >         if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))                     
> >                 return TDX_SW_ERROR;                                            
> >                                                                                 
> >         /* Flush the new S-EPT page to be added */                              
> >         tdx_clflush_page(new_sp);                                               
> >                                                                                 
> >         if (dpamt) {                             
> >
>
> Can we just not do huge pages if we don't have DPAMT? Actually, weren't we
Yes, we can.
I wasn't aware we're going to disallow huge page if we don't have DPAMT.
I thought we could have huge page with and without DPAMT.

> already going to do it like that?
Is this to avoid the "if (dpamt)" check here?
We may still need to check "level == PG_LEVEL_2M" here, since DPAMT related
code is not necessary if the demotion is from 1GB to 2MB.
(This tdh_mem_page_demote() implementation could support 1GB --> 2MB demotion
theoretically).
Or maybe for now just warn and return TDX_SW_ERROR if level != PG_LEVEL_2M ?

> >                                
> >                 if (alloc_pamt_array(pamt_pages, pamt_cache))                   
> >                         return TDX_SW_ERROR;                                    
> >                                                                                 
> >                 args.r12 = page_to_phys(pamt_pages[0]);                         
> >                 args.r13 = page_to_phys(pamt_pages[1]);                         
> >                                                                                 
> >                 /*                                                              
> >                  * Before demotion, the 2MB guest memory range is not managed   
> >                  * by DPAMT, so its pamt_refcount should be 0.                  
> >                  * Set it to 512 after demotion succeeds, since removing of each
> >                  * 4KB mapping will reduce the refcount by 1.                   
> >                  */                                                             
> >                 pamt_refcount = tdx_find_pamt_refcount(pfn);                    
>
> It feels kind of hacky. The refcount stuff is already a bit hard to follow.
After demotion, there would be 512 4KB mappings for this 2MB range.
Then, in later tdx_sept_remove_leaf_spte(), tdx_pamt_put() would be invoked
for 512 times. That's why we need to set the refcount to 512 after a successful
demotion here.

>
> >                                                                                 
> >                 spin_lock(&pamt_lock);                                          
> >         }
> > ret = seamcall_saved_ret(TDH_MEM_PAGE_DEMOTE, &args);
> >
> >         if (dpamt) {
> >                 if (!ret)
> >                         WARN_ON_ONCE(atomic_cmpxchg_release(pamt_refcount, 0, PTRS_PER_PMD));
> >
> >                 spin_unlock(&pamt_lock);
> >
> >                 if (ret)
> >                         free_pamt_array(pamt_pages);
> >         }
> >
> >         *ext_err1 = args.rcx;
> >         *ext_err2 = args.rdx;
> >
> >         return ret;
> > }
>