Re: [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path

From: Yan Zhao
Date: Wed Dec 10 2025 - 21:40:43 EST


On Wed, Dec 10, 2025 at 05:42:44PM -0800, Vishal Annapurve wrote:
> On Sun, Dec 7, 2025 at 9:51 PM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 6b6c46c27390..508b133df903 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1591,6 +1591,8 @@ struct kvm_arch {
> > > > #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> > > > struct kvm_mmu_memory_cache split_desc_cache;
> > > >
> > > > + struct kvm_mmu_memory_cache pamt_page_cache;
> > > > +
> > >
> > > The latest DPAMT patches use a per-vcpu tdx_prealloc struct to handle
> > > preallocating pages for pamt. I'm wondering if you've considered how
> > > this would work here since some of the calls requiring pamt originate
> > > from user space ioctls and therefore are not associated with a vcpu.
> > I'll use a per-VM tdx_prealloc struct for splitting here, similar to the
> > per-VM pamt_page_cache.
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> > index 43dd295b7fd6..91bea25da528 100644
> > --- a/arch/x86/kvm/vmx/tdx.h
> > +++ b/arch/x86/kvm/vmx/tdx.h
> > @@ -48,6 +48,9 @@ struct kvm_tdx {
> > * Set/unset is protected with kvm->mmu_lock.
> > */
> > bool wait_for_sept_zap;
> > +
> > + spinlock_t external_kvm_split_cache_lock;
> > + struct tdx_prealloc prealloc_split_cache;
> > };
> >
> > > Since the tdx_prealloc is a per vcpu struct there are no race issues
> > > when multiple vcpus need to add pamt pages, but here it would be
> > > trickier because, theoretically, multiple threads could split
> > > different pages simultaneously.
> > A spinlock, external_kvm_split_cache_lock, is introduced to protect the
> > cache enqueue and dequeue.
> > (a) When tdp_mmu_split_huge_pages_root() is invoked under write mmu_lock:
> > - Since the cache dequeue already runs under write mmu_lock in
> > tdp_mmu_split_huge_page()-->tdx_sept_split_private_spte(), acquiring/
> > releasing an additional spinlock there is harmless.
> > - Though the cache enqueue in topup_external_split_cache() is not protected
> > by mmu_lock, protecting enqueue with a spinlock should not reduce
> > concurrency.
>
> Even with the spin lock protecting the cache topup/consumption
> operation, is it possible that one split operation context consumes
> the top-up performed by the other split operation causing failure with
> the subsequent consumptions?
The sequence of check-topup, topup, and consume is as follows:

1. write_lock(&kvm->mmu_lock)
check topup

2. write_unlock(&kvm->mmu_lock)
topup (get/put split lock to enqueue)

3. write_lock(&kvm->mmu_lock)
check topup (goto 2 if topup is necessary) (*)
get split lock
consume
put split lock
write_unlock(&kvm->mmu_lock)

Note: due to the "iter.yielded = true" and "continue" after the topup (see the
diff I posted in my last reply), consuming does not directly follow the topup,
i.e., there is step 3.

Because of (*) in step 3, and because consuming happens under write mmu_lock,
it's impossible for splits in other threads to consume pages allocated for this
split.


> > (b) When tdp_mmu_split_huge_pages_root() is invoked under read mmu_lock:
> > Introducing a new spinlock may hurt concurrency for a brief duration (which
> > is necessary).
Let's talk more about this potential future use case (i.e., if there are
multiple callers of tdp_mmu_split_huge_pages_root() under shared mmu_lock).
The sequence would be:

1. read_lock(&kvm->mmu_lock)
check topup

2. read_unlock(&kvm->mmu_lock)
topup (get/put split lock to enqueue)

3. read_lock(&kvm->mmu_lock)
check topup (goto 2 if topup is necessary)

get split lock
check topup (return retry if topup is necessary) (**)
consume
put split lock
read_unlock(&kvm->mmu_lock)


Because of (**) in step 3, and because consuming happens under both the split
lock and read mmu_lock, it's likewise impossible for splits in other threads to
consume pages allocated for this split.


> > However, there's no known (future) use case for multiple threads invoking
> > tdp_mmu_split_huge_pages_root() on a mirror root under shared mmu_lock.
> >
> > For future splitting under shared mmu_lock in the fault path, we'll use
> > the per-vCPU tdx_prealloc instead of the per-VM cache. TDX can leverage
> > kvm_get_running_vcpu() to differentiate between the two caches.
> >
> >
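
In other words, the pool selection could look roughly like the sketch below
(kernel-style pseudocode, not a posted patch; the helper name and the
prealloc/prealloc_split_cache field names are hypothetical, only
kvm_get_running_vcpu() is an existing KVM helper):

```c
/* Hypothetical helper: pick the prealloc pool for an external page table op. */
static struct tdx_prealloc *tdx_split_prealloc(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	/* Fault path: running in vCPU context, use the per-vCPU pool. */
	if (vcpu)
		return &to_tdx(vcpu)->prealloc;

	/* Ioctl-originated split: fall back to the per-VM cache. */
	return &to_kvm_tdx(kvm)->prealloc_split_cache;
}
```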