Re: [RFC PATCH v5 19/45] KVM: Allow owner of kvm_mmu_memory_cache to provide a custom page allocator

From: Sean Christopherson

Date: Tue Feb 03 2026 - 21:17:52 EST

On Tue, Feb 03, 2026, Kai Huang wrote:
> On Tue, 2026-02-03 at 12:12 -0800, Sean Christopherson wrote:
> > On Tue, Feb 03, 2026, Kai Huang wrote:
> > > On Wed, 2026-01-28 at 17:14 -0800, Sean Christopherson wrote:
> > > > Extend "struct kvm_mmu_memory_cache" to support a custom page allocator
> > > > so that x86's TDX can update per-page metadata on allocation and free().
> > > >
> > > > Name the allocator page_get() to align with __get_free_page(), e.g. to
> > > > communicate that it returns an "unsigned long", not a "struct page", and
> > > > to avoid collisions with macros, e.g. with alloc_page.
> > > >
> > > > Suggested-by: Kai Huang <kai.huang@xxxxxxxxx>
> > > > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> > >
> > > I thought it could be more generic for allocating an object, but not just a
> > > page.
> > >
> > > E.g., I thought we might be able to use it to allocate a structure which has
> > > "pair of DPAMT pages" so it could be assigned to 'struct kvm_mmu_page'. But
> > > it seems you abandoned this idea. May I ask why? Just want to understand
> > > the reasoning here.
> >
> > Because that requires more complexity and there's no known use case, and I don't
> > see an obvious way for a use case to come along. All of the motivations for a
> > custom allocation scheme that I can think of apply only to full pages, or fit
> > nicely in a kmem_cache.
> >
> > Specifically, the "cache" logic is already bifurcated between "kmem_cache" and
> > "page" usage. Further splitting the "page" case doesn't require modifications to
> > the "kmem_cache" case, whereas providing a fully generic solution would require
> > additional changes, e.g. to handle this code:
> >
> > 	page = (void *)__get_free_page(gfp_flags);
> > 	if (page && mc->init_value)
> > 		memset64(page, mc->init_value, PAGE_SIZE / sizeof(u64));
> >
> > It certainly wouldn't be much complexity, but this code is already a bit awkward,
> > so I don't think it makes sense to add support for something that will probably
> > never be used.
>
> For this particular piece of code, we can add a helper for allocating normal
> page table pages, get rid of mc->init_value completely and hook mc->page_get()
> to that helper.

Hmm, I like the idea, but I don't think it would be a net positive. In practice,
x86's "normal" page tables stop being normal, because KVM now initializes all
SPTEs with BIT(63)=1 on x86-64. And that would also incur an extra RETPOLINE on
all those allocations.
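
To illustrate the trade-off, here is a minimal userspace sketch of a cache with
an optional page_get() hook. It is not the kernel code: the struct, the helper
names, and the use of aligned_alloc() in place of __get_free_page() are all
assumptions made for the example. The point is only that a hooked cache takes
an indirect call on every allocation, while an unhooked cache stays on the
direct path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-in for struct kvm_mmu_memory_cache, not its real layout. */
struct cache_sketch {
	uint64_t init_value;
	/* Optional custom allocator; NULL means "use the default path". */
	void *(*page_get)(struct cache_sketch *mc);
	unsigned long nr_custom_calls;
};

/*
 * Default path: allocate a page and fill it with init_value.  The sketch
 * fills unconditionally (zero included) so that it is self-contained.
 */
static void *default_page_get(struct cache_sketch *mc)
{
	uint64_t *page = aligned_alloc(SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);
	size_t i;

	if (!page)
		return NULL;

	for (i = 0; i < SKETCH_PAGE_SIZE / sizeof(uint64_t); i++)
		page[i] = mc->init_value;
	return page;
}

/* Hypothetical custom hook: do per-allocation bookkeeping, then allocate. */
static void *tracking_page_get(struct cache_sketch *mc)
{
	mc->nr_custom_calls++;
	return default_page_get(mc);
}

static void *cache_alloc_page(struct cache_sketch *mc)
{
	/* Indirect call only when the owner installed a hook. */
	if (mc->page_get)
		return mc->page_get(mc);

	return default_page_get(mc);
}
```

Routing every "normal" page table allocation through the hook would make the
indirect call unconditional, which is the extra cost mentioned above.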

> A bonus is we can then call that helper in all places when KVM needs to
> allocate a page for normal page table instead of just calling
> get_zeroed_page() directly, e.g., like the one in
> tdp_mmu_alloc_sp_for_split(),

Huh. Actually, that's a bug, but not the one you probably expect. At a glance,
it looks like KVM is incorrectly zeroing the page instead of initializing it with
SHADOW_NONPRESENT_VALUE. But it's actually a "performance" bug, because KVM
doesn't actually need to pre-initialize the page: either the page will never be
used, or every SPTE will be initialized as a child SPTE.
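
As a rough userspace sketch of why the pre-initialization is wasted work (the
names, the 512-entry table size, and the child encoding are assumptions for
illustration, not KVM's actual SPTE format): a split writes every entry of the
new table before the table is used, so whatever the allocator left in the page
is never observed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_ENTRIES 512

/* Hypothetical child encoding: the parent's value plus the child's offset. */
static uint64_t make_child_sketch(uint64_t huge, unsigned int idx,
				  uint64_t child_size)
{
	return huge + (uint64_t)idx * child_size;
}

/*
 * Allocate an uninitialized table and populate every entry.  No zeroing
 * or init-value fill is needed because no entry is read before it is
 * written here.
 */
static uint64_t *split_sketch(uint64_t huge, uint64_t child_size)
{
	uint64_t *spt = malloc(SKETCH_ENTRIES * sizeof(*spt));
	unsigned int i;

	if (!spt)
		return NULL;

	for (i = 0; i < SKETCH_ENTRIES; i++)
		spt[i] = make_child_sketch(huge, i, child_size);

	return spt;
}
```

If the split is abandoned instead, the table is freed without ever being
installed, so its contents don't matter in that case either.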

So that one _should_ be different, e.g. should be:

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a32192c35099..36afd67601fc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1456,7 +1456,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 	if (!sp)
 		return NULL;
 
-	sp->spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+	sp->spt = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
 	if (!sp->spt)
 		goto err_spt;

> so that we can have a consistent way for allocating normal page table pages.