Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs

From: Yan Zhao

Date: Wed Apr 01 2026 - 05:14:23 EST


Below are responses to issues reported by sashiko. [1]

[1] https://sashiko.dev/#/patchset/20260327201421.2824383-1-rick.p.edgecombe%40intel.com

> @@ -627,21 +615,41 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> "a temporary frozen SPTE.\n"
> "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
> as_id, gfn, old_spte, new_spte, level);
> - return;
> + return 0;
> }
>
> - if (is_leaf != was_leaf)
> - kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
> -
> /*
> * Recursively handle child PTs if the change removed a subtree from
> * the paging structure. Note the WARN on the PFN changing without the
> * SPTE being converted to a hugepage (leaf) or being zapped. Shadow
> * pages are kernel allocations and should never be migrated.
> + *
> + * When modifying leaf entries in mirrored page tables, propagate all
> + * changes to the external SPTE.
> */
> if (was_present && !was_leaf &&
> - (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> + (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
> handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> + } else if (is_mirror_sp(sp) && is_present) {
> + r = set_external_spte_present(kvm, gfn, old_spte, new_spte, level);
> + if (r)
> + return r;
> + }
Issue #1:
"If a present non-leaf SPTE is replaced with a present leaf SPTE, such as
during a hugepage collapse, the first branch executes and calls
handle_removed_pt(). Because of the mutually exclusive "else if",
set_external_spte_present() is bypassed for the new leaf SPTE. Could this
leave the external S-EPT out of sync, with the parent entry pointing to a
freed child page table?
"
Response:
This is the page merging scenario. The mirror root does not allow huge pages yet,
so nothing should be required for now.
Page merging also remains disabled after the basic TDX huge page series [4].
See the responses to issues #2 and #3 for more details.

Issue #2:
"Additionally, if an SPTE is atomically zapped, is_present evaluates to false.
Because tdp_mmu_set_spte_atomic() lacks a fallback remove_external_spte()
call like the non-atomic tdp_mmu_set_spte() has, does this mean the external
S-EPT mapping remains fully present while KVM marks it as non-present?
"

Response:
Up to this patch, this seems to be a valid concern. However, atomic zap on the
mirror root isn't allowed yet (though the warning on that case is dropped in
patch 5 [2]). Atomic zap will be centralized into __handle_changed_spte() in
patch 15 [3]. Maybe patch 5 should be moved to after patch 15?

[2] https://lore.kernel.org/kvm/20260327201421.2824383-6-rick.p.edgecombe@xxxxxxxxx/
[3] https://lore.kernel.org/kvm/20260327201421.2824383-16-rick.p.edgecombe@xxxxxxxxx/
[4] https://lore.kernel.org/all/20260106102024.25023-1-yan.y.zhao@xxxxxxxxx/

Thinking more about centralizing the TDX hooks, could we be more aggressive?
I.e., could TDX have just a single hook, set_external_spte(), for propagating
all changes from the mirror page table to the S-EPT?
(The change below is based on the code with TDX huge page support applied.)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 13910d9af03a..d9404aeae5f3 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -95,7 +95,6 @@ KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
KVM_X86_OP_OPTIONAL_RET0(set_external_spte)
-KVM_X86_OP_OPTIONAL(reclaim_external_spt)
KVM_X86_OP_OPTIONAL_RET0(topup_external_cache)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 76f119a6412b..b722cce0b994 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1857,11 +1857,6 @@ struct kvm_x86_ops {
int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, u64 old_spte,
u64 new_spte, enum pg_level level);

- /* Update external page tables for page table about to be freed. */
- void (*reclaim_external_spt)(struct kvm *kvm, gfn_t gfn,
- struct kvm_mmu_page *sp);
-
-
int (*topup_external_cache)(struct kvm *kvm, struct kvm_vcpu *vcpu, int min_nr_spts);

bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2f6bfc875de1..9b4fa5b31f23 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -461,9 +461,6 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
}

- if (is_mirror_sp(sp))
- kvm_x86_call(reclaim_external_spt)(kvm, base_gfn, sp);
-
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
}

@@ -563,9 +560,17 @@ static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
* changes to the external SPTE.
*/
if (was_present && !was_leaf &&
- (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
+ (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
- } else if (is_mirror_sp(sp)) {
+
+ if (is_mirror_sp(sp)) {
+ /*
+ * Can also propagate changes to remove external pt. Since this
+ * occurs after the call_rcu() in handle_removed_pt(), the RCU
+ * read lock must be held.
+ */
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu read lock held");
+
r = kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
new_spte, level);
if (r)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 32d0c206ade6..cc03b9c7838c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1965,9 +1965,13 @@ static int tdx_sept_map_leaf_spte(struct kvm *kvm, gfn_t gfn, enum pg_level leve
return ret;
}

-static void tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
+static int tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
struct kvm_mmu_page *sp)
{
+ int ret;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
/*
* KVM doesn't (yet) zap page table pages in mirror page table while
* TD is active, though guest pages mapped in mirror page table could be
@@ -1981,8 +1985,10 @@ static void tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
* the page to prevent the kernel from accessing the encrypted page.
*/
if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
- tdx_reclaim_page(virt_to_page(sp->external_spt)))
+ tdx_reclaim_page(virt_to_page(sp->external_spt))) {
+ ret = -EIO;
goto out;
+ }

/*
* Immediately free the S-EPT page as the TDX subsystem doesn't support
@@ -1990,8 +1996,10 @@ static void tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
* TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding readers.
*/
free_page((unsigned long)sp->external_spt);
+ ret = 0;
out:
sp->external_spt = NULL;
+ return ret;
}

static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
@@ -2045,10 +2053,17 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
u64 new_spte, enum pg_level level)
{
- if (is_shadow_present_pte(old_spte) && is_shadow_present_pte(new_spte))
- return tdx_sept_split_private_spte(kvm, gfn, old_spte, new_spte, level);
- else if (is_shadow_present_pte(old_spte))
- return tdx_sept_remove_private_spte(kvm, gfn, level, old_spte);
+ if (is_shadow_present_pte(old_spte)) {
+ if (is_shadow_present_pte(new_spte)) {
+ if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
+ return tdx_sept_split_private_spte(kvm, gfn, old_spte, new_spte, level);
+ } else {
+ if (is_last_spte(old_spte, level))
+ return tdx_sept_remove_private_spte(kvm, gfn, level, old_spte);
+ else
+ return tdx_sept_reclaim_private_spt(kvm, gfn, spte_to_child_sp(old_spte));
+ }
+ }

if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
return -EIO;
@@ -3915,7 +3930,6 @@ void __init tdx_hardware_setup(void)
vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size, sizeof(struct kvm_tdx));

vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
- vt_x86_ops.reclaim_external_spt = tdx_sept_reclaim_private_spt;
vt_x86_ops.gmem_convert = tdx_gmem_convert;



> @@ -708,18 +698,49 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
> struct tdp_iter *iter,
> u64 new_spte)
> {
> + struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
> int ret;
>
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> - ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> + /* KVM should never freeze SPTEs using higher level APIs. */
> + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> +
> + /*
> + * Temporarily freeze the SPTE until the external PTE operation has
> + * completed (unless the new SPTE itself will be frozen), e.g. so that
> + * concurrent faults don't attempt to install a child PTE in the
> + * external page table before the parent PTE has been written, or try
> + * to re-install a page table before the old one was removed.
> + */
> + if (is_mirror_sptep(iter->sptep))
> + ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> + else
> + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> if (ret)
> return ret;
>
> - handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> - new_spte, iter->level, true);
> + ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> + new_spte, iter->level, true);
>
> - return 0;
> + /*
> + * Unfreeze the mirror SPTE. If updating the external SPTE failed,
> + * restore the old SPTE so that the SPTE isn't frozen in perpetuity,
> + * otherwise set the mirror SPTE to the new desired value.
> + */
> + if (is_mirror_sptep(iter->sptep)) {
> + if (ret)
> + __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
> + else
> + __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
Issue #3:
"
If the mutually exclusive branches in __handle_changed_spte() are updated
so both can run, there seems to be a risk during rollback. The function
handle_removed_pt() unlinks the child page table and schedules it for RCU
destruction. If set_external_spte_present() then fails, the rollback
code here restores iter->old_spte. Since iter->old_spte points to the child
page table that was just scheduled for freeing, can this cause a
use-after-free when the old SPTE is restored?
"

Response:
As in the response to issue #1, transitions from present non-leaf to present
leaf are not valid in the mirror page table. The TDX MMU will need special
handling to support that scenario in the future, e.g., TDX requires that all
child entries be either all mapped or all unmapped before merging.
If more robustness against unexpected page merging is preferred, the TDX huge
page series could skip handle_removed_pt() for the page merging case first.


Issue #4:
"
Furthermore, tdp_mmu_set_spte_atomic() now returns errors from
__handle_changed_spte(). Previously, it only returned an error (like -EBUSY)
if the try_cmpxchg64() failed, which natively updated iter->old_spte to the
new memory value.
If try_cmpxchg64() succeeds but __handle_changed_spte() fails, iter->old_spte
is not updated. Callers like clear_dirty_gfn_range() loop on failure assuming
the old value was refreshed:
clear_dirty_gfn_range() {
...
retry:
...
if (tdp_mmu_set_spte_atomic(kvm, &iter, iter.old_spte & ~dbit))
goto retry;
...
}
Will this result in an infinite loop, as the caller retries the exact same
atomic operation with the unchanged iter.old_spte?
"

Response:
Though the only expected error from __handle_changed_spte() is also -EBUSY, an
infinite loop is possible if updating the external page table fails repeatedly
due to contention.
However, this is currently benign, since callers like clear_dirty_gfn_range()
can't operate on the mirror page table.