Re: [PATCH] KVM: x86/tdp_mmu: Trigger the callback only when an interesting change

From: Sean Christopherson
Date: Thu Sep 12 2024 - 20:08:08 EST


On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> KVM MMU behavior
> ================
> The leaf SPTE state machine is coded in make_spte(). Consider AD bits and
> the present bits for simplicity. The two functionalities and AD bits
> support are related in this context. unsync (manipulate D bit and W bit,
> and handle write protect fault) and access tracking (manipulate A bit and
> present bit, and hand handle page fault). (We don't consider dirty page
> tracking for now as it's future work of TDX KVM.)
>
> * If AD bit is enabled:
> D bit state change for dirty page tracking
> On the first EPT violation without prefetch,
> - D bits are set.
> - Make SPTE writable as TDX supports only RXW (or if write fault).
> (TDX KVM doesn't support write protection at this state. It's future work.)
>
> On the second EPT violation.
> - clear D bits if write fault.

Heh, I was literally just writing changelogs for patches to address this (I told
Sagi I would do it "this week"; that was four weeks ago).

This is a bug in make_spte(). Replacing a W=1,D=1 SPTE with a W=1,D=0 SPTE is
nonsensical. And I'm pretty sure it's an outright but for the TDP MMU (see below).

Right now, the fixes for make_spte() are sitting toward the end of the massive
kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
fairly confident that series can land in 6.13 (lots and lots of small patches).

---
Author: Sean Christopherson <seanjc@xxxxxxxxxx>
AuthorDate: Thu Sep 12 16:23:21 2024 -0700
Commit: Sean Christopherson <seanjc@xxxxxxxxxx>
CommitDate: Thu Sep 12 16:35:06 2024 -0700

KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits

Do a remote TLB flush if installing a leaf SPTE overwrites an existing
leaf SPTE (with the same target pfn) and clears the Writable bit or the
Dirty bit. KVM isn't _supposed_ to clear Writable or Dirty bits in such
a scenario, but make_spte() has a flaw where it will fail to set the Dirty
if the existing SPTE is writable.

E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
second vCPU. If the first vCPU (or another vCPU) accesses memory using
the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
the only SPTE that is dirty at the time of the next relevant clearing of
the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
because it sees the D=0 SPTE, and thus will complete the clearing of the
dirty logs without performing a TLB flush.

Opportunistically harden the TDP MMU against clearing the Writable bit as
well, both to prevent similar bugs for write-protection, but also so that
the logic can eventually be deduplicated into spte.c (mmu_spte_update() in
the shadow MMU has similar logic).

Fix the bug in the TDP MMU's page fault handler even though make_spte() is
clearly doing odd things, e.g. it marks the page dirty in its slot but
doesn't set the Dirty bit. Precisely because replacing a leaf SPTE with
another leaf SPTE is so rare, the cost of hardening KVM against such bugs
is negligible. The make_spte() will be addressed in a future commit.

Fixes: bb18842e2111 ("kvm: x86/mmu: Add TDP MMU PF handler")
Cc: David Matlack <dmatlack@xxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3b996c1fdaab..7c6d1c610f0e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1038,7 +1038,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
return RET_PF_RETRY;
else if (is_shadow_present_pte(iter->old_spte) &&
- !is_last_spte(iter->old_spte, iter->level))
+ (!is_last_spte(iter->old_spte, iter->level) ||
+ (is_mmu_writable_spte(old_spte) && !is_writable_pte(new_spte)) ||
+ (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))))
kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);

/*
---

> arch/x86/kvm/mmu/spte.h | 6 ++++++
> arch/x86/kvm/mmu/tdp_mmu.c | 28 +++++++++++++++++++++++++---
> 2 files changed, 31 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index a72f0e3bde17..1726f8ec5a50 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -214,6 +214,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> */
> #define FROZEN_SPTE (SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
>
> +#define EXTERNAL_SPTE_IGNORE_CHANGE_MASK \
> + (shadow_acc_track_mask | \
> + (SHADOW_ACC_TRACK_SAVED_BITS_MASK << \
> + SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) | \

Just make TDX require A/D bits, there's no reason to care about access tracking.

> + shadow_dirty_mask | shadow_accessed_mask)
> +
> /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
> static_assert(!(FROZEN_SPTE & SPTE_MMU_PRESENT_MASK));
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 019b43723d90..cfb82ede8982 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -503,8 +503,6 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> int ret = 0;
>
> - KVM_BUG_ON(was_present, kvm);
> -
> lockdep_assert_held(&kvm->mmu_lock);
> /*
> * We need to lock out other updates to the SPTE until the external
> @@ -519,10 +517,34 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> * external page table, or leaf.
> */
> if (is_leaf) {
> - ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
> + /*
> + * SPTE is state machine with software available bits used.
> + * Check if the change is interesting to the backend.
> + */
> + if (!was_present)
> + ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
> + else {
> + /*
> + * The external PTEs don't need updates for some bits,
> + * but if others are changed, bug the VM.
> + */
> + if (KVM_BUG_ON(~EXTERNAL_SPTE_IGNORE_CHANGE_MASK &

There's no reason to bug the VM. WARN, yes (maybe), but not bug the VM. And I
think this should be short-circuited somewhere further up the chain, i.e. not
just for TDX.

One idea would be to WARN and skip setting the SPTE in tdp_mmu_map_handle_target_level().
I.e. WARN and ignore 1=>0 transitions for Writable and Dirty bits, and then drop
the TLB flush (assuming the above patch lands).

> + (old_spte ^ new_spte), kvm)) {

Curly braces are unnecessary.

> + ret = -EIO;
> + }
> + }
> +
> + /*
> + * The backend shouldn't return an error except EAGAIN.
> + * It's hard to debug without those info.
> + */
> + if (ret && ret != EAGAIN)
> + pr_debug("gfn 0x%llx old_spte 0x%llx new_spte 0x%llx level %d\n",
> + gfn, old_spte, new_spte, level);

Please no. Not in upstream. Yeah, for development it's fine, but sprinkling
printks all over the MMU eventually just results in stale printks, e.g. see all
of the pgprintk crud that got ripped out a while back.

> } else {
> void *external_spt = get_external_spt(gfn, new_spte, level);
>
> + KVM_BUG_ON(was_present, kvm);
> KVM_BUG_ON(!external_spt, kvm);
> ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, external_spt);
> }
>
> base-commit: d2c7662a6ea1c325a9ae878b3f1a265264bcd18b
> --
> 2.45.2
>