Re: [PATCH] KVM: x86/tdp_mmu: Trigger the callback only when an interesting change
From: Sean Christopherson
Date: Tue Oct 08 2024 - 12:07:29 EST
On Tue, Oct 08, 2024, Yan Zhao wrote:
> On Mon, Sep 30, 2024 at 08:54:38AM -0700, Sean Christopherson wrote:
> > On Sun, Sep 29, 2024, Yan Zhao wrote:
> > > On Fri, Sep 27, 2024 at 07:32:10AM -0700, Sean Christopherson wrote:
> > > > * Don't flush if the Accessed bit is cleared, as access tracking tolerates
> > > > * false negatives, and the one path that does care about TLB flushes,
> > > > * kvm_mmu_notifier_clear_flush_young(), uses mmu_spte_update_no_track().
> > > I have a question about why access tracking tolerates false negatives on the
> > > path kvm_mmu_notifier_clear_flush_young().
> > >
> > > kvm_mmu_notifier_clear_flush_young() invokes kvm_flush_remote_tlbs()
> > > only when kvm_age_gfn() returns true. But age_gfn_range()/kvm_age_rmap() will
> > > return false if the old spte is !is_accessed_spte().
> > >
> > > So, if the Accessed bit is cleared in make_spte(), is a TLB flush required to
> > > make up for the one that kvm_mmu_notifier_clear_flush_young() will then skip?
> >
> > Because access tracking in general is tolerant of stale results due to lack of
> > TLB flushes. E.g. on many architectures, the primary MMU has omitted TLB flushes
> > for years (10+ years on x86, commit b13b1d2d8692). The basic argument is that if
> > there is enough memory pressure to trigger reclaim, then there will be enough TLB
> > pressure to ensure that omitting the TLB flush doesn't result in a large number
> > of "bad" reclaims[1]. And conversely, if there isn't much TLB pressure, then the
> > kernel shouldn't be reclaiming.
> >
> > For KVM, I want to completely eliminate the TLB flush[2] for all architectures
> > where it's architecturally legal. Because for KVM, the flushes are often even
> > more expensive than they are for the primary MMU, e.g. due to lack of batching,
> > the cost of VM-Exit => VM-Enter (for architectures without broadcast flushes).
> >
> > [1] https://lore.kernel.org/all/CAOUHufYCmYNngmS=rOSAQRB0N9ai+mA0aDrB9RopBvPHEK42Ng@xxxxxxxxxxxxxx
> > [2] https://lore.kernel.org/all/Zmnbb-Xlyz4VXNHI@xxxxxxxxxx
>
> It makes sense. Thanks for the explanation and the provided links!
>
> Thinking more about the prefetched SPTEs: even though the A bit tolerates false
> negatives, do you think we could still add a small optimization to grant the A
> bit to prefetched SPTEs if the old_spte already has it set? That way, if a
> prefault happens right after a real fault, the A bit would not be cleared,
> relying on the fact that KVM doesn't change PFNs without first zapping the old
> SPTE. (But I'm not sure if you have already covered this optimization in the
> mega-series.)
Already handled :-)
Though I need to test the shadow MMU code, as I was initially thinking the issue
was unique to the TDP MMU (no idea why I thought that).
Hmm, though I think something like what you've proposed may be in order. There
are currently four cases where @prefetch will be true:
1. Prefault
2. Async #PF
3. Prefetching
4. FNAME(sync_spte)
For 1-3, KVM shouldn't overwrite an existing shadow-present SPTE, which is what
my code does.
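In the TDP MMU, that amounts to a one-liner along these lines (a sketch of the
idea only, not the exact code from the mega-series):

	/*
	 * Sketch: for prefaults, async #PFs, and prefetching, don't clobber an
	 * existing shadow-present SPTE, e.g. so that its Accessed state is
	 * preserved.
	 */
	if (fault->prefetch && is_shadow_present_pte(iter->old_spte))
		return RET_PF_SPURIOUS;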
But for 4, FNAME(sync_spte) _needs_ to overwrite an existing SPTE. And, ignoring
the awful A/D-disabled case, FNAME(prefetch_invalid_gpte) ensures the gPTE is
Accessed, i.e. there's no strong argument for clearing the Accessed bit.
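For reference, the gPTE check in FNAME(prefetch_invalid_gpte) is roughly (quoting
from memory):

	/* Prefetch only accessed entries (unless A/D bits are disabled). */
	if (PT_HAVE_ACCESSED_DIRTY(vcpu->arch.mmu) &&
	    !(gpte & PT_GUEST_ACCESSED_MASK))
		goto no_present;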
Hrm, but the _only_ "speculative" access type when commit 947da5383069 ("KVM:
MMU: Set the accessed bit on non-speculative shadow ptes") went in was the
FNAME(sync_spte) case (at the time called FNAME(update_pte)). I.e. KVM deliberately
clears the Accessed bit for that case.
But, I'm unconvinced that's actually appropriate. As above, the gPTE is accessed,
and kvm_sync_spte() ensures the SPTE is shadow-present (with an asterisk, because
it deliberately doesn't use is_shadow_present_pte(), so that KVM can sync MMIO
SPTEs).
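I.e. kvm_sync_spte() skips only zero SPTEs, roughly (again from memory, not a
verbatim quote of mmu.c):

	static int kvm_sync_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
	{
		/* Skip only zero SPTEs so that non-present MMIO SPTEs still get synced. */
		if (!sp->spt[i])
			return 0;

		return vcpu->arch.mmu->sync_spte(vcpu, sp, i);
	}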
Aha! Before commit c76e0ad27084 ("KVM: x86/mmu: Mark page/folio accessed only
when zapping leaf SPTEs"), clearing the Accessed bit kinda sorta makes sense,
because KVM propagates the information to the underlying struct page. But when
that code is removed, KVM's SPTEs are the "source of truth" so to speak.
Double aha! Spurious clearing of the Accessed (and Dirty) was mitigated by commit
e6722d9211b2 ("KVM: x86/mmu: Reduce the update to the spte in FNAME(sync_spte)"),
which changed FNAME(sync_spte) to only overwrite SPTEs if the protections are
actually changing.
So at the very least, somewhere in all of this, we should do as you suggest and
explicitly preserve the Accessed bit. Though I'm quite tempted to be more aggressive
and always mark the SPTE as accessed when synchronizing, because odds are very
good that the guest still cares about the page that's pointed at by the unsync
SPTE. E.g. the most common case where FNAME(sync_spte) actually overwrites an
existing SPTE is CoW in the guest.
I'll think a bit more on this, and either go with a variant of the below, or
something like:
	if (!prefetch || synchronizing)
		spte |= spte_shadow_accessed_mask(spte);
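where "synchronizing" would be a new parameter plumbed from the FNAME(sync_spte)
path down into make_spte(), i.e. something like (sketch only):

	bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
		       const struct kvm_memory_slot *slot,
		       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
		       u64 old_spte, bool prefetch, bool synchronizing,
		       bool can_unsync, bool host_writable, u64 *new_spte);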
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -163,6 +163,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	int level = sp->role.level;
>  	u64 spte = SPTE_MMU_PRESENT_MASK;
>  	bool wrprot = false;
> +	bool remove_accessed = prefetch && (!is_shadow_present_pte(old_spte) ||
> +			!is_last_spte(old_spte, level) || !is_accessed_spte(old_spte));
>
>  	/*
>  	 * For the EPT case, shadow_present_mask has no RWX bits set if
> @@ -178,7 +180,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  		spte |= SPTE_TDP_AD_WRPROT_ONLY;
>
>  	spte |= shadow_present_mask;
> -	if (!prefetch)
> +	if (!remove_accessed)
>  		spte |= spte_shadow_accessed_mask(spte);
>
>  	/*
> @@ -259,7 +261,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  		spte |= spte_shadow_dirty_mask(spte);
>
>  out:
> -	if (prefetch)
> +	if (remove_accessed)
>  		spte = mark_spte_for_access_track(spte);
>
>  	WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),