Re: [PATCH] KVM: x86/tdp_mmu: Trigger the callback only when an interesting change

From: Sean Christopherson
Date: Tue Oct 08 2024 - 15:15:24 EST


On Fri, Sep 27, 2024, Sean Christopherson wrote:
> On Fri, Sep 27, 2024, Sean Christopherson wrote:
> > On Fri, Sep 27, 2024, Sean Christopherson wrote:

...

> > > Oh, right, I forgot about that. I'll tweak the changelog to call that out before
> > > posting. Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm:
> > > x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if
> > > anything should be backported it's that commit.
> >
> > Actually, a better idea. I think it makes sense to fully commit to not flushing
> > when overwriting SPTEs, and instead rely on the dirty logging logic to do a remote
> > TLB flush.
>
> Oooh, but there's a bug.

Nope, there's not.

> KVM can tolerate/handle stale Dirty/Writable TLB entries when dirty logging,
> but KVM cannot tolerate stale Writable TLB entries when write- protecting for
> shadow paging. The TDP MMU always flushes when clearing the MMU- writable
> flag (modulo a bug that would cause KVM to make the SPTE !MMU-writable in the
> page fault path), but the shadow MMU does not.
>
> So I'm pretty sure we need the below, and then it may or may not make sense to have
> a common "flush needed" helper (outside of the write-protecting flows, KVM probably
> should WARN if MMU-writable is cleared).
>
> ---
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ce8323354d2d..7bd9c296f70e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -514,9 +514,12 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
> /* Rules for using mmu_spte_update:
> * Update the state bits, it means the mapped pfn is not changed.
> *
> - * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
> - * TLBs must be flushed. Otherwise rmap_write_protect will find a read-only
> - * spte, even though the writable spte might be cached on a CPU's TLB.
> + * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for
> + * write-tracking, remote TLBs must be flushed, even if the SPTE was read-only,
> + * as KVM allows stale Writable TLB entries to exist. When dirty logging, KVM
> + * flushes TLBs based on whether or not dirty bitmap/ring entries were reaped,
> + * not whether or not SPTEs were modified, i.e. only the write-protected case
> + * needs to precisely flush when modifying SPTEs.
> *
> * Returns true if the TLB needs to be flushed
> */
> @@ -533,8 +536,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
> * we always atomically update it, see the comments in
> * spte_has_volatile_bits().
> */
> - if (is_mmu_writable_spte(old_spte) &&
> - !is_writable_pte(new_spte))
> + if (is_mmu_writable_spte(old_spte) && !is_mmu_writable_spte(new_spte))

It took me forever and a day to realize this, but !is_writable_pte(new_spte) is
correct, because the logic is checking if the new SPTE is !Writable, it's *not*
checking to see if the Writable bit is _cleared_. I.e. KVM will flush if the
old SPTE is read-only but MMU-writable.

That said, I'm still going to include this change, albet with a drastically
different changelog. Checking is_mmu_writable_spte() instead of is_writable_pte()
is still desirable, as it avoids unnecessary TLB flushes in the rare case where
KVM "refreshes" a !Writable SPTE. Of course, with the other change to not clobber
SPTEs when prefetching, that scenario becomes even more rare, but it's still worth
doing, especially since IMO it makes it more obvious when KVM _does_ need to do a
remote TLB flush (before dropping mmu_lock).