Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock

From: Sean Christopherson
Date: Tue Sep 03 2024 - 17:27:31 EST


On Tue, Sep 03, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > +/*
> > + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
> > + * operates with mmu_lock held for write), but rmaps can be walked without
> > + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
> > + * being zapped/dropped _while the rmap is locked_.
> > + *
> > + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
> > + * done while holding mmu_lock for write. This allows a task walking rmaps
> > + * without holding mmu_lock to concurrently walk the same entries as a task
> > + * that is holding mmu_lock but _not_ the rmap lock. Neither task will modify
> > + * the rmaps, thus the walks are stable.
> > + *
> > + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
> > + * only the rmap chains themselves are protected. E.g. holding an rmap's lock
> > + * ensures all "struct pte_list_desc" fields are stable.
> > + */
> > +#define KVM_RMAP_LOCKED BIT(1)
> > +
> > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > +{
> > +	unsigned long old_val, new_val;
> > +
> > +	old_val = READ_ONCE(rmap_head->val);
> > +	if (!old_val)
> > +		return 0;
>
> I'm having trouble understanding how this bit works. What exactly is
> stopping the rmap from being populated while we have it "locked"?

Nothing prevents the 0=>1 transition, but that's a-ok because walking rmaps for
aging only cares about existing mappings. The key to correctness is that this
helper returns '0' when there are no rmaps, i.e. the caller is guaranteed to do
nothing and thus will never see any rmaps that come along in the future.
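
E.g. a lockless aging walk ends up looking roughly like the sketch below (names
are placeholders, not the exact code in this series; kvm_rmap_lock_readonly()
is the assumed lockless counterpart of the unlocker shown further down):

static bool kvm_age_rmap_lockless(struct kvm_rmap_head *rmap_head)
{
	unsigned long rmap_val = kvm_rmap_lock_readonly(rmap_head);
	bool young = false;

	/* Empty rmap => nothing was locked and there is nothing to age. */
	if (!rmap_val)
		return false;

	/* ...walk the SPTE/pte_list_desc chain encoded in rmap_val... */

	kvm_rmap_unlock_readonly(rmap_head, rmap_val);
	return young;
}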

> We aren't holding the MMU lock at all in the lockless case, and given
> this bit, it is impossible (I think?) for the MMU-write-lock-holding,
> rmap-modifying side to tell that this rmap is locked.
>
> Concretely, my immediate concern is that we will still unconditionally
> write 0 back at unlock time even if the value has changed.

The "readonly" unlocker (not added in this patch) is a nop if the rmap was empty,
i.e. wasn't actually locked.

+static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
+				      unsigned long old_val)
+{
+	if (!old_val)
+		return;
+
+	KVM_MMU_WARN_ON(old_val != (rmap_head->val & ~KVM_RMAP_LOCKED));
+	WRITE_ONCE(rmap_head->val, old_val);
+}

The TODO in kvm_rmap_lock() pretty much sums things up: it's safe to call the
"normal", non-readonly versions if and only if mmu_lock is held for write.

+static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+{
+	/*
+	 * TODO: Plumb in @kvm and add a lockdep assertion that mmu_lock is
+	 * held for write.
+	 */
+	return __kvm_rmap_lock(rmap_head);
+}
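
For the write side, the assumed usage is roughly the sketch below (again not
pasted from the patch; in particular, kvm_rmap_unlock() taking the new rmap
value to publish is an assumption here): take the non-readonly lock while
holding mmu_lock for write, update the chain, and publish the new value on
unlock, which also clears KVM_RMAP_LOCKED.

static void rmap_modify_sketch(struct kvm_rmap_head *rmap_head, u64 *sptep)
{
	unsigned long old_val, new_val;

	/* Caller must hold mmu_lock for write. */
	old_val = kvm_rmap_lock(rmap_head);

	/*
	 * ...link @sptep into the chain, computing new_val (either a bare
	 * SPTE pointer or a pointer to a pte_list_desc)...
	 */
	new_val = old_val;	/* placeholder for the real update */

	/* Publishing new_val clears KVM_RMAP_LOCKED as a side effect. */
	kvm_rmap_unlock(rmap_head, new_val);
}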