Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock

From: Sean Christopherson
Date: Mon Sep 09 2024 - 18:02:59 EST


On Mon, Sep 09, 2024, James Houghton wrote:
> On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Mon, Sep 09, 2024, James Houghton wrote:
> > > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > + */
> > > > +#define KVM_RMAP_LOCKED BIT(1)
> > > > +
> > > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > > > +{
> > > > + unsigned long old_val, new_val;
> > > > +
> > > > + old_val = READ_ONCE(rmap_head->val);
> > > > + if (!old_val)
> > > > + return 0;
> > > > +
> > > > + do {
> > > > + /*
> > > > + * If the rmap is locked, wait for it to be unlocked before
> > > > + * trying acquire the lock, e.g. to bounce the cache line.
> > > > + */
> > > > + while (old_val & KVM_RMAP_LOCKED) {
> > > > + old_val = READ_ONCE(rmap_head->val);
> > > > + cpu_relax();
> > > > + }
> > > > +
> > > > + /*
> > > > + * Recheck for an empty rmap, it may have been purged by the
> > > > + * task that held the lock.
> > > > + */
> > > > + if (!old_val)
> > > > + return 0;
> > > > +
> > > > + new_val = old_val | KVM_RMAP_LOCKED;
> > > > + } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> > >
> > > I think we (technically) need an smp_rmb() here. I think cmpxchg
> > > implicitly has that on x86 (and this code is x86-only), but should we
> > > nonetheless document that we need smp_rmb() (if it indeed required)?
> > > Perhaps we could/should condition the smp_rmb() on `if (old_val)`.
> >
> > Hmm, no, not smp_rmb(). If anything, the appropriate barrier here would be
> > smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
> > even wrong.
>
> I don't think smp_mb__after_spinlock() is right either. This seems to
> be used following the acquisition of a spinlock to promote the memory
> ordering from an acquire barrier (that is implicit with the lock
> acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a
> stronger-than-usual barrier. But I guess I'm not really sure.
>
> In this case, I'm complaining that we don't have the usual memory
> ordering restrictions that come with a spinlock.

What makes you think that?

> > For the !old_val case, there is a address/data dependency that can't be broken by
> > the CPU without violating the x86 memory model (all future actions with relevant
> > memory loads depend on rmap_head->val being non-zero). And AIUI, in the Linux
> > kernel memory model, READ_ONCE() is responsible for ensuring that the address
> > dependency can't be morphed into a control dependency by the compiler and
> > subsequently reordered by the CPU.
> >
> > I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I
> > don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of
> > __READ_ONCE() promotes the operation to an acquire.
> >
> > Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock,
> > hence my smp_mb__after_spinlock() suggestion. Though _because_ it's a spinlock,
> > the rmaps are fully protected by the critical section.
>
> I feel like a spinlock must include the appropriate barriers for it to
> correctly function as a spinlock, so I'm not sure I fully understand
> what you mean here.

On TSO architectures, the atomic _is_ the barrier. E.g. atomic_try_cmpxchg_acquire()
eventually resolves to atomic_try_cmpxchg() on x86. And jumping back to the
"we don't have the usual memory ordering restrictions that come with a spinlock",
x86's virt_spin_lock() uses atomic_try_cmpxchg(). So while using the acquire
variant here is obviously not wrong, it also feels somewhat weird. Though some
of that is undoubtedly due to explicit "acquire" semantics being rather rare in
x86.

> > And for the SPTEs, there is no required ordering. The reader (aging
> > thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepared for
> > either. I.e. there is no requirement that the reader observe a PRESENT
> > SPTE if there is a valid rmap.
>
> This makes sense.
>
> > So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
> > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
> > an ordering requirement that doesn't exist.
>
> So we have: the general kvm_rmap_lock() and the read-only
> kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use
> those names (sorry if it's confusing).
>
> For kvm_rmap_lock(), we are always holding mmu_lock for writing. So
> any changes we make to the rmap will be properly published to other
> threads that subsequently grab kvm_rmap_lock() because we had to
> properly release and then re-acquire mmu_lock, which comes with the
> barriers I'm saying we need.
>
> For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no
> smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that
> it is possible that there may observe external changes to the
> pte_list_desc while we are in this critical section (for a
> sufficiently weak architecture). The changes that the kvm_rmap_lock()
> (mmu_lock) side made were half-published with an smp_wmb() (really
> [3]), but the read side didn't use a load-acquire or smp_rmb(), so it
> hasn't held up its end of the deal.
>
> I don't think READ_ONCE() has the guarantees we need to be a
> sufficient replacement for smp_rmb() or a load-acquire that a real
> lock would use, although I agree with you that, on arm64, it
> apparently *is* a sufficient replacement.
>
> Now this isn't a problem if the kvm_rmap_lock_readonly() side can
> tolerate changes to pte_list_desc while in the critical section. I
> don't think this is true (given for_each_rmap_spte_lockless),
> therefore an smp_rmb() or equivalent is (technically) needed.
>
> Am I confused?

Yes, I think so. kvm_rmap_lock_readonly() creates a critical section that prevents
any pte_list_desc changes. rmap_head->val, and every pte_list_desc that is pointed
at by rmap_head->val in the KVM_RMAP_MULTI case, is protected and cannot change.

The SPTE _value_ that is pointed at by rmap_head->val or pte_list_desc.sptes[]
can change, but the pointers themselves cannot. And with aging, the code is
completely tolerant of an instable SPTE _value_ because test-only doesn't care
about false negatives/positives, and test-and-clear is itself an atomic access
i.e. won't corrupt a SPTE (and is also tolerant of false positives/negatives).

>
> (Though all of this works just fine as written on x86.)
>
> [1]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L62
> [2]: https://lore.kernel.org/kvm/20240809194335.1726916-21-seanjc@xxxxxxxxxx/
> [3]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L190