Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock

From: James Houghton
Date: Mon Sep 09 2024 - 16:47:22 EST


On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Sep 09, 2024, James Houghton wrote:
> > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > + */
> > > +#define KVM_RMAP_LOCKED BIT(1)
> > > +
> > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > > +{
> > > + unsigned long old_val, new_val;
> > > +
> > > + old_val = READ_ONCE(rmap_head->val);
> > > + if (!old_val)
> > > + return 0;
> > > +
> > > + do {
> > > + /*
> > > + * If the rmap is locked, wait for it to be unlocked before
> > > + * trying to acquire the lock, e.g. to bounce the cache line.
> > > + */
> > > + while (old_val & KVM_RMAP_LOCKED) {
> > > + old_val = READ_ONCE(rmap_head->val);
> > > + cpu_relax();
> > > + }
> > > +
> > > + /*
> > > + * Recheck for an empty rmap, it may have been purged by the
> > > + * task that held the lock.
> > > + */
> > > + if (!old_val)
> > > + return 0;
> > > +
> > > + new_val = old_val | KVM_RMAP_LOCKED;
> > > + } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> >
> > I think we (technically) need an smp_rmb() here. I think cmpxchg
> > implicitly has that on x86 (and this code is x86-only), but should we
> > nonetheless document that we need smp_rmb() (if it is indeed required)?
> > Perhaps we could/should condition the smp_rmb() on `if (old_val)`.
>
> Hmm, no, not smp_rmb(). If anything, the appropriate barrier here would be
> smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
> even wrong.

I don't think smp_mb__after_spinlock() is right either. It seems to be
used after acquiring a spinlock to promote the memory ordering from the
acquire barrier that is implicit in the lock acquisition (e.g. [1]) to
a full barrier. IIUC, we have no need for a stronger-than-usual barrier
here. But I guess I'm not really sure.
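
For reference, the pattern I understand it to be for (a generic sketch,
not code from this series):

static void example_after_spinlock(spinlock_t *lock)
{
	spin_lock(lock);
	/*
	 * Upgrade the ACQUIRE implied by spin_lock() to a full barrier,
	 * e.g. so that stores made before taking the lock cannot be
	 * reordered past loads inside the critical section.  Nop on x86.
	 */
	smp_mb__after_spinlock();

	/* ... critical section ... */

	spin_unlock(lock);
}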

In this case, I'm complaining that we don't have the usual memory
ordering restrictions that come with a spinlock.

> For the !old_val case, there is an address/data dependency that can't be broken by
> the CPU without violating the x86 memory model (all future actions with relevant
> memory loads depend on rmap_head->val being non-zero). And AIUI, in the Linux
> kernel memory model, READ_ONCE() is responsible for ensuring that the address
> dependency can't be morphed into a control dependency by the compiler and
> subsequently reordered by the CPU.
>
> I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I
> don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of
> __READ_ONCE() promotes the operation to an acquire.
>
> Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock,
> hence my smp_mb__after_spinlock() suggestion. Though _because_ it's a spinlock,
> the rmaps are fully protected by the critical section.

I feel like a spinlock must include the appropriate barriers for it to
correctly function as a spinlock, so I'm not sure I fully understand
what you mean here.
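
To spell out what I mean by "appropriate barriers" (purely
illustrative, not something I'm proposing for this patch): a minimal
lock needs an ACQUIRE on the lock side and a RELEASE on the unlock
side, i.e. something like:

struct toy_lock {
	unsigned long val;
};

static void toy_lock_acquire(struct toy_lock *lock)
{
	/*
	 * ACQUIRE: accesses inside the critical section cannot be
	 * satisfied before the lock is observed to be free and taken.
	 */
	while (cmpxchg_acquire(&lock->val, 0, 1) != 0)
		cpu_relax();
}

static void toy_lock_release(struct toy_lock *lock)
{
	/*
	 * RELEASE: everything written inside the critical section is
	 * published before the lock is observed to be free again.
	 */
	smp_store_release(&lock->val, 0);
}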

> And for the SPTEs, there
> is no required ordering. The reader (aging thread) can observe a !PRESENT or a
> PRESENT SPTE, and must be prepared for either. I.e. there is no requirement that
> the reader observe a PRESENT SPTE if there is a valid rmap.

This makes sense.

> So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
> even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
> an ordering requirement that doesn't exist.

So we have: the general kvm_rmap_lock() and the read-only
kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use
those names (sorry if it's confusing).

For kvm_rmap_lock(), we are always holding mmu_lock for writing. So
any changes we make to the rmap will be properly published to other
threads that subsequently grab kvm_rmap_lock(), because we had to
release mmu_lock and they have to re-acquire it, and that
release/acquire pair comes with the barriers I'm saying we need.
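
Spelled out (just a sketch of the ordering I'm relying on, with the
rmap lock/unlock details elided):

	/* CPU0: modifies an rmap, holding mmu_lock for write. */
	write_lock(&kvm->mmu_lock);
	/* kvm_rmap_lock(), pte_list_add()/pte_list_remove(), kvm_rmap_unlock() */
	write_unlock(&kvm->mmu_lock);	/* RELEASE */

	/* CPU1: later touches the same rmap, also under mmu_lock. */
	write_lock(&kvm->mmu_lock);	/* ACQUIRE, pairs with the release above */
	/* ... guaranteed to observe CPU0's rmap/pte_list_desc updates ... */
	write_unlock(&kvm->mmu_lock);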

For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no
smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that
it is possible for us to observe external changes to the pte_list_desc
while we are in this critical section (on a sufficiently weakly
ordered architecture). The changes that the kvm_rmap_lock() (mmu_lock)
side made were half-published with an smp_wmb() (really [3]), but the
read side doesn't use a load-acquire or smp_rmb(), so it hasn't held
up its end of the deal.
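
To make that concrete, this is the sort of interleaving I'm worried
about on a sufficiently weakly ordered architecture (names and details
are illustrative, not the exact code in this series):

	/* CPU0: writer, under mmu_lock, appends an spte to an existing desc. */
	old_val = kvm_rmap_lock(rmap_head);
	desc->sptes[desc->spte_count++] = sptep;	/* A */
	WRITE_ONCE(rmap_head->val, old_val);		/* B: unlock, drops KVM_RMAP_LOCKED */

	/* CPU1: reader, kvm_rmap_lock_readonly() + lockless walk, no mmu_lock. */
	val = READ_ONCE(rmap_head->val);		/* C */
	/* ... try_cmpxchg() sets KVM_RMAP_LOCKED, desc derived from val ... */
	spte = desc->sptes[desc->spte_count - 1];	/* D */

The usual lock guarantee is that everything the previous holder wrote
(A) is visible to the next holder (D). The writer side arguably gets
its half of that from mmu_lock / [3]; my complaint is that C is a
plain READ_ONCE() + cmpxchg with no smp_rmb() before D, so the reader
never supplies the matching half. On x86 this is moot (TSO plus LOCK
CMPXCHG), which is why everything works as written today.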

I don't think READ_ONCE() has the guarantees we need to be a
sufficient replacement for smp_rmb() or a load-acquire that a real
lock would use, although I agree with you that, on arm64, it
apparently *is* a sufficient replacement.

Now this isn't a problem if the kvm_rmap_lock_readonly() side can
tolerate changes to the pte_list_desc while in the critical section. I
don't think it can (given for_each_rmap_spte_lockless), so an
smp_rmb() or equivalent is (technically) needed; see the sketch below
for where I'd expect it to go.
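
Concretely, I'm imagining something like this at the end of the lock
routine (just a sketch; on x86 the LOCK CMPXCHG already provides the
ordering, and an acquire-flavored cmpxchg would presumably do the job
too):

	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));

	/*
	 * Pair with whatever published the rmap/pte_list_desc contents
	 * on the writer side, so that everything the previous holder
	 * wrote is visible before we start walking the rmap.  Nop on
	 * x86.
	 */
	smp_rmb();

	return old_val;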

Am I confused?

(Though all of this works just fine as written on x86.)

[1]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L62
[2]: https://lore.kernel.org/kvm/20240809194335.1726916-21-seanjc@xxxxxxxxxx/
[3]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L190