Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock

From: James Houghton
Date: Mon Sep 09 2024 - 20:29:58 EST


On Mon, Sep 9, 2024 at 3:02 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Sep 09, 2024, James Houghton wrote:
> > On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Sep 09, 2024, James Houghton wrote:
> > > > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > > + */
> > > > > +#define KVM_RMAP_LOCKED BIT(1)
> > > > > +
> > > > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > > > > +{
> > > > > + unsigned long old_val, new_val;
> > > > > +
> > > > > + old_val = READ_ONCE(rmap_head->val);
> > > > > + if (!old_val)
> > > > > + return 0;
> > > > > +
> > > > > + do {
> > > > > + /*
> > > > > + * If the rmap is locked, wait for it to be unlocked before
> > > > > + * trying to acquire the lock, e.g. to avoid bouncing the cache line.
> > > > > + */
> > > > > + while (old_val & KVM_RMAP_LOCKED) {
> > > > > + old_val = READ_ONCE(rmap_head->val);
> > > > > + cpu_relax();
> > > > > + }
> > > > > +
> > > > > + /*
> > > > > + * Recheck for an empty rmap, it may have been purged by the
> > > > > + * task that held the lock.
> > > > > + */
> > > > > + if (!old_val)
> > > > > + return 0;
> > > > > +
> > > > > + new_val = old_val | KVM_RMAP_LOCKED;
> > > > > + } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> > > >
> > > > I think we (technically) need an smp_rmb() here. I think cmpxchg
> > > > implicitly has that on x86 (and this code is x86-only), but should we
> > > > nonetheless document that we need smp_rmb() (if it indeed required)?
> > > > Perhaps we could/should condition the smp_rmb() on `if (old_val)`.
> > >
> > > Hmm, no, not smp_rmb(). If anything, the appropriate barrier here would be
> > > smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
> > > even wrong.
> >
> > I don't think smp_mb__after_spinlock() is right either. This seems to
> > be used following the acquisition of a spinlock to promote the memory
> > ordering from an acquire barrier (that is implicit with the lock
> > acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a
> > stronger-than-usual barrier. But I guess I'm not really sure.
> >
> > In this case, I'm complaining that we don't have the usual memory
> > ordering restrictions that come with a spinlock.
>
> What makes you think that?

OK, I was under the impression that try_cmpxchg() did not carry memory
ordering guarantees (i.e., I thought it was try_cmpxchg_relaxed()).
Sorry....

So the way I would write this is with try_cmpxchg_relaxed() in the
loop and then smp_rmb() after we break out of the loop, at least for
the `old_val != 0` case. Kinda like this [4].

Did you really want try_cmpxchg(), not try_cmpxchg_relaxed()?

Just comparing against bit_spin_lock: test_and_set_bit_lock()
documents the requirement for acquire barrier semantics [5].

[4]: https://elixir.bootlin.com/linux/v6.10.9/source/kernel/locking/rtmutex.c#L253
[5]: https://elixir.bootlin.com/linux/v6.10.9/source/include/asm-generic/bitops/instrumented-lock.h#L51

>
> > > For the !old_val case, there is a address/data dependency that can't be broken by
> > > the CPU without violating the x86 memory model (all future actions with relevant
> > > memory loads depend on rmap_head->val being non-zero). And AIUI, in the Linux
> > > kernel memory model, READ_ONCE() is responsible for ensuring that the address
> > > dependency can't be morphed into a control dependency by the compiler and
> > > subsequently reordered by the CPU.
> > >
> > > I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I
> > > don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of
> > > __READ_ONCE() promotes the operation to an acquire.
> > >
> > > Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock,
> > > hence my smp_mb__after_spinlock() suggestion. Though _because_ it's a spinlock,
> > > the rmaps are fully protected by the critical section.
> >
> > I feel like a spinlock must include the appropriate barriers for it to
> > correctly function as a spinlock, so I'm not sure I fully understand
> > what you mean here.
>
> On TSO architectures, the atomic _is_ the barrier. E.g. atomic_try_cmpxchg_acquire()
> eventually resolves to atomic_try_cmpxchg() on x86.

Yeah I'm with you here.

> And jumping back to the
> "we don't have the usual memory ordering restrictions that come with a spinlock",
> x86's virt_spin_lock() uses atomic_try_cmpxchg(). So while using the acquire
> variant here is obviously not wrong, it also feels somewhat weird.

Yeah that's fine. atomic_try_cmpxchg() is at least as strong as
atomic_try_cmpxchg_acquire(), so there is no issue.

But if virt_spin_lock() were written to use
atomic_try_cmpxchg_relaxed() (and nothing else) instead, then you'd
complain, right? It would work on x86 (I think?), but it's not written
properly! That's basically what I'm saying in this thread.

> Though some
> of that is undoubtedly due to explicit "acquire" semantics being rather rare in
> x86.
>
> > > And for the SPTEs, there is no required ordering. The reader (aging
> > > thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepared for
> > > either. I.e. there is no requirement that the reader observe a PRESENT
> > > SPTE if there is a valid rmap.
> >
> > This makes sense.
> >
> > > So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
> > > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
> > > an ordering requirement that doesn't exist.
> >
> > So we have: the general kvm_rmap_lock() and the read-only
> > kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use
> > those names (sorry if it's confusing).
> >
> > For kvm_rmap_lock(), we are always holding mmu_lock for writing. So
> > any changes we make to the rmap will be properly published to other
> > threads that subsequently grab kvm_rmap_lock() because we had to
> > properly release and then re-acquire mmu_lock, which comes with the
> > barriers I'm saying we need.
> >
> > For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no
> > smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that
> > it is possible that we may observe external changes to the
> > pte_list_desc while we are in this critical section (on a
> > sufficiently weakly ordered architecture). The changes that the kvm_rmap_lock()
> > (mmu_lock) side made were half-published with an smp_wmb() (really
> > [3]), but the read side didn't use a load-acquire or smp_rmb(), so it
> > hasn't held up its end of the deal.
> >
> > I don't think READ_ONCE() has the guarantees we need to be a
> > sufficient replacement for smp_rmb() or a load-acquire that a real
> > lock would use, although I agree with you that, on arm64, it
> > apparently *is* a sufficient replacement.
> >
> > Now this isn't a problem if the kvm_rmap_lock_readonly() side can
> > tolerate changes to pte_list_desc while in the critical section. I
> > don't think this is true (given for_each_rmap_spte_lockless),
> > therefore an smp_rmb() or equivalent is (technically) needed.
> >
> > Am I confused?
>
> Yes, I think so. kvm_rmap_lock_readonly() creates a critical section that prevents
> any pte_list_desc changes. rmap_head->val, and every pte_list_desc that is pointed
> at by rmap_head->val in the KVM_RMAP_MULTI case, is protected and cannot change.

I take back what I said about this working on x86. I think it's
possible for there to be a race.

Say...

1. T1 modifies pte_list_desc, then releases the lock via kvm_rmap_unlock().
2. T2 then locks kvm_rmap_lock_readonly().

The modifications that T1 has made are not guaranteed to be visible to
T2 unless T1 has an smp_wmb() (or equivalent) after the modification
and T2 has an smp_rmb() before reading the data.

Now the way you had it, T2, because it uses try_cmpxchg() with full
ordering, will effectively do an smp_rmb(). But T1 only does an
smp_wmb() *after dropping the mmu_lock*, so there is a race window:
after T1 has released the kvm_rmap_lock() but while it still holds the
mmu_lock, T2 may enter its critical section and only observe the
changes that T1 made *later*, i.e., it may read stale data first.

Now this is impossible on x86 (IIUC) if, in the compiled stream of
instructions, T1's writes occur in the same order that we have written
them in C. I'm not sure WRITE_ONCE() forbids that compile-time
reordering: it orders the access against other volatile accesses, but
not against plain stores.

So what I'm saying is:

1. kvm_rmap_unlock() must have an smp_wmb().
2. If you change kvm_rmap_lock() to use try_cmpxchg_relaxed() (which
is what I think you want), then you must also have an smp_rmb()
following a successful cmpxchg/acquisition (at least for the case
where we then follow the pte_list_desc pointer).

> The SPTE _value_ that is pointed at by rmap_head->val or pte_list_desc.sptes[]
> can change, but the pointers themselves cannot. And with aging, the code is
> completely tolerant of an unstable SPTE _value_ because test-only doesn't care
> about false negatives/positives, and test-and-clear is itself an atomic access
> i.e. won't corrupt a SPTE (and is also tolerant of false positives/negatives).

I think we're on the same page for the rest of this.

>
> >
> > (Though all of this works just fine as written on x86.)
> >
> > [1]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L62
> > [2]: https://lore.kernel.org/kvm/20240809194335.1726916-21-seanjc@xxxxxxxxxx/
> > [3]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L190