Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock

From: Sean Christopherson
Date: Mon Sep 09 2024 - 21:43:10 EST


On Mon, Sep 09, 2024, James Houghton wrote:
> On Mon, Sep 9, 2024 at 3:02 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Mon, Sep 09, 2024, James Houghton wrote:
> > > On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > Hmm, no, not smp_rmb(). If anything, the appropriate barrier here would be
> > > > smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
> > > > even wrong.
> > >
> > > I don't think smp_mb__after_spinlock() is right either. This seems to
> > > be used following the acquisition of a spinlock to promote the memory
> > > ordering from an acquire barrier (that is implicit with the lock
> > > acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a
> > > stronger-than-usual barrier. But I guess I'm not really sure.
> > >
> > > In this case, I'm complaining that we don't have the usual memory
> > > ordering restrictions that come with a spinlock.
> >
> > What makes you think that?
>
> Ok I was under the impression that try_cmpxchg() did not carry memory
> ordering guarantees (i.e., I thought it was try_cmpxchg_relaxed()).
> Sorry....
>
> So the way I would write this is with try_cmpxchg_relaxed() in the
> loop and then smp_rmb() after we break out of the loop, at least for
> the `old_val != 0` case. Kinda like this [4].
>
> Did you really want try_cmpxchg(), not try_cmpxchg_relaxed()?

Heh, why would I care? On x86, they are the same thing. Yeah, I could tease out
some subtle difference to super precisely describe KVM's behavior, but that more
or less defeats the value proposition of total-store ordered architectures: take
on more complexity in hardware (and possibly a loss in performance) to allow for
simpler software.
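
For the record, the two shapes being debated look something like this (hypothetical
sketch only, not the patch verbatim; KVM_RMAP_LOCKED's value and the exact loop
structure are placeholders):

#define KVM_RMAP_LOCKED	BIT(1)

static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
{
	unsigned long old_val, new_val;

	do {
		/* Wait for the lock bit to clear before trying to take it. */
		old_val = READ_ONCE(rmap_head->val);
		while (old_val & KVM_RMAP_LOCKED) {
			cpu_relax();
			old_val = READ_ONCE(rmap_head->val);
		}

		new_val = old_val | KVM_RMAP_LOCKED;

		/*
		 * Shape #1: fully-ordered try_cmpxchg().  On x86 this is
		 * LOCK CMPXCHG, i.e. a full barrier, so the pte_list_desc
		 * walk that follows is ordered "for free".
		 *
		 * Shape #2 (the rtmutex-style proposal): try_cmpxchg_relaxed()
		 * here, plus an explicit smp_rmb() after the loop.  On x86 the
		 * two shapes generate the same instructions: the relaxed
		 * cmpxchg is still LOCK CMPXCHG, and smp_rmb() is just a
		 * compiler barrier.
		 */
	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));

	return old_val;
}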

> Just comparing against bit_spin_lock, test_and_set_bit_lock()
> documents the requirement for acquire barrier semantics[5].

Again, that's generic library code.

> [4]: https://elixir.bootlin.com/linux/v6.10.9/source/kernel/locking/rtmutex.c#L253
> [5]: https://elixir.bootlin.com/linux/v6.10.9/source/include/asm-generic/bitops/instrumented-lock.h#L51

...

> > And jumping back to the "we don't have the usual memory ordering
> > restrictions that come with a spinlock", x86's virt_spin_lock() uses
> > atomic_try_cmpxchg(). So while using the acquire variant here is obviously
> > not wrong, it also feels somewhat weird.
>
> Yeah that's fine. atomic_try_cmpxchg() is at least as strong as
> atomic_try_cmpxchg_acquire(), so there is no issue.
>
> But if virt_spin_lock() were written to use atomic_try_cmpxchg_relaxed() (and
> nothing else) instead, then you'd complain right?

I'd complain because someone made me think hard for no reason. :-)

> It would work on x86 (I think?), but it's not written properly! That's
> basically what I'm saying in this thread.
>
> > Though some of that is undoubtedly due to explicit "acquire" semantics
> > being rather rare in x86.
> >
> > > > And for the SPTEs, there is no required ordering. The reader (aging
> > > > thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepared for
> > > > either. I.e. there is no requirement that the reader observe a PRESENT
> > > > SPTE if there is a valid rmap.
> > >
> > > This makes sense.
> > >
> > > > So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
> > > > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
> > > > an ordering requirement that doesn't exist.
> > >
> > > So we have: the general kvm_rmap_lock() and the read-only
> > > kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use
> > > those names (sorry if it's confusing).
> > >
> > > For kvm_rmap_lock(), we are always holding mmu_lock for writing. So
> > > any changes we make to the rmap will be properly published to other
> > > threads that subsequently grab kvm_rmap_lock() because we had to
> > > properly release and then re-acquire mmu_lock, which comes with the
> > > barriers I'm saying we need.
> > >
> > > For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no
> > > smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that
> > > it is possible that we may observe external changes to the
> > > pte_list_desc while we are in this critical section (for a
> > > sufficiently weak architecture). The changes that the kvm_rmap_lock()
> > > (mmu_lock) side made were half-published with an smp_wmb() (really
> > > [3]), but the read side didn't use a load-acquire or smp_rmb(), so it
> > > hasn't held up its end of the deal.
> > >
> > > I don't think READ_ONCE() has the guarantees we need to be a
> > > sufficient replacement for smp_rmb() or a load-acquire that a real
> > > lock would use, although I agree with you that, on arm64, it
> > > apparently *is* a sufficient replacement.
> > >
> > > Now this isn't a problem if the kvm_rmap_lock_readonly() side can
> > > tolerate changes to pte_list_desc while in the critical section. I
> > > don't think this is true (given for_each_rmap_spte_lockless),
> > > therefore an smp_rmb() or equivalent is (technically) needed.
> > >
> > > Am I confused?
> >
> > Yes, I think so. kvm_rmap_lock_readonly() creates a critical section that prevents
> > any pte_list_desc changes. rmap_head->val, and every pte_list_desc that is pointed
> > at by rmap_head->val in the KVM_RMAP_MULTI case, is protected and cannot change.
>
> I take back what I said about this working on x86. I think it's
> possible for there to be a race.
>
> Say...
>
> 1. T1 modifies pte_list_desc then unlocks via kvm_rmap_unlock().
> 2. T2 then locks kvm_rmap_lock_readonly().
>
> The modifications that T1 has made are not guaranteed to be visible to
> T2 unless T1 has an smp_wmb() (or equivalent) after the modification
> and T2 has an smp_rmb() before reading the data.
>
> Now the way you had it, T2, because it uses try_cmpxchg() with full
> ordering, will effectively do a smp_rmb(). But T1 only does an
> smp_wmb() *after dropping the mmu_lock*, so there is a race. While T1
> still holds the mmu_lock but after releasing the kvm_rmap_lock(), T2
> may enter its critical section and then *later* observe the changes
> that T1 made.
>
> Now this is impossible on x86 (IIUC) if, in the compiled list of
> instructions, T1's writes occur in the same order that we have written
> them in C. I'm not sure if WRITE_ONCE guarantees that this reordering
> at compile time is forbidden.
>
> So what I'm saying is:
>
> 1. kvm_rmap_unlock() must have an smp_wmb().

No, because beating a dead horse, this is not generic code, this is x86.

> 2. If you change kvm_rmap_lock() to use try_cmpxchg_relaxed() (which
> is what I think you want),

No, I don't. If this were generic code, then it would be a different conversation,
but using a "relaxed" atomic in x86-specific code is nonsensical, because such an
operation simply does not exist in the world of x86. There are memory operations
that have relaxed ordering, but none of them are atomic; they are very much the
exact opposite of atomic.

E.g. the only references to _relaxed that you'll find in x86 are to override the
generics to x86's non-relaxed semantics.

$ git grep _relaxed arch/x86
arch/x86/include/asm/io.h:#define readb_relaxed(a) __readb(a)
arch/x86/include/asm/io.h:#define readw_relaxed(a) __readw(a)
arch/x86/include/asm/io.h:#define readl_relaxed(a) __readl(a)
arch/x86/include/asm/io.h:#define writeb_relaxed(v, a) __writeb(v, a)
arch/x86/include/asm/io.h:#define writew_relaxed(v, a) __writew(v, a)
arch/x86/include/asm/io.h:#define writel_relaxed(v, a) __writel(v, a)
arch/x86/include/asm/io.h:#define readq_relaxed(a) __readq(a)
arch/x86/include/asm/io.h:#define writeq_relaxed(v, a) __writeq(v, a)

There are a few places in KVM x86 where _acquire() and _release() variants are
used, but that's done purely to document the roles and who is doing what, and
almost always (maybe always?) when there are lockless programming shenanigans
in play.

E.g. kvm_recalculate_apic_map() is probably the closest example, where KVM uses
atomic_read_acquire(), atomic_cmpxchg_acquire(), and atomic_cmpxchg_release() to
document the ordering between marking the map as dirty and actually processing
the map. But that is still quite different, as apic_map_dirty is not a spinlock,
there isn't a direct data/address dependency between apic_map_dirty and all of
the data it consumes, _and_ the data that is guarded by apic_map_dirty can be
modified by other vCPUs _while it is being processed_.
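
For reference, the relevant bits of kvm_recalculate_apic_map() look roughly like
this (heavily trimmed, quoting from memory):

void kvm_recalculate_apic_map(struct kvm *kvm)
{
	/* Read apic_map_dirty before apic_map (pairs with the release below). */
	if (atomic_read_acquire(&kvm->arch.apic_map_dirty) == CLEAN)
		return;

	mutex_lock(&kvm->arch.apic_map_lock);

	if (atomic_cmpxchg_acquire(&kvm->arch.apic_map_dirty,
				   DIRTY, UPDATE_IN_PROGRESS) == CLEAN) {
		/* Someone else already processed the map. */
		mutex_unlock(&kvm->arch.apic_map_lock);
		return;
	}

	/* ... rebuild the map; vCPUs can re-dirty it concurrently ... */

	/*
	 * Write the new map before marking the state CLEAN.  If a vCPU
	 * dirtied the map mid-update, the cmpxchg fails and the map stays
	 * DIRTY, i.e. it will be rebuilt again.
	 */
	atomic_cmpxchg_release(&kvm->arch.apic_map_dirty,
			       UPDATE_IN_PROGRESS, CLEAN);
	mutex_unlock(&kvm->arch.apic_map_lock);
}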

Hmm, actually, sev_lock_two_vms() is arguably a better comparison. That's also
a rudimentary spinlock (try-lock only behavior).
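
I.e. a try-lock of roughly this shape (illustrative sketch with made-up names,
not the SEV code verbatim), where the acquire/release variants document the
hand-off rather than paper over a missing barrier:

static bool demo_try_lock(atomic_t *in_progress)
{
	/* The 0 => 1 transition takes the "lock"; non-zero means busy. */
	return atomic_cmpxchg_acquire(in_progress, 0, 1) == 0;
}

static void demo_unlock(atomic_t *in_progress)
{
	atomic_set_release(in_progress, 0);
}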

If kvm_rmap_head.val were an int, i.e. could be unionized with an atomic_t, then
I wouldn't be opposed to doing this in the locking code to document things:

s/READ_ONCE/atomic_read_acquire
s/WRITE_ONCE/atomic_set_release
s/try_cmpxchg/atomic_cmpxchg_acquire
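
E.g., roughly (purely hypothetical, assuming the rmap head really were an
atomic_t, and using atomic_try_cmpxchg_acquire() to keep the try_cmpxchg()
idiom):

static int kvm_rmap_lock(atomic_t *rmap_val)
{
	int old_val, new_val;

	do {
		/* KVM_RMAP_LOCKED as in the sketch earlier in the thread. */
		old_val = atomic_read_acquire(rmap_val);
		while (old_val & KVM_RMAP_LOCKED) {
			cpu_relax();
			old_val = atomic_read_acquire(rmap_val);
		}

		new_val = old_val | KVM_RMAP_LOCKED;
	} while (!atomic_try_cmpxchg_acquire(rmap_val, &old_val, new_val));

	return old_val;
}

static void kvm_rmap_unlock(atomic_t *rmap_val, int new_val)
{
	/* The release store publishes any pte_list_desc updates. */
	atomic_set_release(rmap_val, new_val);
}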

But it's an "unsigned long", and so trying to massage it into the above isn't a
net positive for me. This is gnarly x86-specific code; it's very unlikely that
someone will be able to understand the rest of the rmap code, but won't be able
to understand that the code is safe due to x86's strong memory ordering.

> then you must also have an smp_rmb() following a successful
> cmpxchg/acquisition (at least for the case where we then follow the
> pte_list_desc pointer).