Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock

From: James Houghton
Date: Tue Sep 10 2024 - 17:12:14 EST


On Mon, Sep 9, 2024 at 6:42 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Sep 09, 2024, James Houghton wrote:
> > I take back what I said about this working on x86. I think it's
> > possible for there to be a race.
> >
> > Say...
> >
> > 1. T1 modifies pte_list_desc then unlocks kvm_rmap_unlock().
> > 2. T2 then locks kvm_rmap_lock_readonly().
> >
> > The modifications that T1 has made are not guaranteed to be visible to
> > T2 unless T1 has an smp_wmb() (or equivalent) after the modfication
> > and T2 has an smp_rmb() before reading the data.
> >
> > Now the way you had it, T2, because it uses try_cmpxchg() with full
> > ordering, will effectively do a smp_rmb(). But T1 only does an
> > smp_wmb() *after dropping the mmu_lock*, so there is a race. While T1
> > still holds the mmu_lock but after releasing the kvm_rmap_lock(), T2
> > may enter its critical section and then *later* observe the changes
> > that T1 made.
> >
> > Now this is impossible on x86 (IIUC) if, in the compiled list of
> > instructions, T1's writes occur in the same order that we have written
> > them in C. I'm not sure if WRITE_ONCE guarantees that this reordering
> > at compile time is forbidden.
> >
> > So what I'm saying is:
> >
> > 1. kvm_rmap_unlock() must have an smp_wmb().
>
> No, because beating a dead horse, this is not generic code, this is x86.

What prevents the compiler from reordering (non-atomic, non-volatile)
stores that happen before WRITE_ONCE() in kvm_rmap_unlock() to after
the WRITE_ONCE()?

IMV, such a reordering is currently permitted[1] (i.e., a barrier() is
missing), and should the compiler choose to do this, the lock will not
function correctly.

> If kvm_rmap_head.val were an int, i.e. could be unionized with an atomic_t, then
> I wouldn't be opposed to doing this in the locking code to document things:
>
> s/READ_ONCE/atomic_read_acquire
> s/WRITE_ONCE/atomic_set_release
> s/try_cmpxchg/atomic_cmpxchg_acquire

I think we can use atomic_long_t.

It would be really great if we did a substitution like this. That
would address my above concern about barrier() (atomic_set_release,
for example, implies a barrier() that we otherwise need to include).

[1]: https://www.kernel.org/doc/Documentation/memory-barriers.txt
(GUARANTEES + COMPILER BARRIER)