Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

From: Marc Zyngier
Date: Fri Aug 26 2022 - 11:50:23 EST


On Fri, 26 Aug 2022 11:50:24 +0100,
Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
> On 8/24/22 00:47, Marc Zyngier wrote:
> >> I definitely don't think I 100% understand all the ordering things since
> >> they're complicated.. but my understanding is that the reset procedure
> >> didn't need memory barrier (unlike pushing, where we have explicit wmb),
> >> because we assumed the userapp is not hostile so logically it should only
> >> modify the flags which is a 32bit field, assuming atomicity guaranteed.
> > Atomicity doesn't guarantee ordering, unfortunately. Take the
> > following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
> > D that exist in the ring in that order, and CPU1 performs an ioctl to
> > reset the page state.
> >
> > CPU0:
> > write_flag(A, KVM_DIRTY_GFN_F_RESET)
> > write_flag(B, KVM_DIRTY_GFN_F_RESET)
> > write_flag(C, KVM_DIRTY_GFN_F_RESET)
> > write_flag(D, KVM_DIRTY_GFN_F_RESET)
> > [...]
> >
> > CPU1:
> > ioctl(KVM_RESET_DIRTY_RINGS)
> >
> > Since CPU0 writes do not have any ordering, CPU1 can observe the
> > writes in a sequence that has nothing to do with program order, and
> > could for example observe that GFN A and D have been reset, but not B
> > and C. This in turn breaks the logic in the reset code (B, C, and D
> > don't get reset), despite userspace having followed the spec to the
> > letter. If each was a store-release (which is the case on x86), it
> > wouldn't be a problem, but nothing calls it in the documentation.
> >
> > Maybe that's not a big deal if it is expected that each CPU will issue
> > a KVM_RESET_DIRTY_RINGS itself, ensuring that it observes its own
> > writes. But expecting this to work across CPUs without any barrier is
> > wishful thinking.
>
> Agreed, but that's a problem for userspace to solve. If userspace
> wants to reset the fields in different CPUs, it has to synchronize
> with its own invoking of the ioctl.

Userspace has no choice. It cannot, on its own, order the reads that
the kernel will do to *other* rings.

> That is, CPU0 must ensure that an ioctl(KVM_RESET_DIRTY_RINGS) is done
> after (in the memory-ordering sense) its last write_flag(D,
> KVM_DIRTY_GFN_F_RESET). If there's no such ordering, there's no
> guarantee that the write_flag will have any effect.

The problem isn't on CPU0. The problem is that CPU1 does observe
inconsistent data on arm64, and I don't think this difference in
behaviour is acceptable. Nothing documents this, and there is a
baked-in assumption that there is a strong ordering between writes as
well as between writes and reads.
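
To make the failure mode concrete, here is a minimal sketch (not the
kernel's code) of a reset-style scan that walks entries in ring order
and stops at the first one not marked as reset. SKETCH_F_RESET and
count_harvestable() are names made up for this illustration, and the
flags array stands for whatever snapshot CPU1 happens to observe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for KVM_DIRTY_GFN_F_RESET; name and value are local to this sketch. */
#define SKETCH_F_RESET (1u << 1)

/*
 * Walk the observed flags in ring order and count entries until the
 * first one that is not marked as reset. Entries past that point are
 * left alone, even if they are already marked.
 */
static size_t count_harvestable(const uint32_t *flags, size_t n)
{
	size_t i;

	assert(flags != NULL);
	for (i = 0; i < n; i++) {
		if (!(flags[i] & SKETCH_F_RESET))
			break;
	}
	return i;
}
```

If CPU1 observes A and D as reset but not B and C, such a scan only
collects A, which matches the "B, C, and D don't get reset" outcome
from the example quoted above.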

> The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl
> was because it takes kvm->slots_lock so the execution would be
> serialized anyway. Turning slots_lock into an rwsem would be even
> worse because it also takes kvm->mmu_lock (since slots_lock is a
> mutex, at least two concurrent invocations won't clash with each other
> on the mmu_lock).

Whatever the reason, the behaviour should be identical on all
architectures. As it is, it only really works on x86, and I contend
this is a bug that needs fixing.

Thankfully, this can be done at zero cost for x86, and at that of a
set of load-acquires on other architectures.
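
As a rough userspace analogy of that pairing, the sketch below uses
C11 atomics: the writer publishes the flag with a store-release and
the scanner reads it with a load-acquire. This is an assumption-laden
illustration, not a patch: struct sketch_gfn, sketch_mark_reset(),
sketch_harvested() and sketch_reset_scan() are hypothetical names, and
the kernel side would use its own primitives (for example
smp_load_acquire()) rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for KVM_DIRTY_GFN_F_RESET; name and value are local to this sketch. */
#define SKETCH_F_RESET (1u << 1)

/* Hypothetical ring entry, loosely shaped like a dirty GFN record. */
struct sketch_gfn {
	_Atomic uint32_t flags;
	uint32_t slot;
	uint64_t offset;
};

/* Writer side: store-release, so earlier writes are visible before the flag. */
static void sketch_mark_reset(struct sketch_gfn *e)
{
	atomic_store_explicit(&e->flags, SKETCH_F_RESET, memory_order_release);
}

/* Reader side: load-acquire, so later reads cannot be hoisted above the flag read. */
static int sketch_harvested(struct sketch_gfn *e)
{
	uint32_t f = atomic_load_explicit(&e->flags, memory_order_acquire);

	return (f & SKETCH_F_RESET) != 0;
}

/* Scan in ring order, stopping at the first entry not marked as reset. */
static size_t sketch_reset_scan(struct sketch_gfn *ring, size_t n)
{
	size_t i;

	assert(ring != NULL);
	for (i = 0; i < n; i++) {
		if (!sketch_harvested(&ring[i]))
			break;
	}
	return i;
}
```

On x86 both operations compile down to ordinary loads and stores,
which is where the zero cost comes from; weakly ordered architectures
pay for the acquire on each flag read.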

M.

--
Without deviation from the norm, progress is not possible.