Re: Unlock-lock questions and the Linux Kernel Memory Model

From: Boqun Feng
Date: Thu Nov 30 2017 - 21:44:32 EST


On Thu, Nov 30, 2017 at 10:46:22AM -0500, Alan Stern wrote:
> On Thu, 30 Nov 2017, Boqun Feng wrote:
>
> > On Wed, Nov 29, 2017 at 02:44:37PM -0500, Alan Stern wrote:
> > > On Wed, 29 Nov 2017, Daniel Lustig wrote:
> > >
> > > > While we're here, let me ask about another test which isn't directly
> > > > about unlock/lock but which is still somewhat related to this
> > > > discussion:
> > > >
> > > > "MP+wmb+xchg-acq" (or some such)
> > > >
> > > > {}
> > > >
> > > > P0(int *x, int *y)
> > > > {
> > > > 	WRITE_ONCE(*x, 1);
> > > > 	smp_wmb();
> > > > 	WRITE_ONCE(*y, 1);
> > > > }
> > > >
> > > > P1(int *x, int *y)
> > > > {
> > > > 	r1 = atomic_xchg_relaxed(y, 2);
> > > > 	r2 = smp_load_acquire(y);
> > > > 	r3 = READ_ONCE(*x);
> > > > }
> > > >
> > > > exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> > > >
> > > > C/C++ would call the atomic_xchg_relaxed part of a release sequence
> > > > and hence would forbid this outcome.
> > > >
> > > > x86 and Power would forbid this. ARM forbids this via a special-case
> > > > rule in the memory model, ordering atomics with later load-acquires.
> > > >
> > > > RISC-V, however, wouldn't forbid this by default using RCpc or RCsc
> > > > atomics for smp_load_acquire(). It's an "fri; rfi" type of pattern,
> > > > because xchg doesn't have an inherent internal data dependency.
> > > >
> > > > If the Linux memory model is going to forbid this outcome, then
> > > > RISC-V would either need to use fences instead, or maybe we'd need to
> > > > add a special rule to our memory model similarly. This is one detail
> > > > where RISC-V is still actively deciding what to do.
> > > >
> > > > Have you all thought about this test before? Any idea which way you
> > > > are leaning regarding the outcome above?
> > >
> > > Good questions. Currently the LKMM allows this, and I think it should
> > > because xchg doesn't have a dependency from its read to its write.
> > >
> > > On the other hand, herd isn't careful enough in the way it implements
> > > internal dependencies for RMW operations. If we change
> > > atomic_xchg_relaxed(y, 2) to atomic_inc(y) and remove r1 from the test:
> > >
> > > C MP+wmb+inc-acq
> > >
> > > {}
> > >
> > > P0(int *x, int *y)
> > > {
> > > 	WRITE_ONCE(*x, 1);
> > > 	smp_wmb();
> > > 	WRITE_ONCE(*y, 1);
> > > }
> > >
> > > P1(int *x, int *y)
> > > {
> > > 	atomic_inc(y);
> > > 	r2 = smp_load_acquire(y);
> > > 	r3 = READ_ONCE(*x);
> > > }
> > >
> > > exists (1:r2=2 /\ 1:r3=0)
> > >
> > > then the test _should_ be forbidden, but it isn't -- herd doesn't
> > > realize that all atomic RMW operations other than xchg must have a
> > > dependency (either data or control) between their internal read and
> > > write.
> > >
> > > (Although the smp_load_acquire is allowed to execute before the write
> > > part of the atomic_inc, it cannot execute before the read part. I
> > > think a similar argument applies even on ARM.)
> > >
> >
> > But in case of AMOs, which directly send the addition request to memory
> > controller, so there wouldn't be any read part or even write part of the
> > atomic_inc() executed by CPU. Would this be allowed then?
>
> Firstly, sending the addition request to the memory controller _is_ a
> write operation.
>
> Secondly, even though the CPU hardware might not execute a read
> operation during an AMO, the LKMM and herd nevertheless represent the
> atomic update as a specially-annotated read event followed by a write
> event.
>

Ah, right! From the point of view of the model, there are read events
and write events for the atomics.

> In an other-multicopy-atomic system, P0's write to y must become
> visible to P1 before P1 executes the smp_load_acquire, because the
> write was visible to the memory controller when the controller carried
> out the AMO, and the write becomes visible to the memory controller and
> to P1 at the same time (by other-multicopy-atomicity). That's why I
> said the test would be forbidden on ARM.
>

Agreed.

> But even on a non-other-multicopy-atomic system, there has to be some
> synchronization between the memory controller and P1's CPU. Otherwise,
> how could the system guarantee that P1's smp_load_acquire would see the
> post-increment value of y? It seems reasonable to assume that this
> synchronization would also cause P1 to see x=1.
>

I agree with you on the "reasonable" part ;-) So basically, the memory
controller can only do the write of the AMO after P0's second write has
propagated to the memory controller (and because of the smp_wmb(), P0's
first write must already have propagated to the memory controller, too).
So it makes sense that when the write of the AMO propagates from the
memory controller to P1, P0's first write has also propagated to P1. IOW,
the write of the AMO on the memory controller acts at least like a release.
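
As a rough illustration of the "acts at least like a release" argument,
here is a userspace C11 sketch of MP+wmb+inc-acq. It is an analogue under
stated assumptions, not kernel code and not the LKMM: a release fence
stands in for smp_wmb(), a relaxed atomic_fetch_add_explicit() stands in
for atomic_inc(), and the helper name count_forbidden() is made up for
this example. In C11 the relaxed RMW continues the release sequence headed
by P0's store to y, so r2=2 /\ r3=0 should never be counted.

```c
/* C11 analogue of MP+wmb+inc-acq; a userspace sketch, not kernel code. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int x, y;
static int r2, r3;

static void *p0(void *unused)
{
	(void)unused;
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	/* Assumption: a release fence stands in for smp_wmb(). */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	return NULL;
}

static void *p1(void *unused)
{
	(void)unused;
	/* Relaxed RMW, standing in for atomic_inc(y). */
	atomic_fetch_add_explicit(&y, 1, memory_order_relaxed);
	r2 = atomic_load_explicit(&y, memory_order_acquire);
	r3 = atomic_load_explicit(&x, memory_order_relaxed);
	return NULL;
}

/* Runs the test `trials` times; returns how often r2=2 /\ r3=0 was seen. */
static int count_forbidden(int trials)
{
	int hits = 0;

	for (int i = 0; i < trials; i++) {
		pthread_t t0, t1;
		int err;

		/* Reset shared state before each run. */
		atomic_store(&x, 0);
		atomic_store(&y, 0);
		err = pthread_create(&t0, NULL, p0, NULL);
		assert(err == 0);
		err = pthread_create(&t1, NULL, p1, NULL);
		assert(err == 0);
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);
		/* r2 and r3 are safe to read after the joins. */
		if (r2 == 2 && r3 == 0)
			hits++;
	}
	return hits;
}
```

Calling count_forbidden(1000) from main() and checking for zero is one way
to use it. A clean run proves nothing about a given piece of hardware, of
course; it only shows what the C11 rules promise for this shape.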

However, part of me is still a little paranoid: to my understanding, the
point of AMOs is to execute atomic operations as fast as possible, so
maybe an AMO implementation has some fast path that lets the memory
controller forward the write directly to the CPU that issued the AMO. In
that case, the assumption above would no longer be reasonable ;-)

With that in mind, I think it would be better if herd provided type
annotations for the read and write parts of atomics, and we handled the
internal dependency inside the LKMM's cats and bells, rather than letting
herd provide the internal dependency by default.
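
To make the split concrete, here is a small C sketch of the idea. It has
nothing to do with herd's real internals or the cat language: the types
(struct event, enum rmw_kind) and the function rmw_internal_dep() are all
hypothetical. The tool only tags the read and write halves of an RMW, and
the model-side policy decides whether the two halves are ordered by an
internal dependency (here: every RMW except xchg, as Alan described).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical annotations the tool would attach to each event. */
enum rmw_kind { RMW_NONE, RMW_XCHG, RMW_INC };
enum ev_type { EV_READ, EV_WRITE };

struct event {
	enum ev_type type;
	enum rmw_kind kind;	/* which atomic op this event belongs to */
	int rmw_partner;	/* index of the other half, or -1 */
};

/*
 * Model-side policy: is there an internal dependency from read event r
 * to write event w?  Only the two halves of a non-xchg RMW qualify.
 */
static bool rmw_internal_dep(const struct event *evs, int r, int w)
{
	assert(evs[r].type == EV_READ && evs[w].type == EV_WRITE);
	if (evs[r].rmw_partner != w || evs[w].rmw_partner != r)
		return false;	/* not the two halves of one RMW */
	return evs[r].kind != RMW_XCHG && evs[r].kind != RMW_NONE;
}
```

With this shape, changing the policy for xchg or for AMO-style
implementations would be a change to the model, not to the tool.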

Regards,
Boqun

> Alan Stern
>
