Re: [PATCH v10 6/6] x86/split_lock: Enable split lock detection by kernel parameter

From: Andy Lutomirski
Date: Mon Dec 16 2019 - 12:23:08 EST


On Mon, Dec 16, 2019 at 1:59 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> From: Andy Lutomirski
> > Sent: 12 December 2019 19:41
> > On Wed, Dec 11, 2019 at 2:34 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, Dec 11, 2019 at 10:12:56AM -0800, Andy Lutomirski wrote:
> >
> > > > > Sure, but we're talking two cpus here.
> > > > >
> > > > > u32 var = 0;
> > > > > u8 *ptr = &var;
> > > > >
> > > > >     CPU0                          CPU1
> > > > >
> > > > >                                   xchg(ptr, 1)
> > > > >
> > > > >     xchg(ptr+1, 1);
> > > > >     r = READ_ONCE(var);
> > > > >
> > > > > AFAICT nothing guarantees r == 0x0101. The CPU1 store can be stuck in
> > > > > CPU1's store-buffer. CPU0's xchg() does not overlap and therefore
> > > > > doesn't force a snoop or forward.
> > > >
> > > > I think I don't quite understand. The final value of var had better
> > > > be 0x0101 or something is severely wrong.
> > >
> > > > But r can be 0x0100 because
> > > > nothing in this example guarantees that the total order of the locked
> > > > instructions has CPU 1's instruction first.
> > >
> > > Assuming CPU1 goes first, why would the load from CPU0 see CPU1's
> > > ptr[0]? It can still be in CPU1's store buffer, and TSO allows regular
> > > reads to ignore (remote) store-buffers.
> >
> > What I'm saying is: if CPU0 goes first, then the three operations order as:
> >
> > xchg(ptr+1, 1);
> > r = READ_ONCE(var); /* 0x0100 */
> > xchg(ptr, 1);
> >
> > Anyway, this is all a bit too hypothetical for me. Is there a clear
> > example where the total ordering of LOCKed instructions is observable?
> > That is, is there a sequence of operations on, presumably, two or
> > three CPUs, such that LOCKed instructions being only partially ordered
> > allows an outcome that is disallowed by a total ordering? I suspect
> > there is, but I haven't come up with it yet. (I mean in an x86-like
> > memory model. Getting this in a relaxed atomic model is easy.)
> >
> > As a probably bad example:
> >
> > u32 x0, x1, a1, b0, b1;
> >
> > CPU 0:
> > xchg(&x0, 1);
> > barrier();
> > a1 = READ_ONCE(x1);
> >
> > CPU 1:
> > xchg(&x1, 1);
> >
> > CPU 2:
> > b1 = READ_ONCE(x1);
> > smp_rmb(); /* which is just barrier() on x86 */
> > b0 = READ_ONCE(x0);
> >
> > Suppose a1 == 0 and b1 == 1. Then we know that CPU0's READ_ONCE
> > happened before CPU1's xchg, and hence CPU0's xchg happened before
> > CPU1's xchg. We also know that CPU2's first read observed the write
> > from CPU1's xchg, and CPU2's second read is ordered after its first
> > by the smp_rmb(), so the second read happened after CPU1's xchg and
> > therefore after CPU0's xchg (because the xchg operations have a
> > total order according to the SDM). This means that b0 can't be 0.
> >
> > Hence the outcome (a1, b1, b0) == (0, 1, 0) is disallowed.
> >
> > It's entirely possible that I screwed up the analysis. But I think
> > this means that the cache coherency mechanism is doing something more
> > intelligent than just shoving the x0=1 write into the store buffer and
> > letting it hang out there. Something needs to make sure that CPU 2
> > observes everything in the same order that CPU 0 observes, and, as
> > far as I know, there is a considerable amount of complexity in the
> > CPUs that makes sure this happens.
> >
> > So here's my question: do you have a concrete example of a series of
> > operations and an outcome that you suspect Intel CPUs allow but that
> > is disallowed in the SDM?
>
> I'm not sure that example is at all relevant.
> READ_ONCE() doesn't have any sequencing requirements on the CPU, just on the compiler.
> (The same is true of any 'atomic read'.)

I'm talking specifically about x86 here, where, for example, "Reads
are not reordered with other reads" (SDM vol. 3A, sec. 8.2.2). So
READ_ONCE *does* have sequencing requirements on the CPU.
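
To make that concrete: READ_ONCE() is just a volatile load, and on
x86 smp_rmb() is just barrier(), because the CPU already keeps loads
ordered and only the compiler needs taming. Roughly, simplified from
the kernel's real definitions (which add type-checking):

#define READ_ONCE(x)   (*(const volatile typeof(x) *)&(x))
#define barrier()      asm volatile("" ::: "memory")

/*
 * Both loads below compile to plain MOVs.  barrier() only keeps the
 * compiler from reordering them; no fence instruction is emitted,
 * because the CPU won't reorder the two loads anyway.
 */
b1 = READ_ONCE(x1);
barrier();              /* smp_rmb() on x86 */
b0 = READ_ONCE(x0);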

Feel free to replace READ_ONCE with MOV in your head if you like :)
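
If anyone wants to poke at the example above from user space, here's
a quick harness. It's a sketch, not the kernel primitives: C11
seq_cst exchange stands in for xchg(), a compiler-only fence stands
in for barrier()/smp_rmb(), and pthread barriers just delimit the
rounds. If the SDM's total order over locked operations holds, the
count should stay at zero:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ROUNDS 1000000

static atomic_int x0, x1;
static int a1, b0, b1;                  /* per-round results */
static pthread_barrier_t start_b, stop_b;

static void *cpu0_fn(void *arg)
{
        (void)arg;
        for (int i = 0; i < ROUNDS; i++) {
                pthread_barrier_wait(&start_b);
                atomic_exchange_explicit(&x0, 1, memory_order_seq_cst);
                atomic_signal_fence(memory_order_seq_cst);      /* barrier() */
                a1 = atomic_load_explicit(&x1, memory_order_relaxed);
                pthread_barrier_wait(&stop_b);
        }
        return NULL;
}

static void *cpu1_fn(void *arg)
{
        (void)arg;
        for (int i = 0; i < ROUNDS; i++) {
                pthread_barrier_wait(&start_b);
                atomic_exchange_explicit(&x1, 1, memory_order_seq_cst);
                pthread_barrier_wait(&stop_b);
        }
        return NULL;
}

static void *cpu2_fn(void *arg)
{
        (void)arg;
        for (int i = 0; i < ROUNDS; i++) {
                pthread_barrier_wait(&start_b);
                b1 = atomic_load_explicit(&x1, memory_order_relaxed);
                atomic_signal_fence(memory_order_seq_cst);      /* smp_rmb() */
                b0 = atomic_load_explicit(&x0, memory_order_relaxed);
                pthread_barrier_wait(&stop_b);
        }
        return NULL;
}

int main(void)
{
        pthread_t t[3];
        long hits = 0;

        pthread_barrier_init(&start_b, NULL, 4);
        pthread_barrier_init(&stop_b, NULL, 4);
        pthread_create(&t[0], NULL, cpu0_fn, NULL);
        pthread_create(&t[1], NULL, cpu1_fn, NULL);
        pthread_create(&t[2], NULL, cpu2_fn, NULL);

        for (int i = 0; i < ROUNDS; i++) {
                atomic_store(&x0, 0);   /* workers are parked at start_b */
                atomic_store(&x1, 0);
                pthread_barrier_wait(&start_b);
                pthread_barrier_wait(&stop_b);
                if (a1 == 0 && b1 == 1 && b0 == 0)
                        hits++;         /* the supposedly forbidden outcome */
        }
        for (int i = 0; i < 3; i++)
                pthread_join(t[i], NULL);
        printf("(a1, b1, b0) == (0, 1, 0): %ld of %d rounds\n", hits, ROUNDS);
        return 0;
}

Build with gcc -O2 -pthread; pinning the three threads to different
cores makes it a better stress. Millions of clean rounds prove
nothing, of course, but a single hit would be very interesting.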