Re: [PATCH v10 6/6] x86/split_lock: Enable split lock detection by kernel parameter

From: Andy Lutomirski
Date: Thu Dec 12 2019 - 14:41:05 EST

On Wed, Dec 11, 2019 at 2:34 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Wed, Dec 11, 2019 at 10:12:56AM -0800, Andy Lutomirski wrote:

> > > Sure, but we're talking two cpus here.
> > >
> > > u32 var = 0;
> > > u8 *ptr = &var;
> > >
> > > CPU0 CPU1
> > >
> > > xchg(ptr, 1)
> > >
> > > xchg((ptr+1, 1);
> > > r = READ_ONCE(var);
> > >
> > > AFAICT nothing guarantees r == 0x0101. The CPU1 store can be stuck in
> > > CPU1's store-buffer. CPU0's xchg() does not overlap and therefore
> > > doesn't force a snoop or forward.
> >
> > I think I don't quite understand. The final value of var had better
> > be 0x0101 or something is severely wrong.
> > But r can be 0x0100 because
> > nothing in this example guarantees that the total order of the locked
> > instructions has CPU 1's instruction first.
> Assuming CPU1 goes first, why would the load from CPU0 see CPU1's
> ptr[0]? It can be in CPU1 store buffer, and TSO allows regular reads to
> ignore (remote) store-buffers.

What I'm saying is: if CPU0 goes first, then the three operations order as:

xchg(ptr+1, 1);
r = READ_ONCE(var); /* 0x0100 */
xchg(ptr, 1);

Anyway, this is all a bit too hypothetical for me. Is there a clear
example where the total ordering of LOCKed instructions is observable?
That is, is there a sequence of operations on, presumably, two or
three CPUs, such that LOCKed instructions being only partially ordered
allows an outcome that is disallowed by a total ordering? I suspect
there is, but I haven't come up with it yet. (I mean in an x86-like
memory model. Getting this in a relaxed atomic model is easy.)

As a probably bad example:

u32 x0, x1, a1, b0, b1;

CPU 0:
xchg(&x0, 1);
a1 = READ_ONCE(x1);

CPU 1:
xchg(&b, 1);

CPU 2:
b1 = READ_ONCE(x1);
smp_rmb(); /* which is just barrier() on x86 */
b0 = READ_ONCE(x0);

Suppose a1 == 0 and b1 == 1. Then we know that CPU0's READ_ONCE
happened before CPU1's xchg and hence CPU0's xchg happened before
CPU1's xchg. We also know that CPU2's first read observed the write
from CPU1's xchg, which means that CPU2's second read should have been
after CPU0's xchg (because the xchg operations have a total order
according to the SDM). This means that b0 can't be 0.

Hence the outcome (a1, b1, b0) == (0, 1, 0) is disallowed.

It's entirely possible that I screwed up the analysis. But I think
this means that the cache coherency mechanism is doing something more
intelligent than just shoving the x0=1 write into the store buffer and
letting it hang out there. Something needs to make sure that CPU 2
observes everything in the same order that CPU 0 observes, and, as far
as I know it, there is a considerable amount of complexity in the CPUs
that makes sure this happens.

So here's my question: do you have a concrete example of a series of
operations and an outcome that you suspect Intel CPUs allow but that
is disallowed in the SDM?