RE: [PATCH v10 6/6] x86/split_lock: Enable split lock detection by kernel parameter

From: David Laight
Date: Mon Dec 16 2019 - 04:59:27 EST

Next message: mathew thomson: "Goodnews for you.Recieve your 'ATM CARD'urgently"
Previous message: Kishon Vijay Abraham I: "[PATCH v4 00/14] PHY: Add support for SERDES in TI's J721E SoC"
Next in thread: Andy Lutomirski: "Re: [PATCH v10 6/6] x86/split_lock: Enable split lock detection by kernel parameter"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

From: Andy Lutomirski
> Sent: 12 December 2019 19:41
> On Wed, Dec 11, 2019 at 2:34 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Wed, Dec 11, 2019 at 10:12:56AM -0800, Andy Lutomirski wrote:
>
> > > > Sure, but we're talking two cpus here.
> > > >
> > > > u32 var = 0;
> > > > u8 *ptr = &var;
> > > >
> > > > CPU0 CPU1
> > > >
> > > > xchg(ptr, 1)
> > > >
> > > > xchg((ptr+1, 1);
> > > > r = READ_ONCE(var);
> > > >
> > > > AFAICT nothing guarantees r == 0x0101. The CPU1 store can be stuck in
> > > > CPU1's store-buffer. CPU0's xchg() does not overlap and therefore
> > > > doesn't force a snoop or forward.
> > >
> > > I think I don't quite understand. The final value of var had better
> > > be 0x0101 or something is severely wrong.
> >
> > > But r can be 0x0100 because
> > > nothing in this example guarantees that the total order of the locked
> > > instructions has CPU 1's instruction first.
> >
> > Assuming CPU1 goes first, why would the load from CPU0 see CPU1's
> > ptr[0]? It can be in CPU1 store buffer, and TSO allows regular reads to
> > ignore (remote) store-buffers.
>
> What I'm saying is: if CPU0 goes first, then the three operations order as:
>
>
>
> xchg(ptr+1, 1);
> r = READ_ONCE(var); /* 0x0100 */
> xchg(ptr, 1);
>
> Anyway, this is all a bit too hypothetical for me. Is there a clear
> example where the total ordering of LOCKed instructions is observable?
> That is, is there a sequence of operations on, presumably, two or
> three CPUs, such that LOCKed instructions being only partially ordered
> allows an outcome that is disallowed by a total ordering? I suspect
> there is, but I haven't come up with it yet. (I mean in an x86-like
> memory model. Getting this in a relaxed atomic model is easy.)
>
> As a probably bad example:
>
> u32 x0, x1, a1, b0, b1;
>
> CPU 0:
> xchg(&x0, 1);
> barrier();
> a1 = READ_ONCE(x1);
>
> CPU 1:
> xchg(&b, 1);
>
> CPU 2:
> b1 = READ_ONCE(x1);
> smp_rmb(); /* which is just barrier() on x86 */
> b0 = READ_ONCE(x0);
>
> Suppose a1 == 0 and b1 == 1. Then we know that CPU0's READ_ONCE
> happened before CPU1's xchg and hence CPU0's xchg happened before
> CPU1's xchg. We also know that CPU2's first read observed the write
> from CPU1's xchg, which means that CPU2's second read should have been
> after CPU0's xchg (because the xchg operations have a total order
> according to the SDM). This means that b0 can't be 0.
>
> Hence the outcome (a1, b1, b0) == (0, 1, 0) is disallowed.
>
> It's entirely possible that I screwed up the analysis. But I think
> this means that the cache coherency mechanism is doing something more
> intelligent than just shoving the x0=1 write into the store buffer and
> letting it hang out there. Something needs to make sure that CPU 2
> observes everything in the same order that CPU 0 observes, and, as far
> as I know it, there is a considerable amount of complexity in the CPUs
> that makes sure this happens.
>
> So here's my question: do you have a concrete example of a series of
> operations and an outcome that you suspect Intel CPUs allow but that
> is disallowed in the SDM?

I'm not sure that example is at all relevant.
READ_ONCE() doesn't have any sequencing requirements on the cpu, just on the compiler.
(The same is true of any 'atomic read'.)
Locks work because the RMW operation is atomic, all writes occur in sequence
and (I think) reads cannot 'overtake' the RMW sequence.
In particular you don't want speculative reads being done before the RMW
instruction completes.
I believe that, on x86, this is guaranteed without requiring an LFENCE provided
there are no non-temporal (ie cache bypassing) or write-combining memory cycles.
Other architectures require additional software.

To get any kind of hardware interlock from a read (eg of an atomic counter)
you need to read it with a RMW cycle (eg XADD $0,memory).

The point about bis(&x,1) and bis(&(x+1),1) is that, is you wait long enough
you'll always see both bits set.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Next message: mathew thomson: "Goodnews for you.Recieve your 'ATM CARD'urgently"
Previous message: Kishon Vijay Abraham I: "[PATCH v4 00/14] PHY: Add support for SERDES in TI's J721E SoC"
Next in thread: Andy Lutomirski: "Re: [PATCH v10 6/6] x86/split_lock: Enable split lock detection by kernel parameter"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]