Re: [PATCH v10 6/6] x86/split_lock: Enable split lock detection by kernel parameter

From: Peter Zijlstra
Date: Wed Dec 11 2019 - 12:52:20 EST


On Fri, Nov 22, 2019 at 01:23:30PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 22, 2019 at 12:31 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Nov 22, 2019 at 05:48:14PM +0000, Luck, Tony wrote:
> > > > When we use byte ops, we must consider the word as 4 independent
> > > > variables. And in that case the later load might observe the lock-byte
> > > > state from 3, because the modification to the lock byte from 4 is in
> > > > CPU2's store-buffer.
> > >
> > > So we absolutely violate this with the optimization for constant arguments
> > > to set_bit(), clear_bit() and change_bit() that are implemented as byte ops.
> > >
> > > So is code that does:
> > >
> > > set_bit(0, bitmap);
> > >
> > > on one CPU. While another is doing:
> > >
> > > set_bit(mybit, bitmap);
> > >
> > > on another CPU safe? The first operates on just one byte, the second on 8 bytes.
> >
> > It is safe if all you care about is the consistency of that one bit.
> >
>
> I'm still lost here. Can you explain how one could write code that
> observes an issue? My trusty SDM, Vol 3 8.2.2 says "Locked
> instructions have a total order."

This is the thing I don't fully believe. Per this thread the bus-lock is
*BAD* and not used for normal LOCK prefixed operations. But without the
bus-lock it becomes very hard to guarantee total order.

After all, if some CPU doesn't observe a specific variable, it doesn't
care where in the order it fell. So I'm thinking they punted and went
with some partial order that is near enough that it becomes very hard to
tell the difference the moment you actually do observe stuff.

> 8.2.3.9 says "Loads and Stores Are
> Not Reordered with Locked Instructions." Admittedly, the latter is an
> "example", but the section is very clear about the fact that a locked
> instruction prevents reordering of a load or a store issued by the
> same CPU relative to the locked instruction *regardless of whether
> they overlap*.

IIRC this rule is CPU-local.

Sure, but we're talking two cpus here.

u32 var = 0;
u8 *ptr = &var;

CPU0 CPU1

xchg(ptr, 1)

xchg((ptr+1, 1);
r = READ_ONCE(var);

AFAICT nothing guarantees r == 0x0101. The CPU1 store can be stuck in
CPU1's store-buffer. CPU0's xchg() does not overlap and therefore
doesn't force a snoop or forward.

>From the perspective of the LOCK prefixed instructions CPU0 never
observes the variable @ptr. And therefore doesn't need to provide order.

Note how the READ_ONCE() is a normal load on CPU0, and per the rules is
only forced to happen after it's own LOCK prefixed instruction, but it
is free to observe ptr[0,2,3] from before, only ptr[1] will be forwarded
from its own store-buffer.

This is exactly the one reorder TSO allows.

> I understand that the CPU is probably permitted to optimize a LOCK RMW
> operation such that it retires before the store buffers of earlier
> instructions are fully flushed, but only if the store buffer and cache
> coherency machinery work together to preserve the architecturally
> guaranteed ordering.

Maybe, maybe not. I'm very loathe to trust this without things being
better specified.

Like I said, it is possible that it all works, but the way I understand
things I _really_ don't want to rely on it.

Therefore, I've written:

u32 var = 0;
u8 *ptr = &var;

CPU0 CPU1

xchg(ptr, 1)

set_bit(8, ptr);

r = READ_ONCE(var);

Because then the LOCK BTSL overlaps with the LOCK XCHGB and CPU0 now
observes the variable @ptr and therefore must force order.

Did this clarify, or confuse more?