Re: [PATCH v10 6/6] x86/split_lock: Enable split lock detection by kernel parameter

From: Peter Zijlstra
Date: Fri Nov 22 2019 - 04:26:17 EST

On Thu, Nov 21, 2019 at 01:22:13PM -0800, Andy Lutomirski wrote:
> On Thu, Nov 21, 2019 at 12:25 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Thu, Nov 21, 2019 at 10:53:03AM -0800, Fenghua Yu wrote:
> >
> > > 4. Otherwise, re-calculate addr to point the 32-bit address which contains
> > > the bit and operate on the bit. No split lock.
> >
> > That sounds confused, Even BT{,CRS} have a RmW size. There is no
> > 'operate on the bit'.
> >
> > Specifically I hard rely on BTSL to be a 32bit RmW, see commit:
> >
> > 7aa54be29765 ("locking/qspinlock, x86: Provide liveness guarantee")
> >
> Okay, spent a bit of time trying to grok this. Are you saying that
> LOCK BTSL suffices in a case where LOCK BTSB or LOCK XCHG8 would not?


> On x86, all the LOCK operations are full barriers, so they should
> order with adjacent normal accesses even to unrelated addresses,
> right?

Yep, still.

The barrier is not the problem here. Yes the whole value load must come
after the atomic op, be it XCHGB/BTSB or BTSL.

The problem with XCHGB is that it is an 8bit RmW and therefore it makes
no guarantess on the contents of the bytes next to it.

When we use byte ops, we must consider the word as 4 independent
variables. And in that case the later load might observe the lock-byte
state from 3, because the modification to the lock byte from 4 is in
CPU2's store-buffer.

However, by using a 32bit RmW, we force a write on all 4 bytes at the
same time which forces that store from CPU2 to be flushed (because the
operations overlap, whereas an 8byte RmW would not overlap and be

Now, it _might_ work with an XCHGB anyway, _if_ coherency is per
cacheline, and not on a smaller granularity. But I don't think that is
something the architecture guarantees -- they could play fun and games
with partial forwards or whatever.

Specifically, we made this change:

450cbdd0125c ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE")

Now, we know that MFENCE will in fact flush the store buffers, and LOCK
prefix being faster does seem to imply it does not. LOCK prefix only
guarantees order, it does not guarantee completion (that's what makes
MFENCE so much more expensive).

Also; if we're going to change the bitops API, that is a generic change
and we must consider all architectures. Me having audited the atomic
bitops width a fair number of times now. to answer question on what code
actually does and/or if a proposed change is valid, indicates the
current state is crap, irrespective of the long vs u32 question.

So I'm saying that if we're going to muck with bitops, lets make it a
simple and consistent thing. This concurrency crap is hard enough
without fancy bells on.