Re: [PATCH v10 6/6] x86/split_lock: Enable split lock detection by kernel parameter

From: Peter Zijlstra
Date: Wed Dec 11 2019 - 17:34:28 EST


On Wed, Dec 11, 2019 at 10:12:56AM -0800, Andy Lutomirski wrote:
> On Wed, Dec 11, 2019 at 9:52 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Nov 22, 2019 at 01:23:30PM -0800, Andy Lutomirski wrote:
> > > On Fri, Nov 22, 2019 at 12:31 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Nov 22, 2019 at 05:48:14PM +0000, Luck, Tony wrote:
> > > > > > When we use byte ops, we must consider the word as 4 independent
> > > > > > variables. And in that case the later load might observe the lock-byte
> > > > > > state from 3, because the modification to the lock byte from 4 is in
> > > > > > CPU2's store-buffer.
> > > > >
> > > > > So we absolutely violate this with the optimization for constant arguments
> > > > > to set_bit(), clear_bit() and change_bit() that are implemented as byte ops.
> > > > >
> > > > > So is code that does:
> > > > >
> > > > > set_bit(0, bitmap);
> > > > >
> > > > > on one CPU. While another is doing:
> > > > >
> > > > > set_bit(mybit, bitmap);
> > > > >
> > > > > on another CPU safe? The first operates on just one byte, the second on 8 bytes.
> > > >
> > > > It is safe if all you care about is the consistency of that one bit.
> > > >
> > >
> > > I'm still lost here. Can you explain how one could write code that
> > > observes an issue? My trusty SDM, Vol 3 8.2.2 says "Locked
> > > instructions have a total order."
> >
> > This is the thing I don't fully believe. Per this thread the bus-lock is
> > *BAD* and not used for normal LOCK prefixed operations. But without the
> > bus-lock it becomes very hard to guarantee total order.
> >
> > After all, if some CPU doesn't observe a specific variable, it doesn't
> > care where in the order it fell. So I'm thinking they punted and went
> > with some partial order that is near enough that it becomes very hard to
> > tell the difference the moment you actually do observe stuff.
>
> I hope that, if the SDM is indeed wrong, that Intel would fix the SDM.
> It's definitely not fun to try to understand locking if we don't trust
> the manual.

I can try and find a HW person; but getting the SDM updated is
difficult.

Anyway, the way I see it, it is a scalability thing. Absolute total
order is untenable, it cannot be, it would mean that if you have your 16
socket 20 core system with hyperthreads, and each logical CPU doing a
LOCK prefix instruction on a separate page, they all 640 need to sit
down and discuss who goes first.

Some sort of partial order that connects where variables/lines are
actually shared is needed. Then again, I'm not a HW person, just a poor
sod trying to understand how this can work.

> > Sure, but we're talking two cpus here.
> >
> > u32 var = 0;
> > u8 *ptr = &var;
> >
> > CPU0 CPU1
> >
> > xchg(ptr, 1)
> >
> > xchg((ptr+1, 1);
> > r = READ_ONCE(var);
> >
> > AFAICT nothing guarantees r == 0x0101. The CPU1 store can be stuck in
> > CPU1's store-buffer. CPU0's xchg() does not overlap and therefore
> > doesn't force a snoop or forward.
>
> I think I don't quite understand. The final value of var had better
> be 0x0101 or something is severely wrong.

> But r can be 0x0100 because
> nothing in this example guarantees that the total order of the locked
> instructions has CPU 1's instruction first.

Assuming CPU1 goes first, why would the load from CPU0 see CPU1's
ptr[0]? It can be in CPU1 store buffer, and TSO allows regular reads to
ignore (remote) store-buffers.

> > From the perspective of the LOCK prefixed instructions CPU0 never
> > observes the variable @ptr. And therefore doesn't need to provide order.
>
> I suspect that the implementation works on whole cache lines for
> everything except the actual store buffer entries, which would mean
> that CPU 0 does think it observed ptr[0].

Quite possible, but consider SMT where each thread has its own
store-buffer. Then the core owns the line, but the value is still not
visible.

I don't know if they want to tie down those semantics.

> > Note how the READ_ONCE() is a normal load on CPU0, and per the rules is
> > only forced to happen after it's own LOCK prefixed instruction, but it
> > is free to observe ptr[0,2,3] from before, only ptr[1] will be forwarded
> > from its own store-buffer.
> >
> > This is exactly the one reorder TSO allows.
>
> If so, then our optimized smp_mb() has all kinds of problems, no?

Why? All smp_mb() guarantees is order between two memops and it does
that just fine.

> > Did this clarify, or confuse more?
>
> Probably confuses more.

Lets put it this way, the first approach has many questions and subtle
points, the second approach must always work without question.

> If you're actual concerned that the SDM is wrong, I think that roping
> in some architects would be a good idea.

I'll see what I can do, getting them to commit to something is always
the hard part.

> I still think that making set_bit() do 32-bit or smaller accesses is okay.

Yes, that really should not be a problem. This whole subthread was more
of a cautionary tale that it is not immediately obviously safe. And like
I've said before, the bitops interface is across all archs, we must
consider the weakest behaviour.

Anyway, we considered these things when we did
clear_bit_unlock_is_negative_byte(), and there is a reason we ended up
with BIT(7), there is no way to slice up a byte.