Re: [PATCH v10 6/6] x86/split_lock: Enable split lock detection by kernel parameter
From: Andy Lutomirski
Date: Wed Dec 11 2019 - 13:13:13 EST
On Wed, Dec 11, 2019 at 9:52 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, Nov 22, 2019 at 01:23:30PM -0800, Andy Lutomirski wrote:
> > On Fri, Nov 22, 2019 at 12:31 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Nov 22, 2019 at 05:48:14PM +0000, Luck, Tony wrote:
> > > > > When we use byte ops, we must consider the word as 4 independent
> > > > > variables. And in that case the later load might observe the lock-byte
> > > > > state from 3, because the modification to the lock byte from 4 is in
> > > > > CPU2's store-buffer.
> > > >
> > > > So we absolutely violate this with the optimization for constant arguments
> > > > to set_bit(), clear_bit() and change_bit() that are implemented as byte ops.
> > > >
> > > > So is code that does:
> > > >
> > > > set_bit(0, bitmap);
> > > >
> > > > on one CPU. While another is doing:
> > > >
> > > > set_bit(mybit, bitmap);
> > > >
> > > > on another CPU safe? The first operates on just one byte, the second on 8 bytes.
> > >
> > > It is safe if all you care about is the consistency of that one bit.
> > >
> >
> > I'm still lost here. Can you explain how one could write code that
> > observes an issue? My trusty SDM, Vol 3 8.2.2 says "Locked
> > instructions have a total order."
>
> This is the thing I don't fully believe. Per this thread the bus-lock is
> *BAD* and not used for normal LOCK prefixed operations. But without the
> bus-lock it becomes very hard to guarantee total order.
>
> After all, if some CPU doesn't observe a specific variable, it doesn't
> care where in the order it fell. So I'm thinking they punted and went
> with some partial order that is near enough that it becomes very hard to
> tell the difference the moment you actually do observe stuff.
I hope that, if the SDM is indeed wrong, that Intel would fix the SDM.
It's definitely not fun to try to understand locking if we don't trust
the manual.
>
> > 8.2.3.9 says "Loads and Stores Are
> > Not Reordered with Locked Instructions." Admittedly, the latter is an
> > "example", but the section is very clear about the fact that a locked
> > instruction prevents reordering of a load or a store issued by the
> > same CPU relative to the locked instruction *regardless of whether
> > they overlap*.
>
> IIRC this rule is CPU-local.
>
> Sure, but we're talking two cpus here.
>
> u32 var = 0;
> u8 *ptr = &var;
>
> CPU0 CPU1
>
> xchg(ptr, 1)
>
> xchg((ptr+1, 1);
> r = READ_ONCE(var);
>
> AFAICT nothing guarantees r == 0x0101. The CPU1 store can be stuck in
> CPU1's store-buffer. CPU0's xchg() does not overlap and therefore
> doesn't force a snoop or forward.
I think I don't quite understand. The final value of var had better
be 0x0101 or something is severely wrong. But r can be 0x0100 because
nothing in this example guarantees that the total order of the locked
instructions has CPU 1's instruction first.
>
> From the perspective of the LOCK prefixed instructions CPU0 never
> observes the variable @ptr. And therefore doesn't need to provide order.
I suspect that the implementation works on whole cache lines for
everything except the actual store buffer entries, which would mean
that CPU 0 does think it observed ptr[0].
>
> Note how the READ_ONCE() is a normal load on CPU0, and per the rules is
> only forced to happen after it's own LOCK prefixed instruction, but it
> is free to observe ptr[0,2,3] from before, only ptr[1] will be forwarded
> from its own store-buffer.
>
> This is exactly the one reorder TSO allows.
If so, then our optimized smp_mb() has all kinds of problems, no?
>
> > I understand that the CPU is probably permitted to optimize a LOCK RMW
> > operation such that it retires before the store buffers of earlier
> > instructions are fully flushed, but only if the store buffer and cache
> > coherency machinery work together to preserve the architecturally
> > guaranteed ordering.
>
> Maybe, maybe not. I'm very loathe to trust this without things being
> better specified.
>
> Like I said, it is possible that it all works, but the way I understand
> things I _really_ don't want to rely on it.
>
> Therefore, I've written:
>
> u32 var = 0;
> u8 *ptr = &var;
>
> CPU0 CPU1
>
> xchg(ptr, 1)
>
> set_bit(8, ptr);
>
> r = READ_ONCE(var);
>
> Because then the LOCK BTSL overlaps with the LOCK XCHGB and CPU0 now
> observes the variable @ptr and therefore must force order.
>
> Did this clarify, or confuse more?
Probably confuses more.
If you're actual concerned that the SDM is wrong, I think that roping
in some architects would be a good idea.
I still think that making set_bit() do 32-bit or smaller accesses is okay.