RE: [PATCH v10 6/6] x86/split_lock: Enable split lock detection by kernel parameter

From: David Laight
Date: Thu Dec 12 2019 - 05:36:36 EST


From: Peter Zijlstra
> Sent: 11 December 2019 22:39
> On Wed, Dec 11, 2019 at 10:44:16AM -0800, Luck, Tony wrote:
> > On Wed, Dec 11, 2019 at 06:52:02PM +0100, Peter Zijlstra wrote:
> > > Sure, but we're talking two cpus here.
> > >
> > > u32 var = 0;
> > > u8 *ptr = &var;
> > >
> > > CPU0 CPU1
> > >
> > > xchg(ptr, 1)
> > >
> > > xchg((ptr+1, 1);
> > > r = READ_ONCE(var);
> >
> > It looks like our current implementation of set_bit() would already run
> > into this if some call sites for a particular bitmap `pass in constant
> > bit positions (which get optimized to byte wide "orb") while others pass
> > in a variable bit (which execute as 64-bit "bts").
>
> Yes, but luckily most nobody cares.
>
> I only know of two places in the entire kernel where we considered this,
> one is clear_bit_unlock_is_negative_byte() and there we punted and
> stuffed everything in a single byte, and the other is that x86
> queued_fetch_set_pending_acquire() thing I pointed out earlier.
>
> > I'm not a h/w architect ... but I've assumed that a LOCK operation
> > on something contained entirely within a cache line gets its atomicity
> > by keeping exclusive ownership of the cache line.
>
> Right, but like I just wrote to Andy, consider SMT where each thread has
> its own store-buffer. Then the line is local to the core, but there
> still is a remote sb to hide stores in.
>
> I don't know if anything x86 does that, or even allows that, but I'm not
> aware of specs that are clear enough to say either way.

On x86 'xchg' is always 'locked' regardless of whether there is a 'lock' prefix.
set_bit() (etc) include the 'lock' prefix (dunno why this decision was made...).

For locked operations (including misaligned ones) that don't cross cache-line
boundaries the read operation almost certainly locks the cache line (against
a snoop) until the write has updated the cache line.
This won't happen until the write 'drains' from the store buffer.
(I suspect that locked read requests act like write requests in ensuring
that no other cpu has a dirty copy of the cache line, and also marking it dirty.)
Although this will delay the response to the snoop it will only
stall the cpu (or other bus master), not the entire memory 'bus'.

If you read the description of 'lock btr' you'll see that it always does the
write cycle (to complete the atomic RMW expected by the memory
subsystem) even when the bit is clear.

Remote store buffers are irrelevant to locked accesses.
(If you are doing concurrent locked and unlocked accesses to the same
memory location something is badly broken.)

It really can't matter whether one access is a mis-aligned 64bit word
and the other a byte. Both do atomic RMW updates so the result
cannot be unexpected.

In principle two separate 8 bit RMW cycles could be done concurrently
to two halves of a 16 bit 'flag' word without losing any bits or any reads
returning any of the expected 4 values.
Not that any memory system would support such updates.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)