Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible
From: Jan Beulich
Date: Thu Mar 24 2011 - 12:47:25 EST
>>> On 24.03.11 at 15:52, Borislav Petkov <bp@xxxxxxxxx> wrote:
(haven't seen Ingo's original reply, so responding here)
> On Thu, Mar 24, 2011 at 04:56:47AM -0400, Ingo Molnar wrote:
>>
>> * Nikanth Karthikesan <knikanth@xxxxxxx> wrote:
>>
>> > On x86_64 SMP with lots of CPU atomic instructions which assert the LOCK #
>> > signal can stall other CPUs. And as the number of cores increase this penalty
>> > scales proportionately. So it is best to try and avoid atomic instructions
>> > wherever possible. test_and_set_bit_lock() can avoid using LOCK_PREFIX if it
>> > finds the bit set already.
>> >
>> > Signed-off-by: Nikanth Karthikesan <knikanth@xxxxxxx>
>
> [..]
>
>> > + * test_and_set_bit_lock - Set a bit and return its old value for lock
>> > + * @nr: Bit to set
>> > + * @addr: Address to count from
>> > + *
>> > + * This is the same as test_and_set_bit on x86. But atomic operation is
>> > + * avoided, if the bit was already set.
>> > + */
>> > +static __always_inline int
>> > +test_and_set_bit_lock(int nr, volatile unsigned long *addr)
>> > +{
>> > +#ifdef CONFIG_SMP
>> > +	barrier();
>> > +	if (test_bit(nr, addr))
>> > +		return 1;
>> > +#endif
>> > +	return test_and_set_bit(nr, addr);
>> > +}
>>
>> On modern x86 CPUs there's no "#LOCK signal" anymore - it's replaced
>> by a M[O]ESI cache coherency bus. I'd expect modern x86 CPUs to be
>> pretty fast when the cacheline is local and the bit is set already.
Are you certain? IIRC the lock prefix implies at minimum a
read-for-ownership (unless CPUs are really smart enough to optimize
away the write, and I wonder whether that would be correct at all
for locked operations), which means a cacheline can still be
bouncing heavily.
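To make the difference concrete, here is a rough user-space sketch of the
two variants using C11 atomics instead of the kernel bitops. The names
tas_bit() and ttas_bit() are made up for illustration and are not the
kernel API. The point is only that the second variant does a plain load
first, which can be satisfied from a shared cacheline, and falls through
to the locked read-modify-write only when the bit looks clear.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Unconditional variant: always issues an atomic read-modify-write,
 * so the cacheline has to be obtained for ownership even when the
 * bit is already set. Returns the old value of the bit.
 */
static bool tas_bit(unsigned int nr, _Atomic unsigned long *word)
{
	assert(nr < BITS_PER_WORD);
	unsigned long mask = 1UL << nr;
	unsigned long old = atomic_fetch_or_explicit(word, mask,
						     memory_order_acquire);
	return (old & mask) != 0;
}

/*
 * Test-first variant (hypothetical stand-in for the proposed
 * test_and_set_bit_lock): a plain load comes first, and the atomic
 * operation is skipped if the bit is seen set. No ordering is needed
 * on the early return, since the lock was not acquired.
 */
static bool ttas_bit(unsigned int nr, _Atomic unsigned long *word)
{
	assert(nr < BITS_PER_WORD);
	unsigned long mask = 1UL << nr;

	if (atomic_load_explicit(word, memory_order_relaxed) & mask)
		return true;
	return tas_bit(nr, word);
}
```

Both functions return the same value for the same starting state; they
differ only in whether a write request for the cacheline is issued when
the bit is already set.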
>> So you really need to back up your patch with actual hard numbers.
>> Putting this code into user-space and using pthreads to loop on
>> the same global variable and testing the before/after effect would
>> be sufficient i think. You can use 'perf stat --repeat 10' kind of
>> measurements to see whether there's any improvement larger than the
>> noise of the measurement.
>
> and Ingo's question is right on the money - is this speedup noticeable
> or does it simply disappear in the noise?
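For anyone wanting to try the user-space experiment Ingo describes, a
harness along the following lines might do. This is only a sketch under
my own assumptions: run_contended(), try_lock() and the 64-thread cap are
invented for illustration, and the lock is a single bit in a global word
rather than anything taken from the kernel. It does no timing itself; the
idea would be to call run_contended() from a small main() and compare the
two settings of check_first under 'perf stat --repeat 10'.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static _Atomic unsigned long lock_word;	/* bit 0 is the lock bit */
static long shared_counter;		/* protected by the lock bit */

struct worker_arg {
	long iters;		/* lock acquisitions per thread */
	bool check_first;	/* true: plain load before the atomic op */
};

/* Try to take the lock bit once; returns true if we acquired it. */
static bool try_lock(bool check_first)
{
	if (check_first &&
	    (atomic_load_explicit(&lock_word, memory_order_relaxed) & 1UL))
		return false;	/* already held, skip the atomic op */
	return !(atomic_fetch_or_explicit(&lock_word, 1UL,
					  memory_order_acquire) & 1UL);
}

static void unlock(void)
{
	atomic_fetch_and_explicit(&lock_word, ~1UL, memory_order_release);
}

static void *worker(void *p)
{
	const struct worker_arg *arg = p;

	for (long i = 0; i < arg->iters; i++) {
		while (!try_lock(arg->check_first))
			;	/* spin on the shared word */
		shared_counter++;
		unlock();
	}
	return NULL;
}

/*
 * Run nthreads workers contending on the same lock bit. Returns the
 * final counter (nthreads * iters if mutual exclusion held), or -1 if
 * a thread could not be created.
 */
static long run_contended(int nthreads, long iters, bool check_first)
{
	pthread_t tid[64];
	struct worker_arg arg = { iters, check_first };
	int started;

	assert(nthreads >= 1 && nthreads <= 64);
	atomic_store(&lock_word, 0);
	shared_counter = 0;

	for (started = 0; started < nthreads; started++)
		if (pthread_create(&tid[started], NULL, worker, &arg) != 0)
			break;
	for (int i = 0; i < started; i++)
		pthread_join(tid[i], NULL);

	return started == nthreads ? shared_counter : -1;
}
```

The counter check only confirms that both variants still provide mutual
exclusion; whether the test-first variant is measurably faster is exactly
what the perf runs would have to show, and that will depend on the
machine and the thread count.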
This cacheline bouncing was actually observed and measured
on SGI UV systems, but I'm not certain we're permitted to publish
that data. I'm copying the two SGI guys who had reported that
issue (and the special case fix, which Nikanth simply generalized)
to us, for them to decide.
Jan