Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible
From: Ingo Molnar
Date: Thu Mar 24 2011 - 13:20:06 EST
* Jan Beulich <JBeulich@xxxxxxxxxx> wrote:
> >>> On 24.03.11 at 15:52, Borislav Petkov <bp@xxxxxxxxx> wrote:
>
> (haven't seen Ingo's original reply, so responding here)
>
> > On Thu, Mar 24, 2011 at 04:56:47AM -0400, Ingo Molnar wrote:
> >>
> >> * Nikanth Karthikesan <knikanth@xxxxxxx> wrote:
> >>
> >> > On x86_64 SMP with lots of CPU atomic instructions which assert the LOCK #
> >> > signal can stall other CPUs. And as the number of cores increase this penalty
> >> > scales proportionately. So it is best to try and avoid atomic instructions
> >> > wherever possible. test_and_set_bit_lock() can avoid using LOCK_PREFIX if it
> >> > finds the bit set already.
> >> >
> >> > Signed-off-by: Nikanth Karthikesan <knikanth@xxxxxxx>
> >
> > [..]
> >
> >> > + * test_and_set_bit_lock - Set a bit and return its old value for lock
> >> > + * @nr: Bit to set
> >> > + * @addr: Address to count from
> >> > + *
> >> > + * This is the same as test_and_set_bit on x86. But atomic operation is
> >> > + * avoided, if the bit was already set.
> >> > + */
> >> > +static __always_inline int
> >> > +test_and_set_bit_lock(int nr, volatile unsigned long *addr)
> >> > +{
> >> > +#ifdef CONFIG_SMP
> >> > +	barrier();
> >> > +	if (test_bit(nr, addr))
> >> > +		return 1;
> >> > +#endif
> >> > +	return test_and_set_bit(nr, addr);
> >> > +}
> >>
> >> On modern x86 CPUs there's no "#LOCK signal" anymore - it's replaced
> >> by a M[O]ESI cache coherency bus. I'd expect modern x86 CPUs to be
> >> pretty fast when the cacheline is local and the bit is set already.
>
> Are you certain? Iirc the lock prefix implies minimally a read-for-
> ownership (if CPUs are really smart enough to optimize away the
> write - I wonder whether that would be correct at all when it
> comes to locked operations), which means a cacheline can still be
> bouncing heavily.
Yeah. On what workload was this?

Generally you use test_and_set_bit() if you expect the bit to be 'owned' by
whoever calls it, and released by someone else.
It would be really useful to run 'perf top' on an affected box and see which
kernel function causes this. It might be better to add a test_bit() to the
affected code path, rather than bloating all test_and_set_bit() users.
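
A minimal userspace sketch of what such a call-site check could look like,
using C11 atomics as a stand-in for the kernel bitops. All sketch_* names are
hypothetical and are not kernel APIs, and which call site would want this is
an assumption until the profile says so:

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for test_bit(): a plain load, no locked instruction. */
static bool sketch_test_bit(unsigned int nr, _Atomic unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % SKETCH_BITS_PER_LONG);

	return atomic_load_explicit(&addr[nr / SKETCH_BITS_PER_LONG],
				    memory_order_relaxed) & mask;
}

/*
 * Hypothetical stand-in for test_and_set_bit(): an atomic read-modify-write
 * that returns the old value of the bit.
 */
static bool sketch_test_and_set_bit(unsigned int nr, _Atomic unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % SKETCH_BITS_PER_LONG);
	unsigned long old;

	old = atomic_fetch_or_explicit(&addr[nr / SKETCH_BITS_PER_LONG], mask,
				       memory_order_acquire);
	return old & mask;
}

/* Release the bit lock again. */
static void sketch_clear_bit_unlock(unsigned int nr, _Atomic unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % SKETCH_BITS_PER_LONG);

	atomic_fetch_and_explicit(&addr[nr / SKETCH_BITS_PER_LONG], ~mask,
				  memory_order_release);
}

/*
 * The one contended call site: peek at the bit first and only fall through
 * to the atomic operation when the bit looks clear. Returns true if the
 * caller now owns the bit.
 */
static bool sketch_trylock_bit(unsigned int nr, _Atomic unsigned long *addr)
{
	if (sketch_test_bit(nr, addr))
		return false;	/* already owned by someone else, skip the atomic */

	return !sketch_test_and_set_bit(nr, addr);
}
```

Only the caller that the profile points at would use sketch_trylock_bit();
every other user keeps calling the unconditional atomic variant and pays no
extra load or branch.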
Note that the patch can also add overhead: the test_bit() can miss the cache
and bring in the cacheline in shared state, and the subsequent
test_and_set_bit() call then has to dirty that cacheline - so the CPU might
miss again and have to wait for other CPUs to first flush this cacheline.

So we really need more details here.
Thanks,
Ingo