Re: [PATCH v2] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

From: Boqun Feng
Date: Thu Jan 05 2017 - 05:08:10 EST


On Thu, Jan 05, 2017 at 04:16:38PM +0800, Pan Xinhui wrote:
>
>
> On 2017/1/4 17:41, Peter Zijlstra wrote:
> > On Tue, Jan 03, 2017 at 05:07:54PM -0500, Waiman Long wrote:
> > > On 01/03/2017 11:18 AM, Peter Zijlstra wrote:
> > > > On Sun, Dec 25, 2016 at 03:26:01PM -0500, Waiman Long wrote:
> > > > > A number of cmpxchg calls in qspinlock_paravirt.h were replaced by more
> > > > > relaxed versions to improve performance on architectures that use LL/SC.
> > > > Claim without numbers ;-)
> > >
> > > Well it is hard to produce actual numbers here as I don't have the setup
> > > to gather data.
> >
> > Surely RHT has big PPC machines around? I know that getting to them is a
> > wee bit of a bother, but they should be available somewhere.
> >
> hi,
>
> I ran some tests of cmpxchg and cmpxchg_acquire on ppc a while ago.
>
> The number of loops completed in 15s for each cmpxchg variant is below.
>
> cmpxchg_relaxed: 336663
> cmpxchg_release: 369054
> cmpxchg_acquire: 363364
> cmpxchg: 179435
>
> So cmpxchg is much more expensive than the others.
> But I also have doubts about cmpxchg_relaxed: it should be the cheapest, but in these tests release/acquire are faster than it.
>

I have observed something similar before, but the performance number for
a single atomic operation in isolation is not that useful.

Here is my understanding (basically guessing ;-)):

If your testcase only issues those cmpxchg calls in a loop, then the
overhead of the barriers in _release and _acquire is quite small, and
the barriers may even help performance because of their side effects on
prefetching or cache invalidation.

But if your testcase gets complex enough that committing the barriers is
no longer cheap, you will probably see cmpxchg_relaxed beat the _acquire
and _release variants.
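
To make the two cases concrete, here is a rough userspace sketch. It is
not the kernel code: C11 atomics only approximate cmpxchg_relaxed(),
cmpxchg_acquire(), cmpxchg_release() and the fully ordered cmpxchg(), and
the names DEFINE_CAS_LOOP, cas_loop_*, SCRATCH_WORDS and extra_stores are
made up for illustration. With extra_stores == 0 it is the "only cmpxchg
in a loop" case; a larger extra_stores puts unrelated stores in flight
around each CAS, which is where I would guess the barriers start to cost
something.

```c
#include <stdatomic.h>
#include <stddef.h>

#define SCRATCH_WORDS 64	/* hypothetical size of the "other work" buffer */

static _Atomic unsigned long counter;
static volatile unsigned long scratch[SCRATCH_WORDS];

/*
 * Generate one loop per memory order so the order is a compile-time
 * constant. Each iteration does `extra_stores` unrelated stores and then
 * one CAS increment; the function returns the final counter value.
 */
#define DEFINE_CAS_LOOP(name, order)					\
static unsigned long name(unsigned long iters, size_t extra_stores)	\
{									\
	atomic_store_explicit(&counter, 0, memory_order_relaxed);	\
	for (unsigned long i = 0; i < iters; i++) {			\
		unsigned long old;					\
		for (size_t j = 0; j < extra_stores; j++)		\
			scratch[j % SCRATCH_WORDS] = i + j;		\
		old = atomic_load_explicit(&counter,			\
					   memory_order_relaxed);	\
		/* On failure `old` is reloaded, so just retry. */	\
		while (!atomic_compare_exchange_strong_explicit(	\
				&counter, &old, old + 1,		\
				order, memory_order_relaxed))		\
			;						\
	}								\
	return atomic_load_explicit(&counter, memory_order_relaxed);	\
}

DEFINE_CAS_LOOP(cas_loop_relaxed, memory_order_relaxed)
DEFINE_CAS_LOOP(cas_loop_acquire, memory_order_acquire)
DEFINE_CAS_LOOP(cas_loop_release, memory_order_release)
DEFINE_CAS_LOOP(cas_loop_full, memory_order_seq_cst)
```

Timing each cas_loop_*() for a fixed iteration count, once with
extra_stores == 0 and once with a larger value, should show whether the
ordering between the variants changes on a given machine.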

Regards,
Boqun

> thanks
> xinhui
>
