Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
From: Boqun Feng
Date: Sun Feb 19 2017 - 23:58:41 EST
On Mon, Feb 20, 2017 at 05:20:52AM +0100, Andrea Parri wrote:
> On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> > All the locking related cmpxchg's in the following functions are
> > replaced with the _acquire variants:
> > - pv_queued_spin_steal_lock()
> > - trylock_clear_pending()
> >
> > This change should help performance on architectures that use LL/SC.
> >
> > On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> > enabled, the performance of a locking microbenchmark with and without
> > this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> > were as follows:
> >
> >   # of thread     w/o patch      with patch     % Change
> >   -----------     ---------      ----------     --------
> >        4          4053.3 Mop/s   4223.7 Mop/s    +4.2%
> >        8          3310.4 Mop/s   3406.0 Mop/s    +2.9%
> >       12          2576.4 Mop/s   2674.6 Mop/s    +3.8%
> >
> > Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
> > ---
> >
> > v2->v3:
> > - Reduce scope by relaxing cmpxchg's in fast path only.
> >
> > v1->v2:
> > - Add comments in changelog and code for the rationale of the change.
> >
> > kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
> > 1 file changed, 9 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> > index e6b2f7a..a59dc05 100644
> > --- a/kernel/locking/qspinlock_paravirt.h
> > +++ b/kernel/locking/qspinlock_paravirt.h
> > @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
> > struct __qspinlock *l = (void *)lock;
> >
> > if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> > - (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > + (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > qstat_inc(qstat_pv_lock_stealing, true);
> > return true;
> > }
> > @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
> >
> > /*
> > * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> > - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> > - * just to be sure that it will get it.
> > + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> > + * lock just to be sure that it will get it.
> > */
> > static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> > {
> > struct __qspinlock *l = (void *)lock;
> >
> > return !READ_ONCE(l->locked) &&
> > - (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> > - == _Q_PENDING_VAL);
> > + (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> > + _Q_LOCKED_VAL) == _Q_PENDING_VAL);
> > }
> > #else /* _Q_PENDING_BITS == 8 */
> > static __always_inline void set_pending(struct qspinlock *lock)
> > @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> > */
> > old = val;
> > new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> > - val = atomic_cmpxchg(&lock->val, old, new);
> > + val = atomic_cmpxchg_acquire(&lock->val, old, new);
> >
> > if (val == old)
> > return 1;
> > @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> > * observe its next->locked value and advance itself.
> > *
> > * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> > + *
> > + * We can't use a relaxed form of cmpxchg here as the load of
> > + * pn->state can happen before the setting of next->locked on some archs.
> > */
> > if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>
> Hi Waiman.
>
> cmpxchg() does not guarantee the (here implied) smp_mb(), in general; cf.,
> e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() gets "compiled"
> to something like:
>
> _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:
>
Yes, sorry for being late on this one.
So Waiman, the point is that in this case we want the following code
sequence:
    CPU 0 (pv_wait_node)                    CPU 1 (pv_kick_node)
    ====================                    ====================
    {pn->state = vcpu_running, node->locked = 0}

    smp_store_mb(pn->state, vcpu_halted):
      WRITE_ONCE(pn->state, vcpu_halted);
      smp_mb();
    r1 = READ_ONCE(node->locked);
                                            arch_mcs_spin_unlock_contended():
                                              WRITE_ONCE(node->locked, 1);
                                            cmpxchg(&pn->state, vcpu_halted, vcpu_hashed);
to never end up with:

    r1 == 0 && the cmpxchg failing (i.e. the read part of the cmpxchg
    reads the value vcpu_running).
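To make the requirement concrete, here is a stand-alone user-space toy
translation of the pattern above (my own sketch in C11 atomics, not
kernel code; the names only mimic the kernel ones). The bad outcome is
forbidden only when *both* sides have a full barrier between their
store and their load:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

enum vcpu_state { vcpu_running, vcpu_halted, vcpu_hashed };

static atomic_int state  = vcpu_running;        /* stands in for pn->state    */
static atomic_int locked = 0;                   /* stands in for node->locked */
static int r1, cas_ok;

static void *cpu0(void *arg)                    /* pv_wait_node() side */
{
        /* models smp_store_mb(pn->state, vcpu_halted): store + full barrier */
        atomic_store_explicit(&state, vcpu_halted, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        r1 = atomic_load_explicit(&locked, memory_order_relaxed);
        return NULL;
}

static void *cpu1(void *arg)                    /* pv_kick_node() side */
{
        int expected = vcpu_halted;

        /* models the release store of arch_mcs_spin_unlock_contended() */
        atomic_store_explicit(&locked, 1, memory_order_release);
        /*
         * A full store->load barrier is needed here; on PPC the full
         * cmpxchg() provides it, the "fix" below makes it explicit.
         */
        atomic_thread_fence(memory_order_seq_cst);
        cas_ok = atomic_compare_exchange_strong_explicit(&state, &expected,
                        vcpu_hashed, memory_order_relaxed, memory_order_relaxed);
        return NULL;
}

int main(void)
{
        pthread_t t0, t1;

        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        /* with both barriers in place, "r1 == 0 && !cas_ok" cannot happen */
        printf("r1 = %d, cmpxchg %s\n", r1, cas_ok ? "succeeded" : "failed");
        return 0;
}

(Build with something like "gcc -pthread"; a single run of course proves
nothing by itself, the sketch is only meant to show where the barriers
have to sit.)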
We can have such a guarantee if the cmpxchg has an smp_mb() before its
load part, which is true for PPC. But semantically, cmpxchg() doesn't
provide any ordering guarantee if it fails, which is the case on ARM64,
IIUC. (Adding Will to Cc for his insight ;-)).
So a possible "fix" (in case ARM64 uses qspinlock some day) would be to
replace the cmpxchg() with smp_mb() + cmpxchg_relaxed().
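Just to spell the idea out (only a sketch of what I mean, not a tested
patch), the check in pv_kick_node() would then look something like:

        /*
         * Make the full barrier explicit: order the earlier store to
         * next->locked (done in the caller via
         * arch_mcs_spin_unlock_contended()) against the load part of
         * the cmpxchg below, even when the cmpxchg fails, instead of
         * relying on the failed-cmpxchg ordering of the architecture.
         */
        smp_mb();
        if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
                return;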
Regards,
Boqun
> Andrea
>
>
> > return;
> > --
> > 1.8.3.1
> >