Re: [PATCH v2] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

From: Pan Xinhui
Date: Wed Feb 08 2017 - 01:49:22 EST

å 2017/2/8 14:09, Boqun Feng åé:
On Wed, Feb 08, 2017 at 12:05:40PM +0800, Boqun Feng wrote:
On Wed, Feb 08, 2017 at 11:39:10AM +0800, Xinhui Pan wrote:
2016-12-26 4:26 GMT+08:00 Waiman Long <longman@xxxxxxxxxx>:

A number of cmpxchg calls in qspinlock_paravirt.h were replaced by more
relaxed versions to improve performance on architectures that use LL/SC.

All the locking related cmpxchg's are replaced with the _acquire
- pv_queued_spin_steal_lock()
- trylock_clear_pending()

The cmpxchg's related to hashing are replaced by either by the _release
or the _relaxed variants. See the inline comment for details.

Signed-off-by: Waiman Long <longman@xxxxxxxxxx>

- Add comments in changelog and code for the rationale of the change.

kernel/locking/qspinlock_paravirt.h | 50 ++++++++++++++++++++++++------
1 file changed, 33 insertions(+), 17 deletions(-)

@@ -323,8 +329,14 @@ static void pv_wait_node(struct mcs_spinlock *node,
struct mcs_spinlock *prev)
* If pv_kick_node() changed us to vcpu_hashed, retain that
* value so that pv_wait_head_or_lock() knows to not also
* to hash this lock.
+ *
+ * The smp_store_mb() and control dependency above will
+ * that state change won't happen before that.
+ * with pv_kick_node() wrt hashing by this waiter or by the
+ * lock holder is done solely by the state variable. There
+ * no other ordering requirement.
- cmpxchg(&pn->state, vcpu_halted, vcpu_running);
+ cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_running);

* If the locked flag is still not set after wakeup, it is
@@ -360,9 +372,12 @@ static void pv_kick_node(struct qspinlock *lock,
struct mcs_spinlock *node)
* pv_wait_node(). If OTOH this fails, the vCPU was running and
* observe its next->locked value and advance itself.
- * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
+ * Matches with smp_store_mb() and cmpxchg_relaxed() in
+ * A release barrier is used here to ensure that node->locked is
+ * always set before changing the state. See comment in
- if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
+ if (cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
+ != vcpu_halted)

hi, Waiman
We can't use _release here, a full barrier is needed.

There is pv_kick_node vs pv_wait_head_or_lock

[w] l->locked = _Q_SLOW_VAL //reordered here

if (READ_ONCE(pn->state) == vcpu_hashed) //False.

lp = (struct qspinlock **)1;

[STORE] pn->state = vcpu_hashed lp = pv_hash(lock,
pv_hash() if
(xchg(&l->locked, _Q_SLOW_VAL) == 0) // fasle, not unhashed.

This analysis is correct, but..

Hmm.. look at this again, I don't think this analysis is meaningful,
let's say the reordering didn't happen, we still got(similar to your

if (READ_ONCE(pn->state) == vcpu_hashed) // false.
lp = (struct qspinlock **)1;

cmpxchg(pn->state, vcpu_halted, vcpu_hashed);
if(!lp) {
lp = pv_hash(lock, pn);
WRITE_ONCE(l->locked, _Q_SLOW_VAL);
if (xchg(&l->locked, _Q_SLOW_VAL) == 0) // fasle, not unhashed.

, right?

Actually, I think this or your case could not happen because we have

cmpxchg(pn->state, vcpu_halted, vcpu_running);

in pv_wait_node(), which makes us either observe vcpu_hashed or set
pn->state to vcpu_running before pv_kick_node() trying to do the hash.

yep, there is still a race. We have to fix it.
so I think
we must check old = xchg(&l->locked, _Q_SLOW_VAL)
if (old == 0)
do something
else if (old == _Q_SLOW_VAL)
do something else

I may miss something subtle, but does switching back to cmpxchg() could
fix the RCU stall you observed?

yes, just fix this cmpxchg and then no RCU stall.


Then the same lock has hashed twice but only unhashed once. So at last as
the hash table grows big, we hit RCU stall.

I hit RCU stall when I run netperf benchmark

how will a big hash table hit RCU stall? Do you have the call trace for
your RCU stall?

maybe too many time on hashing? I am not sure.