On Wed, Feb 08, 2017 at 12:05:40PM +0800, Boqun Feng wrote:yep, there is still a race. We have to fix it.
On Wed, Feb 08, 2017 at 11:39:10AM +0800, Xinhui Pan wrote:
2016-12-26 4:26 GMT+08:00 Waiman Long <longman@xxxxxxxxxx>:
A number of cmpxchg calls in qspinlock_paravirt.h were replaced by moreWe can't use _release here, a full barrier is needed.
relaxed versions to improve performance on architectures that use LL/SC.
All the locking related cmpxchg's are replaced with the _acquire
The cmpxchg's related to hashing are replaced by either by the _release
or the _relaxed variants. See the inline comment for details.
Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
- Add comments in changelog and code for the rationale of the change.
kernel/locking/qspinlock_paravirt.h | 50 ++++++++++++++++++++++++------
1 file changed, 33 insertions(+), 17 deletions(-)
@@ -323,8 +329,14 @@ static void pv_wait_node(struct mcs_spinlock *node,
struct mcs_spinlock *prev)
* If pv_kick_node() changed us to vcpu_hashed, retain that
* value so that pv_wait_head_or_lock() knows to not also
* to hash this lock.
+ * The smp_store_mb() and control dependency above will
+ * that state change won't happen before that.
+ * with pv_kick_node() wrt hashing by this waiter or by the
+ * lock holder is done solely by the state variable. There
+ * no other ordering requirement.
- cmpxchg(&pn->state, vcpu_halted, vcpu_running);
+ cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_running);
* If the locked flag is still not set after wakeup, it is
@@ -360,9 +372,12 @@ static void pv_kick_node(struct qspinlock *lock,
struct mcs_spinlock *node)
* pv_wait_node(). If OTOH this fails, the vCPU was running and
* observe its next->locked value and advance itself.
- * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
+ * Matches with smp_store_mb() and cmpxchg_relaxed() in
+ * A release barrier is used here to ensure that node->locked is
+ * always set before changing the state. See comment in
- if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
+ if (cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
+ != vcpu_halted)
There is pv_kick_node vs pv_wait_head_or_lock
[w] l->locked = _Q_SLOW_VAL //reordered here
if (READ_ONCE(pn->state) == vcpu_hashed) //False.
lp = (struct qspinlock **)1;
[STORE] pn->state = vcpu_hashed lp = pv_hash(lock,
(xchg(&l->locked, _Q_SLOW_VAL) == 0) // fasle, not unhashed.
This analysis is correct, but..
Hmm.. look at this again, I don't think this analysis is meaningful,
let's say the reordering didn't happen, we still got(similar to your
if (READ_ONCE(pn->state) == vcpu_hashed) // false.
lp = (struct qspinlock **)1;
cmpxchg(pn->state, vcpu_halted, vcpu_hashed);
lp = pv_hash(lock, pn);
if (xchg(&l->locked, _Q_SLOW_VAL) == 0) // fasle, not unhashed.
Actually, I think this or your case could not happen because we have
cmpxchg(pn->state, vcpu_halted, vcpu_running);
in pv_wait_node(), which makes us either observe vcpu_hashed or set
pn->state to vcpu_running before pv_kick_node() trying to do the hash.
I may miss something subtle, but does switching back to cmpxchg() couldyes, just fix this cmpxchg and then no RCU stall.
fix the RCU stall you observed?
Then the same lock has hashed twice but only unhashed once. So at last as
the hash table grows big, we hit RCU stall.
I hit RCU stall when I run netperf benchmark
how will a big hash table hit RCU stall? Do you have the call trace for
your RCU stall?