Re: [PATCH 1/7] locking/pvqspinlock: Only kick CPU at unlock time

From: Peter Zijlstra
Date: Mon Jul 13 2015 - 09:48:41 EST

Next message: Peter Zijlstra: "Re: [PATCH 2/7] locking/pvqspinlock: Allow vCPUs kick-ahead"
Previous message: Paolo Bonzini: "Re: [v5 15/19] KVM: eventfd: add irq bypass consumer management"
In reply to: Peter Zijlstra: "Re: [PATCH 1/7] locking/pvqspinlock: Only kick CPU at unlock time"
Next in thread: Waiman Long: "[PATCH 2/7] locking/pvqspinlock: Allow vCPUs kick-ahead"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sat, Jul 11, 2015 at 04:36:52PM -0400, Waiman Long wrote:
> @@ -229,19 +244,42 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
> {
> struct pv_node *pn = (struct pv_node *)node;
> struct __qspinlock *l = (void *)lock;
> - struct qspinlock **lp = NULL;
> + struct qspinlock **lp;
> int loop;
>
> + /*
> + * Initialize lp to a non-NULL value if it has already been in the
> + * pv_hashed state so that pv_hash() won't be called again.
> + */
> + lp = (READ_ONCE(pn->state) == vcpu_hashed) ? (struct qspinlock **)1
> + : NULL;
> for (;;) {
> + WRITE_ONCE(pn->state, vcpu_running);
> for (loop = SPIN_THRESHOLD; loop; loop--) {
> if (!READ_ONCE(l->locked))
> return;
> cpu_relax();
> }
>
> - WRITE_ONCE(pn->state, vcpu_halted);
> + /*
> + * Recheck lock value after setting vcpu_hashed state
> + *
> + * [S] state = vcpu_hashed [S] l->locked = 0
> + * MB MB
> + * [L] l->locked [L] state == vcpu_hashed
> + *
> + * Matches smp_store_mb() in __pv_queued_spin_unlock()
> + */
> + smp_store_mb(pn->state, vcpu_hashed);
> +
> + if (!READ_ONCE(l->locked)) {
> + WRITE_ONCE(pn->state, vcpu_running);
> + return;
> + }
> +
> if (!lp) { /* ONCE */
> lp = pv_hash(lock, pn);
> +
> /*
> * lp must be set before setting _Q_SLOW_VAL
> *
> @@ -305,13 +343,16 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
> * Now that we have a reference to the (likely) blocked pv_node,
> * release the lock.
> */
> - smp_store_release(&l->locked, 0);
> + smp_store_mb(l->locked, 0);
>
> /*
> * At this point the memory pointed at by lock can be freed/reused,
> * however we can still use the pv_node to kick the CPU.
> + * The other vCPU may not really be halted, but kicking an active
> + * vCPU is harmless other than the additional latency in completing
> + * the unlock.
> */
> - if (READ_ONCE(node->state) == vcpu_halted)
> + if (READ_ONCE(node->state) == vcpu_hashed)
> pv_kick(node->cpu);
> }

I think most of that is not actually required; if we let pv_kick_node()
set vcpu_hashed and avoid writing another value in pv_wait_head(), then
__pv_queued_spin_unlock() has two cases:

- pv_kick_node() set _SLOW_VAL, which is the same 'thread' and things
observe program order and we're trivially guaranteed to see
node->state and the hash state.

- pv_wait_head() set it, and because it does the @lp branch the
existing barrier from our cmpxchg() there setting _SLOW_VAL matches
the cmpxchg() in __pv_queued_spin_unlock() observing _SLOW_VAL
and we still observe the right node and hash values.

Giving something like the patch below.

On the s/pv_scan_node/pv_kick_node/ rename; conceptually we still make
the waiter go, the fact that we don't wake it is an implementation
detail.

---
Subject: locking/pvqspinlock: Only kick CPU at unlock time
From: Waiman Long <Waiman.Long@xxxxxx>
Date: Sat, 11 Jul 2015 16:36:52 -0400

For an over-committed guest with more vCPUs than physical CPUs
available, it is possible that a vCPU may be kicked twice before
getting the lock - once before it becomes queue head and once again
before it gets the lock. All these CPU kicking and halting (VMEXIT)
can be expensive and slow down system performance.

This patch adds a new vCPU state (vcpu_hashed) which enables the code
to delay CPU kicking until at unlock time. Once this state is set,
the new lock holder will set _Q_SLOW_VAL and fill in the hash table
on behalf of the halted queue head vCPU. The original vcpu_halted
state will be used by pv_wait_node() only to differentiate other
queue nodes from the qeue head.

Cc: Scott J Norton <scott.norton@xxxxxx>
Cc: Douglas Hatch <doug.hatch@xxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
Signed-off-by: Waiman Long <Waiman.Long@xxxxxx>
[peterz: s/pv_scan_node/pv_kick_node/, simplification of pv_wait_head()]
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Link: http://lkml.kernel.org/r/1436647018-49734-2-git-send-email-Waiman.Long@xxxxxx
---
kernel/locking/qspinlock.c | 6 +--
kernel/locking/qspinlock_paravirt.h | 72 +++++++++++++++++++++++++++---------
2 files changed, 57 insertions(+), 21 deletions(-)

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -239,8 +239,8 @@ static __always_inline void set_locked(s

static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
-
+static __always_inline void __pv_kick_node(struct qspinlock *lock,
+ struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_head(struct qspinlock *lock,
struct mcs_spinlock *node) { }

@@ -440,7 +440,7 @@ void queued_spin_lock_slowpath(struct qs
cpu_relax();

arch_mcs_spin_unlock_contended(&next->locked);
- pv_kick_node(next);
+ pv_kick_node(lock, next);

release:
/*
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -21,9 +21,14 @@

#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)

+/*
+ * Queue node uses: vcpu_running & vcpu_halted.
+ * Queue head uses: vcpu_running & vcpu_hashed.
+ */
enum vcpu_state {
vcpu_running = 0,
- vcpu_halted,
+ vcpu_halted, /* Used only in pv_wait_node */
+ vcpu_hashed, /* = pv_hash'ed + vcpu_halted */
};

struct pv_node {
@@ -152,7 +157,8 @@ static void pv_init_node(struct mcs_spin

/*
* Wait for node->locked to become true, halt the vcpu after a short spin.
- * pv_kick_node() is used to wake the vcpu again.
+ * pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
+ * behalf.
*/
static void pv_wait_node(struct mcs_spinlock *node)
{
@@ -171,9 +177,9 @@ static void pv_wait_node(struct mcs_spin
*
* [S] pn->state = vcpu_halted [S] next->locked = 1
* MB MB
- * [L] pn->locked [RmW] pn->state = vcpu_running
+ * [L] pn->locked [RmW] pn->state = vcpu_hashed
*
- * Matches the xchg() from pv_kick_node().
+ * Matches the cmpxchg() from pv_kick_node().
*/
smp_store_mb(pn->state, vcpu_halted);

@@ -181,9 +187,10 @@ static void pv_wait_node(struct mcs_spin
pv_wait(&pn->state, vcpu_halted);

/*
- * Reset the vCPU state to avoid unncessary CPU kicking
+ * If pv_kick_node() changed us to vcpu_hashed, retain that value
+ * so that pv_wait_head() knows to not also try to hash this lock.
*/
- WRITE_ONCE(pn->state, vcpu_running);
+ cmpxchg(&pn->state, vcpu_halted, vcpu_running);

/*
* If the locked flag is still not set after wakeup, it is a
@@ -193,6 +200,7 @@ static void pv_wait_node(struct mcs_spin
* MCS lock will be released soon.
*/
}
+
/*
* By now our node->locked should be 1 and our caller will not actually
* spin-wait for it. We do however rely on our caller to do a
@@ -201,24 +209,35 @@ static void pv_wait_node(struct mcs_spin
}

/*
- * Called after setting next->locked = 1, used to wake those stuck in
- * pv_wait_node().
+ * Called after setting next->locked = 1 when we're the lock owner.
+ *
+ * Instead of waking the waiters stuck in pv_wait_node() advance their state such
+ * that they're waiting in pv_wait_head(), this avoids a wake/sleep cycle.
*/
-static void pv_kick_node(struct mcs_spinlock *node)
+static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
{
struct pv_node *pn = (struct pv_node *)node;
+ struct __qspinlock *l = (void *)lock;

/*
- * Note that because node->locked is already set, this actual
- * mcs_spinlock entry could be re-used already.
+ * If the vCPU is indeed halted, advance its state to match that of
+ * pv_wait_node(). If OTOH this fails, the vCPU was running and will
+ * observe its next->locked value and advance itself.
*
- * This should be fine however, kicking people for no reason is
- * harmless.
+ * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
+ */
+ if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
+ return;
+
+ /*
+ * Put the lock into the hash table and set the _Q_SLOW_VAL.
*
- * See the comment in pv_wait_node().
+ * As this is the same vCPU that will check the _Q_SLOW_VAL value and
+ * the hash table later on at unlock time, no atomic instruction is
+ * needed.
*/
- if (xchg(&pn->state, vcpu_running) == vcpu_halted)
- pv_kick(pn->cpu);
+ WRITE_ONCE(l->locked, _Q_SLOW_VAL);
+ (void)pv_hash(lock, pn);
}

/*
@@ -232,6 +251,13 @@ static void pv_wait_head(struct qspinloc
struct qspinlock **lp = NULL;
int loop;

+ /*
+ * If pv_kick_node() already advanced our state, we don't need to
+ * insert ourselves into the hash table anymore.
+ */
+ if (READ_ONCE(pn->state) == vcpu_hashed)
+ lp = (struct qspinlock **)1;
+
for (;;) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (!READ_ONCE(l->locked))
@@ -239,9 +265,16 @@ static void pv_wait_head(struct qspinloc
cpu_relax();
}

- WRITE_ONCE(pn->state, vcpu_halted);
+ /*
+ * Either pv_kick_node() advanced us and ->state is already
+ * vcpu_hashed and this store is superfluous, or this is the
+ * first in which case the below cmpxchg() provides the
+ * required barriers.
+ */
+ WRITE_ONCE(pn->state, vcpu_hashed);
if (!lp) { /* ONCE */
lp = pv_hash(lock, pn);
+
/*
* lp must be set before setting _Q_SLOW_VAL
*
@@ -310,8 +343,11 @@ __visible void __pv_queued_spin_unlock(s
/*
* At this point the memory pointed at by lock can be freed/reused,
* however we can still use the pv_node to kick the CPU.
+ * The other vCPU may not really be halted, but kicking an active
+ * vCPU is harmless other than the additional latency in completing
+ * the unlock.
*/
- if (READ_ONCE(node->state) == vcpu_halted)
+ if (READ_ONCE(node->state) == vcpu_hashed)
pv_kick(node->cpu);
}
/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Peter Zijlstra: "Re: [PATCH 2/7] locking/pvqspinlock: Allow vCPUs kick-ahead"
Previous message: Paolo Bonzini: "Re: [v5 15/19] KVM: eventfd: add irq bypass consumer management"
In reply to: Peter Zijlstra: "Re: [PATCH 1/7] locking/pvqspinlock: Only kick CPU at unlock time"
Next in thread: Waiman Long: "[PATCH 2/7] locking/pvqspinlock: Allow vCPUs kick-ahead"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]