[PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

From: Waiman Long
Date: Fri Nov 03 2017 - 11:36:17 EST


Currently, all the lock waiters entering the slowpath will do one
lock stealing attempt to acquire the lock. That helps performance,
especially in VMs with over-committed vCPUs. However, the current
pvqspinlocks still don't perform as good as unfair locks in many cases.
On the other hands, unfair locks do have the problem of lock starvation
that pvqspinlocks don't have.

This patch combines the best attributes of an unfair lock and a
pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
mode. A lock waiter goes into the unfair mode when there are waiters
in the wait queue but the pending bit isn't set. Otherwise, it will
go into the queued mode waiting in the queue for its turn.

On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
(make -j<n>) was done in a VM with unpinned vCPUs 3 times with the
best time selected and <n> is the number of vCPUs available. The build
times of the original pvqspinlock, hybrid pvqspinlock and unfair lock
with various number of vCPUs are as follows:

vCPUs pvqlock hybrid pvqlock unfair lock
----- ------- -------------- -----------
30 342.1s 329.1s 329.1s
36 314.1s 305.3s 307.3s
45 345.0s 302.1s 306.6s
54 365.4s 308.6s 307.8s
72 358.9s 293.6s 303.9s
108 343.0s 285.9s 304.2s

The hybrid pvqspinlock performs better or comparable to the unfair
lock.

By turning on QUEUED_LOCK_STAT, the table below showed the number
of lock acquisitions in unfair mode and queue mode after a kernel
build with various number of vCPUs.

vCPUs queued mode unfair mode
----- ----------- -----------
30 9,130,518 294,954
36 10,856,614 386,809
45 8,467,264 11,475,373
54 6,409,987 19,670,855
72 4,782,063 25,712,180

It can be seen that as the VM became more and more over-committed,
the ratio of locks acquired in unfair mode increases. This is all
done automatically to get the best overall performance as possible.

The table below shows the kernel build times on a smaller 2-socket
16-core 32-thread E5-2620 v4 system.

vCPUs pvqlock hybrid pvqlock unfair lock
----- ------- -------------- -----------
16 436.8s 433.4s 435.6s
36 366.2s 364.8s 364.5s
48 423.6s 376.3s 370.2s
64 433.1s 376.6s 376.8s

Again, the performance of the hybrid pvqspinlock was comparable to
that of the unfair lock.

Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
kernel/locking/qspinlock_paravirt.h | 28 +++++++++++++++++++++-------
1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 4355568..405a923 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -60,21 +60,35 @@ struct pv_node {
#include "qspinlock_stat.h"

/*
+ * Hybrid PV queued/unfair lock
+ *
* By replacing the regular queued_spin_trylock() with the function below,
* it will be called once when a lock waiter enter the PV slowpath before
- * being queued. By allowing one lock stealing attempt here when the pending
- * bit is off, it helps to reduce the performance impact of lock waiter
- * preemption without the drawback of lock starvation.
+ * being queued. By allowing lock stealing attempts here when there are
+ * waiters but the pending bit is off, it helps to reduce the performance
+ * impact of lock waiter preemption without the drawback of lock starvation.
*/
#define queued_spin_trylock(l) pv_queued_spin_steal_lock(l)
static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
{
struct __qspinlock *l = (void *)lock;

- if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
- (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
- qstat_inc(qstat_pv_lock_stealing, true);
- return true;
+ /*
+ * Stay in unfair lock mode as long as waiters are present but
+ * the pending bit isn't set.
+ */
+ for (;;) {
+ int val = atomic_read(&lock->val);
+
+ if (!(val & _Q_LOCKED_PENDING_MASK) &&
+ (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+ qstat_inc(qstat_pv_lock_stealing, true);
+ return true;
+ }
+ if (!(val & _Q_TAIL_MASK) || (val & _Q_PENDING_MASK))
+ break;
+
+ cpu_relax();
}

return false;
--
1.8.3.1