Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks
From: Paolo Bonzini
Date: Tue Nov 07 2017 - 06:20:19 EST
On 03/11/2017 16:35, Waiman Long wrote:
> Currently, all lock waiters entering the slowpath make one
> lock stealing attempt to acquire the lock. That helps performance,
> especially in VMs with over-committed vCPUs. However, the current
> pvqspinlocks still don't perform as well as unfair locks in many cases.
> On the other hand, unfair locks do have the problem of lock starvation
> that pvqspinlocks don't have.
>
> This patch combines the best attributes of an unfair lock and a
> pvqspinlock into a hybrid lock with two modes - queued mode and
> unfair mode. A lock waiter goes into unfair mode when there are
> waiters in the wait queue but the pending bit isn't set. Otherwise,
> it will go into queued mode and wait in the queue for its turn.
>
> On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
> (make -j<n>) was done in a VM with unpinned vCPUs 3 times with the
> best time selected, where <n> is the number of vCPUs available. The
> build times of the original pvqspinlock, hybrid pvqspinlock and unfair
> lock with various numbers of vCPUs are as follows:
>
> vCPUs pvqlock hybrid pvqlock unfair lock
> ----- ------- -------------- -----------
> 30 342.1s 329.1s 329.1s
> 36 314.1s 305.3s 307.3s
> 45 345.0s 302.1s 306.6s
> 54 365.4s 308.6s 307.8s
> 72 358.9s 293.6s 303.9s
> 108 343.0s 285.9s 304.2s
>
> The hybrid pvqspinlock performs better or comparable to the unfair
> lock.
>
> By turning on QUEUED_LOCK_STAT, the table below shows the number
> of lock acquisitions in unfair mode and queued mode after a kernel
> build with various numbers of vCPUs.
>
> vCPUs queued mode unfair mode
> ----- ----------- -----------
> 30 9,130,518 294,954
> 36 10,856,614 386,809
> 45 8,467,264 11,475,373
> 54 6,409,987 19,670,855
> 72 4,782,063 25,712,180
>
> It can be seen that as the VM becomes more and more over-committed,
> the ratio of locks acquired in unfair mode increases. This is all
> done automatically to get the best possible overall performance.
>
> The table below shows the kernel build times on a smaller 2-socket
> 16-core 32-thread E5-2620 v4 system.
>
> vCPUs pvqlock hybrid pvqlock unfair lock
> ----- ------- -------------- -----------
> 16 436.8s 433.4s 435.6s
> 36 366.2s 364.8s 364.5s
> 48 423.6s 376.3s 370.2s
> 64 433.1s 376.6s 376.8s
>
> Again, the performance of the hybrid pvqspinlock was comparable to
> that of the unfair lock.
>
> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
> ---
> kernel/locking/qspinlock_paravirt.h | 28 +++++++++++++++++++++-------
> 1 file changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> index 4355568..405a923 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -60,21 +60,35 @@ struct pv_node {
> #include "qspinlock_stat.h"
>
> /*
> + * Hybrid PV queued/unfair lock
> + *
> * By replacing the regular queued_spin_trylock() with the function below,
> * it will be called once when a lock waiter enter the PV slowpath before
> - * being queued. By allowing one lock stealing attempt here when the pending
> - * bit is off, it helps to reduce the performance impact of lock waiter
> - * preemption without the drawback of lock starvation.
> + * being queued. By allowing lock stealing attempts here when there are
> + * waiters but the pending bit is off, it helps to reduce the performance
> + * impact of lock waiter preemption without the drawback of lock starvation.
> */
> #define queued_spin_trylock(l) pv_queued_spin_steal_lock(l)
> static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
> {
> struct __qspinlock *l = (void *)lock;
>
> - if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> - (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> - qstat_inc(qstat_pv_lock_stealing, true);
> - return true;
> + /*
> + * Stay in unfair lock mode as long as waiters are present but
> + * the pending bit isn't set.
> + */
> + for (;;) {
> + int val = atomic_read(&lock->val);
> +
> + if (!(val & _Q_LOCKED_PENDING_MASK) &&
> + (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> + qstat_inc(qstat_pv_lock_stealing, true);
> + return true;
> + }
> + if (!(val & _Q_TAIL_MASK) || (val & _Q_PENDING_MASK))
> + break;
> +
> + cpu_relax();
> }
>
> return false;
>
Awesome! Thanks Waiman.
Paolo