[PATCH] locking/qrwlock: Allow multiple spinning readers

From: Waiman Long
Date: Sat Mar 19 2016 - 23:22:05 EST


In qrwlock, the reader that is spinning on the lock has to notify the
next reader in the queue when the lock becomes free. That introduces a
reader-to-reader latency that is not present in the original rwlock.
It is the price paid for reducing lock cacheline contention, and it also
reduces the performance benefit of qrwlock on reader-heavy workloads.

However, if we allow a limited number of readers to spin on the
lock simultaneously, we can eliminate some of the reader-to-reader
latency at the expense of a bit more cacheline contention and
probably more power consumption.

This patch changes the reader slowpath to allow multiple readers to
spin on the lock. The maximum number of concurrent spinning readers
is currently set to 4 to limit the additional cacheline contention
while still improving reader performance on most workloads. Once a
writer reaches the head of the queue, however, no further readers are
released from the queue to spin on the lock.
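
For illustration only, below is a rough user-space sketch in C11 of the
bounded-spinning-readers idea. It is not the patch itself: the toy_qrwlock
type, toy_read_lock_slowpath() and the simplified lock-word layout are
made up for this sketch, and a pthread mutex stands in for the qrwlock
wait queue (it does not provide the FIFO ordering of the kernel's queue).

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define _QW_LOCKED              0xff            /* writer holds the lock (simplified) */
#define _QR_SHIFT               8               /* reader count starts at bit 8 */
#define _QR_BIAS                (1u << _QR_SHIFT)
#define MAX_SPINNING_READERS    4               /* cap on concurrently spinning readers */

struct toy_qrwlock {
        atomic_uint cnts;                       /* writer byte + reader count */
        pthread_mutex_t wait_lock;              /* stands in for the qrwlock wait queue */
};

/* Reader slow path: queue up, then spin, letting a few more readers spin too. */
static void toy_read_lock_slowpath(struct toy_qrwlock *lock)
{
        unsigned int cnts;
        int queued = 1;                         /* still holding the wait queue? */

        /* Wait for our turn in the queue. */
        pthread_mutex_lock(&lock->wait_lock);

        /* Announce ourselves as a reader; cnts is the value before our increment. */
        cnts = atomic_fetch_add(&lock->cnts, _QR_BIAS);
        while (cnts & _QW_LOCKED) {
                /*
                 * While fewer than MAX_SPINNING_READERS readers are spinning,
                 * hand the queue to the next waiter so it can spin as well.
                 */
                if (queued && (cnts >> _QR_SHIFT) < MAX_SPINNING_READERS) {
                        pthread_mutex_unlock(&lock->wait_lock);
                        queued = 0;
                }
                cnts = atomic_load(&lock->cnts);
        }

        /* If the queue was never handed off above, do it now. */
        if (queued)
                pthread_mutex_unlock(&lock->wait_lock);
}

static void toy_read_unlock(struct toy_qrwlock *lock)
{
        atomic_fetch_sub(&lock->cnts, _QR_BIAS);
}

int main(void)
{
        struct toy_qrwlock lock = {
                .cnts = 0,
                .wait_lock = PTHREAD_MUTEX_INITIALIZER,
        };

        /* No writer is holding the lock, so this returns immediately. */
        toy_read_lock_slowpath(&lock);
        toy_read_unlock(&lock);
        printf("final cnts = %u\n", atomic_load(&lock.cnts));
        return 0;
}

Compile with e.g. "cc -pthread"; main() only exercises the uncontended
path to show that the queue hand-off bookkeeping balances out.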

Using a multi-threaded locking microbenchmark on a 4-socket, 40-core
Haswell-EX system, the locking throughput of a 4.5-rc6 kernel with and
without the patch was as follows:

1) All locking threads run in the same socket

  r:w    # of    locking rate  locking rate  % change
 ratio  threads   w/o patch     with patch
 ------ -------  ------------  ------------  --------
  1:1      2      6867 Mop/s    6125 Mop/s    -10.8%
           4      5762 Mop/s    5065 Mop/s    -12.1%
           6      6211 Mop/s    6876 Mop/s    +10.7%
           8      6102 Mop/s    6954 Mop/s    +14.0%
 10:1      2      8808 Mop/s    9446 Mop/s    + 7.2%
           4      6034 Mop/s    6199 Mop/s    + 2.7%
           6      6862 Mop/s    7702 Mop/s    +12.2%
           8      6693 Mop/s    7808 Mop/s    +16.7%
 50:1      2     10530 Mop/s   11028 Mop/s    + 4.7%
           4      9065 Mop/s    9015 Mop/s    - 0.6%
           6      9659 Mop/s   10669 Mop/s    +10.5%
           8      8927 Mop/s   10441 Mop/s    +17.0%

2) 2 locking threads per socket (8 threads total)

  r:w    locking rate  locking rate  % change
 ratio    w/o patch     with patch
 ------  ------------  ------------  --------
  10:1    7027 Mop/s    8012 Mop/s    +14.0%
  50:1    9986 Mop/s   11573 Mop/s    +15.9%

With a moderately contended qrwlock (2 or 4 threads), the performance
results were mixed, and the microbenchmark showed larger run-to-run
variation at that contention level. At higher contention levels,
however, the patch gained more than 10% in locking throughput.
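
The microbenchmark itself is not included in this patch. As a rough
sketch of the kind of r:w mixed workload it runs, the following
user-space loop (using pthread_rwlock_t in place of the in-kernel
qrwlock, with arbitrary NTHREADS/READ_RATIO/ITERATIONS values) issues
READ_RATIO read-lock acquisitions for every write; a real benchmark
would additionally time the loop and report the rate in Mop/s:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS    4
#define READ_RATIO  10          /* 10 reads per write, i.e. r:w = 10:1 */
#define ITERATIONS  100000

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static atomic_long total_ops;
static long shared_counter;     /* data protected by the lock */

static void *worker(void *arg)
{
        long ops = 0;

        for (long i = 0; i < ITERATIONS; i++) {
                if (i % (READ_RATIO + 1) == 0) {
                        /* 1 out of every READ_RATIO + 1 operations is a write */
                        pthread_rwlock_wrlock(&lock);
                        shared_counter++;
                        pthread_rwlock_unlock(&lock);
                } else {
                        pthread_rwlock_rdlock(&lock);
                        (void)shared_counter;   /* read-side critical section */
                        pthread_rwlock_unlock(&lock);
                }
                ops++;
        }
        atomic_fetch_add(&total_ops, ops);
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);

        printf("total lock operations: %ld\n", atomic_load(&total_ops));
        return 0;
}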

Signed-off-by: Waiman Long <Waiman.Long@xxxxxxx>
---
kernel/locking/qrwlock.c | 53 ++++++++++++++++++++++++++++-----------------
1 files changed, 33 insertions(+), 20 deletions(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index fec0823..bec6d5f 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -23,6 +23,14 @@
#include <asm/qrwlock.h>

/*
+ * More than one reader will be allowed to spin on the lock waiting for
+ * the writer to exit. Allowing more spinning readers reduces reader lock
+ * acquisition latency at the cost of more cacheline contention and
+ * probably more power consumption.
+ */
+#define MAX_SPINNING_READERS 4
+
+/*
* This internal data structure is used for optimizing access to some of
* the subfields within the atomic_t cnts.
*/
@@ -43,29 +51,14 @@ struct __qrwlock {
};

/**
- * rspin_until_writer_unlock - inc reader count & spin until writer is gone
- * @lock : Pointer to queue rwlock structure
- * @writer: Current queue rwlock writer status byte
- *
- * In interrupt context or at the head of the queue, the reader will just
- * increment the reader count & wait until the writer releases the lock.
- */
-static __always_inline void
-rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
-{
- while ((cnts & _QW_WMASK) == _QW_LOCKED) {
- cpu_relax_lowlatency();
- cnts = atomic_read_acquire(&lock->cnts);
- }
-}
-
-/**
* queued_read_lock_slowpath - acquire read lock of a queue rwlock
* @lock: Pointer to queue rwlock structure
* @cnts: Current qrwlock lock value
*/
void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
{
+ bool locked = true;
+
/*
* Readers come here when they cannot get the lock without waiting
*/
@@ -78,7 +71,10 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
* semantics) until the lock is available without waiting in
* the queue.
*/
- rspin_until_writer_unlock(lock, cnts);
+ while ((cnts & _QW_WMASK) == _QW_LOCKED) {
+ cpu_relax_lowlatency();
+ cnts = atomic_read_acquire(&lock->cnts);
+ }
return;
}
atomic_sub(_QR_BIAS, &lock->cnts);
@@ -92,14 +88,31 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
* The ACQUIRE semantics of the following spinning code ensure
* that accesses can't leak upwards out of our subsequent critical
* section in the case that the lock is currently held for write.
+ *
+ * The reader increments the reader count & waits until the writer
+ * releases the lock.
*/
cnts = atomic_add_return_acquire(_QR_BIAS, &lock->cnts) - _QR_BIAS;
- rspin_until_writer_unlock(lock, cnts);
+ while ((cnts & _QW_WMASK) == _QW_LOCKED) {
+ if (locked && ((cnts >> _QR_SHIFT) < MAX_SPINNING_READERS)) {
+ /*
+ * Unlock the wait queue so that more readers can
+ * come forward and wait for the writer to exit,
+ * as long as no more than MAX_SPINNING_READERS
+ * readers are present.
+ */
+ arch_spin_unlock(&lock->wait_lock);
+ locked = false;
+ }
+ cpu_relax_lowlatency();
+ cnts = atomic_read_acquire(&lock->cnts);
+ }

/*
* Signal the next one in queue to become queue head
*/
- arch_spin_unlock(&lock->wait_lock);
+ if (locked)
+ arch_spin_unlock(&lock->wait_lock);
}
EXPORT_SYMBOL(queued_read_lock_slowpath);

--
1.7.1