Re: [PATCH] locking/qrwlock: Allow multiple spinning readers

From: Waiman Long
Date: Thu Mar 31 2016 - 18:12:59 EST

On 03/29/2016 04:20 PM, Peter Zijlstra wrote:
On Sat, Mar 19, 2016 at 11:21:19PM -0400, Waiman Long wrote:
In qrwlock, the reader that is spining on the lock will need to notify
the next reader in the queue when the lock is free. That introduces a
reader-to-reader latency that is not present in the original rwlock.
How did you find this 'problem'?

I am constantly on the lookout for twists that can make the code run faster. That change turn out to be good for reader performance and so I send it out to solicit feedback.

That is the price for reducing lock cacheline contention. It also
reduces the performance benefit of qrwlock on reader heavy workloads.

However, if we allow a limited number of readers to spin on the
lock simultaneously, we can eliminates some of the reader-to-reader
latencies at the expense of a bit more cacheline contention and
probably more power consumption.
So the embedded people might not like that much.

It could be. It is always a compromise.

This patch changes the reader slowpath to allow multiple readers to
spin on the lock. The maximum number of concurrent readers allowed
is currently set to 4 to limit the amount of additional cacheline
contention while improving reader performance on most workloads. If
a writer comes to the queue head, however, it will stop additional
readers from coming out.

Using a multi-threaded locking microbenchmark on a 4-socket 40-core
Haswell-EX system, the locking throughput of 4.5-rc6 kernel with or
without the patch were as follows:
Do you have an actual real world benchmark where this makes a

Not yet. Will look out for some real world workload.

* queued_read_lock_slowpath - acquire read lock of a queue rwlock
* @lock: Pointer to queue rwlock structure
* @cnts: Current qrwlock lock value
void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
+ bool locked = true;
* Readers come here when they cannot get the lock without waiting
@@ -78,7 +71,10 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
* semantics) until the lock is available without waiting in
* the queue.
+ while ((cnts& _QW_WMASK) == _QW_LOCKED) {
+ cpu_relax_lowlatency();
+ cnts = atomic_read_acquire(&lock->cnts);
+ }
@@ -92,14 +88,31 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
* The ACQUIRE semantics of the following spinning code ensure
* that accesses can't leak upwards out of our subsequent critical
* section in the case that the lock is currently held for write.
+ *
+ * The reader increments the reader count& wait until the writer
+ * releases the lock.
cnts = atomic_add_return_acquire(_QR_BIAS,&lock->cnts) - _QR_BIAS;
+ while ((cnts& _QW_WMASK) == _QW_LOCKED) {
+ if (locked&& ((cnts>> _QR_SHIFT)< MAX_SPINNING_READERS)) {
+ /*
+ * Unlock the wait queue so that more readers can
+ * come forward and waiting for the writer to exit
+ * as long as no more than MAX_SPINNING_READERS
+ * readers are present.
+ */
+ arch_spin_unlock(&lock->wait_lock);
+ locked = false;
Only 1 more can come forward with this logic. How can you ever get to 4?

Yes, each reader in the unlock path will release one in the queue. If the next one is also a reader, it will release one more and so on until the reader count reach 4 where the process will stop.

Also, what says the next in queue is a reader?

I did say in the changelog that the queue head could be a writer. In that case, the process will stop and the writer will wait until all the readers are gone.