[PATCH RFC 1/2] qrwlock: A queue read/write lock implementation

From: Waiman Long
Date: Fri Jul 12 2013 - 21:35:09 EST


This patch introduces a read/write lock implementation that puts
waiting readers and writers into a queue instead of having them
actively contend for the lock like the regular read/write lock. This
improves performance in highly contended situations by reducing cache
line bouncing.

In addition, the queue read/write lock is more deterministic. There
is still a small chance of lock stealing if a reader or writer comes
at the right moment, but other than that, lock granting is done in
FIFO order. The only downside is that the lock structure grows by
4 bytes on 32-bit systems and by 12 bytes on 64-bit systems.

This patch allows the replacement of architecture-specific
implementations of read/write lock by this generic version of the
queue read/write lock. Two new config parameters are introduced:

1. QUEUE_RWLOCK
A selectable option that enables the building and use of the generic
queue read/write lock in place of the architecture-specific one.
2. ARCH_QUEUE_RWLOCK
Must be defined in arch/$(arch)/Kconfig to make QUEUE_RWLOCK
selectable, as sketched below.
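
A hypothetical sketch of such an arch Kconfig fragment (the exact
form and placement within the arch Kconfig file may differ):

	config ARCH_QUEUE_RWLOCK
		def_bool y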

In terms of single-thread performance (no contention), a 256K
lock/unlock loop was run on a 2.4GHz Westmere x86-64 CPU. The following
table shows the average time for a single lock/unlock sequence:

Lock Type          Time (ns)
---------          ---------
Ticket spinlock      15.7
Read lock            17.0
Write lock           17.2
Queue read lock      31.1
Queue write lock     13.6

While the queue read lock takes almost twice as long as a read lock
or spinlock, the queue write lock is the fastest of them all. The
execution time could probably be reduced a bit by allowing inlining of
the lock fast paths as is done for the other locks.
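
For reference, a minimal sketch of the kind of kernel-side timing
loop used for such a measurement (hypothetical; the actual test
harness is not part of this patch, and the lock variable is assumed
to be initialized with __ARCH_RW_LOCK_UNLOCKED):

	u64 start, elapsed;
	int i;

	start = sched_clock();
	for (i = 0; i < 256 * 1024; i++) {
		queue_write_lock(&lock);
		queue_write_unlock(&lock);
	}
	elapsed = sched_clock() - start;
	do_div(elapsed, 256 * 1024);	/* avg. ns per lock/unlock pair */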

To see how the queue write lock can be used as a replacement for a
ticket spinlock (just as rwsem can be used as a replacement for a
mutex), the mb_cache_spinlock in fs/mbcache.c, which is a bottleneck
in the disk workload (ext4 FS) of the AIM7 benchmark, was converted to
both a queue write lock and a regular write lock. When running on an
8-socket 80-core DL980 system, the performance improvement is shown in
the table below.

+-----------------+------------+------------+-----------+---------+
| Configuration | Mean JPM | Mean JPM | Mean JPM | qrwlock |
| |Vanilla 3.10|3.10-qrwlock|3.10-rwlock| %Change |
+-----------------+-----------------------------------------------+
| | User Range 10 - 100 |
+-----------------+-----------------------------------------------+
| 8 nodes, HT off | 441374 | 532774 | 637205 | +20.7% |
| 8 nodes, HT on | 449373 | 584387 | 641965 | +30.1% |
+-----------------+-----------------------------------------------+
| | User Range 200 - 1000 |
+-----------------+-----------------------------------------------+
| 8 nodes, HT off | 226870 | 354490 | 371593 | +56.3% |
| 8 nodes, HT on | 205997 | 314891 | 306378 | +52.9% |
+-----------------+-----------------------------------------------+
| | User Range 1100 - 2000 |
+-----------------+-----------------------------------------------+
| 8 nodes, HT off | 208234 | 321420 | 343694 | +54.4% |
| 8 nodes, HT on | 199612 | 297853 | 252569 | +49.2% |
+-----------------+------------+------------+-----------+---------+

Interestingly, the regular read/write lock performs even better than
the queue read/write lock in some cases. This is probably because
mb_cache_spinlock sits in a separate cache line from the data being
manipulated.
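
The conversion itself is mechanical. A hypothetical sketch (not the
actual fs/mbcache.c change; the renamed mb_cache_rwlock is for
illustration only) looks like this:

	-static DEFINE_SPINLOCK(mb_cache_spinlock);
	+static DEFINE_RWLOCK(mb_cache_rwlock);

	-	spin_lock(&mb_cache_spinlock);
	+	write_lock(&mb_cache_rwlock);

	-	spin_unlock(&mb_cache_spinlock);
	+	write_unlock(&mb_cache_rwlock);

With QUEUE_RWLOCK enabled, the write_lock/write_unlock calls then go
through the queue write lock code.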

Looking at the fserver and new_fserver workloads (ext4 FS), where the
journal->j_state_lock (a read/write lock) is a bottleneck especially
when HT is on, we see a slightly different story. The j_state_lock is
an embedded read/write lock that sits in the same cacheline as some
of the data being manipulated. Replacing it with a queue read/write
lock gives the following improvement:

+--------------------+-------------+--------------+---------------+
| Workload |mean % change|mean % change |mean % change |
| |10-100 users |200-1000 users|1100-2000 users|
+--------------------+-------------+--------------+---------------+
|fserver (HT off) | +0.3% | -0.1% | +1.9% |
|fserver (HT on) | -0.1% | +32.2% | +34.7% |
|new_fserver (HT on) | +0.8% | +0.9% | +0.9% |
|new_fserver (HT off)| -1.2% | +29.8% | +40.5% |
+--------------------+-------------+--------------+---------------+

Signed-off-by: Waiman Long <Waiman.Long@xxxxxx>
---
include/asm-generic/qrwlock.h | 124 +++++++++++++++++++++
lib/Kconfig | 11 ++
lib/Makefile | 1 +
lib/qrwlock.c | 246 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 382 insertions(+), 0 deletions(-)
create mode 100644 include/asm-generic/qrwlock.h
create mode 100644 lib/qrwlock.c

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
new file mode 100644
index 0000000..d758dd0
--- /dev/null
+++ b/include/asm-generic/qrwlock.h
@@ -0,0 +1,124 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@xxxxxx>
+ */
+#ifndef __ASM_GENERIC_QRWLOCK_H
+#define __ASM_GENERIC_QRWLOCK_H
+
+#include <linux/types.h>
+#include <asm/cmpxchg.h>
+#include <asm/barrier.h>
+
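+/*
+ * With fewer than 64K possible CPUs, a 16-bit reader count suffices and
+ * the combined reader/writer pair fits into a 32-bit word; otherwise a
+ * 32-bit count is used and the pair takes 64 bits.
+ */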
+#if (CONFIG_NR_CPUS < 65536)
+typedef u16 __nrcpu_t;
+typedef u32 __nrcpupair_t;
+#else
+typedef u32 __nrcpu_t;
+typedef u64 __nrcpupair_t;
+#endif
+
+/*
+ * The queue read/write lock data structures
+ *
+ * The reader stealing flag, if set, allows a reader at the head of the
+ * waiting queue to steal the read lock even if a writer is waiting.
+ * However, that may cause a writer to be starved by a perpetual stream
+ * of readers.
+ */
+struct qrwnode {
+ struct qrwnode *next;
+ bool wait; /* Waiting flag */
+};
+
+typedef struct qrwlock {
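+ /*
+ * Combining writer, rsteal & readers into a single word (rw)
+ * allows the fast paths to operate on all of them with one
+ * atomic cmpxchg().
+ */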
+ union {
+ __nrcpupair_t rw; /* Reader/writer number pair */
+ struct {
+ u8 writer; /* Set if a writer is waiting */
+ u8 rsteal; /* Reader stealing flag */
+ __nrcpu_t readers; /* Number of active readers */
+ };
+ };
+ struct qrwnode *waitq; /* Tail of waiting queue */
+} arch_rwlock_t;
+
+/**
+ * queue_read_can_lock - would read_trylock() succeed?
+ * @lock: Pointer to queue read/write lock structure
+ */
+static inline int queue_read_can_lock(struct qrwlock *lock)
+{
+ return lock->writer == 0;
+}
+
+/**
+ * queue_write_can_lock - would write_trylock() succeed?
+ * @lock: Pointer to queue read/write lock structure
+ */
+static inline int queue_write_can_lock(struct qrwlock *lock)
+{
+ return lock->rw == 0;
+}
+
+/**
+ * queue_read_unlock - release read lock of a queue read/write lock
+ * @lock : Pointer to queue read/write lock structure
+ */
+static inline void queue_read_unlock(struct qrwlock *lock)
+{
+ /*
+ * Atomically decrement the reader count
+ */
+ add_smp(&lock->readers, -1);
+}
+
+/**
+ * queue_write_unlock - release write lock of a queue read/write lock
+ * @lock : Pointer to queue read/write lock structure
+ */
+static inline void queue_write_unlock(struct qrwlock *lock)
+{
+ /*
+ * Make sure that all writes in the critical section are visible
+ * before the write lock is released.
+ */
+ smp_wmb();
+ ACCESS_ONCE(lock->writer) = 0;
+}
+
+/*
+ * External function declarations
+ */
+extern void queue_read_lock(struct qrwlock *);
+extern void queue_write_lock(struct qrwlock *);
+extern int queue_read_trylock(struct qrwlock *);
+extern int queue_write_trylock(struct qrwlock *);
+
+/*
+ * Initializer
+ */
+#define __ARCH_RW_LOCK_UNLOCKED { { .rw = 0 }, .waitq = NULL }
+#define __ARCH_RW_LOCK_UNLOCKED_RSTEAL \
+ { { .writer = 0, .rsteal = 1, .readers = 0 }, .waitq = NULL }
+
+/*
+ * Remapping read/write lock architecture specific functions to the
+ * corresponding queue read/write lock functions.
+ */
+#define arch_read_can_lock(l) queue_read_can_lock(l)
+#define arch_write_can_lock(l) queue_write_can_lock(l)
+#define arch_read_lock(l) queue_read_lock(l)
+#define arch_write_lock(l) queue_write_lock(l)
+#define arch_read_trylock(l) queue_read_trylock(l)
+#define arch_write_trylock(l) queue_write_trylock(l)
+#define arch_read_unlock(l) queue_read_unlock(l)
+#define arch_write_unlock(l) queue_write_unlock(l)
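+
+/*
+ * With these mappings in place, generic kernel code that uses the
+ * usual rwlock API (read_lock/read_unlock/write_lock/write_unlock)
+ * transparently operates on queue read/write locks.
+ */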
+
+#endif /* __ASM_GENERIC_QRWLOCK_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 35da513..de32799 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -412,6 +412,17 @@ config SIGNATURE
Implementation is done using GnuPG MPI library

#
+# Generic queue read/write lock
+#
+config QUEUE_RWLOCK
+ bool "Generic queue read/write lock"
+ depends on ARCH_QUEUE_RWLOCK
+ help
+ Use a NUMA-optimized queue read/write lock implementation. This
+ improves performance under lock contention on systems with more
+ than two sockets.
+
+#
# libfdt files, only selected if needed.
#
config LIBFDT
diff --git a/lib/Makefile b/lib/Makefile
index 7baccfd..2888c17 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -187,3 +187,4 @@ quiet_cmd_build_OID_registry = GEN $@
clean-files += oid_registry_data.c

obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
+obj-$(CONFIG_QUEUE_RWLOCK) += qrwlock.o
diff --git a/lib/qrwlock.c b/lib/qrwlock.c
new file mode 100644
index 0000000..a206fae
--- /dev/null
+++ b/lib/qrwlock.c
@@ -0,0 +1,246 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@xxxxxx>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <asm-generic/qrwlock.h>
+
+/*
+ * Compared with the regular read/write lock, the queue read/write lock
+ * has the following advantages:
+ * 1. It is more deterministic. Even though there is a slight chance of
+ * stealing the lock if one comes at the right moment, the granting of
+ * the lock is mostly in FIFO order.
+ * 2. It is faster in high contention situations.
+ *
+ * The only downside is that the lock is 4 bytes larger on 32-bit systems
+ * and 12 bytes larger on 64-bit systems.
+ *
+ * There are two queues for writers. The writer field of the lock is a
+ * one-slot waiting queue. The writers that follow have to wait in the
+ * combined reader/writer queue (waitq).
+ *
+ * Compared with the x86 ticket spinlock, the queue read/write lock is
+ * faster in high contention situations. The writer lock is also faster
+ * in single-thread operations. Therefore, the queue read/write lock can
+ * be considered a replacement for those spinlocks that are highly
+ * contended, as long as an increase in lock size is not an issue.
+ */
+
+/**
+ * wait_in_queue - Add to queue and wait until it is at the head
+ * @lock: Pointer to queue read/write lock structure
+ * @node: Node pointer to be added to the queue
+ */
+static __always_inline void
+wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
+{
+ struct qrwnode *prev;
+
+ node->next = NULL;
+ node->wait = true;
+ barrier();
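+ /*
+ * Atomically make this node the new tail of the waiting queue
+ */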
+ prev = xchg(&lock->waitq, node);
+ if (prev) {
+ prev->next = node;
+ smp_wmb();
+ /*
+ * Wait until the waiting flag is off
+ */
+ while (ACCESS_ONCE(node->wait))
+ cpu_relax();
+ } else {
+ node->wait = false;
+ }
+}
+
+/**
+ * signal_next - Signal the next one in queue to be at the head
+ * @lock: Pointer to queue read/write lock structure
+ * @node: Node pointer to the current head of queue
+ */
+static __always_inline void
+signal_next(struct qrwlock *lock, struct qrwnode *node)
+{
+ struct qrwnode *next;
+
+ /*
+ * Notify the next one in queue or clear the waiting queue
+ */
+ if (ACCESS_ONCE(lock->waitq) == node) {
+ if (cmpxchg(&lock->waitq, node, NULL) == node)
+ return;
+ }
+ /*
+ * Wait until the next one in queue sets up the next field
+ */
+ while (!(next = ACCESS_ONCE(node->next)))
+ cpu_relax();
+ /*
+ * The next one in queue is now at the head
+ */
+ ACCESS_ONCE(next->wait) = false;
+ smp_wmb();
+}
+
+/**
+ * queue_read_lock - acquire read lock of a queue read/write lock
+ * @lock: Pointer to queue read/write lock structure
+ */
+void queue_read_lock(struct qrwlock *lock)
+{
+ struct qrwlock old, new;
+ struct qrwnode node;
+
+ old.rw = ACCESS_ONCE(lock->rw);
+ while (likely(!old.writer)) {
+ /*
+ * Atomically increment the reader count while writer is 0
+ */
+ new.rw = old.rw;
+ new.readers++;
+
+ if (cmpxchg(&lock->rw, old.rw, new.rw) == old.rw)
+ return;
+ cpu_relax();
+ old.rw = ACCESS_ONCE(lock->rw);
+ }
+ /*
+ * Slowpath
+ * Put the reader into the waiting queue
+ */
+ wait_in_queue(lock, &node);
+
+ /*
+ * Now at the head of the wait queue; wait until no writer is pending,
+ * or until the reader stealing flag is set and readers are present.
+ * Then try to atomically increment the reader count.
+ */
+ while (true) {
+ old.rw = ACCESS_ONCE(lock->rw);
+ if (old.writer && (!old.rsteal || !old.readers)) {
+ cpu_relax();
+ continue;
+ }
+ new.rw = old.rw;
+ new.readers++;
+ if (cmpxchg(&lock->rw, old.rw, new.rw) == old.rw)
+ break;
+ cpu_relax();
+ }
+ signal_next(lock, &node);
+}
+EXPORT_SYMBOL(queue_read_lock);
+
+
+/**
+ * queue_read_trylock - try to acquire read lock of a queue read/write lock
+ * @lock : Pointer to queue read/write lock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+int queue_read_trylock(struct qrwlock *lock)
+{
+ struct qrwlock old, new;
+
+ old.rw = ACCESS_ONCE(lock->rw);
+ if (unlikely(old.writer))
+ return 0;
+ new.rw = old.rw;
+ new.readers++;
+
+ if (cmpxchg(&lock->rw, old.rw, new.rw) == old.rw)
+ return 1;
+ cpu_relax();
+ return 0;
+}
+EXPORT_SYMBOL(queue_read_trylock);
+
+/**
+ * queue_write_lock - acquire write lock of a queue read/write lock
+ * @lock : Pointer to queue read/write lock structure
+ */
+void queue_write_lock(struct qrwlock *lock)
+{
+ struct qrwnode node, *next;
+
+ if (likely(!ACCESS_ONCE(lock->writer))) {
+ /*
+ * Atomically set the writer to 1, then wait until reader
+ * count goes to 0.
+ */
+ if (xchg(&lock->writer, 1) == 0) {
+ while (ACCESS_ONCE(lock->readers))
+ cpu_relax();
+ return;
+ }
+ cpu_relax();
+ }
+ /*
+ * Slowpath
+ * Put the writer into the waiting queue
+ */
+ wait_in_queue(lock, &node);
+
+ /*
+ * Now at the head of the wait queue; wait until there is no pending
+ * writer, then atomically claim the writer slot again.
+ */
+ while (true) {
+ if (ACCESS_ONCE(lock->writer)) {
+ cpu_relax();
+ continue;
+ }
+ if (xchg(&lock->writer, 1) != 0) {
+ cpu_relax();
+ continue;
+ }
+ break;
+ }
+ /*
+ * Wait until the reader count goes to zero
+ */
+ while (ACCESS_ONCE(lock->readers))
+ cpu_relax();
+
+ signal_next(lock, &node);
+}
+EXPORT_SYMBOL(queue_write_lock);
+
+/**
+ * queue_write_trylock - try to acquire write lock of a queue read/write lock
+ * @lock : Pointer to queue read/write lock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+int queue_write_trylock(struct qrwlock *lock)
+{
+ struct qrwlock old, new;
+
+ old.rw = ACCESS_ONCE(lock->rw);
+ if (!old.rw) {
+ /*
+ * Atomically set the writer to 1 if there are no active readers
+ */
+ new.rw = old.rw;
+ new.writer = 1;
+ if (cmpxchg(&lock->rw, old.rw, new.rw) == old.rw)
+ return 1;
+ cpu_relax();
+ }
+ return 0;
+}
+EXPORT_SYMBOL(queue_write_trylock);
--
1.7.1
