[PATCH-tip v5 17/21] TP-futex: Group readers together in wait queue

From: Waiman Long
Date: Fri Feb 03 2017 - 13:04:39 EST


All the TP futex lock waiters are serialized in the kernel using a
kernel mutex which acts like a wait queue. The order at which the
waiters popped out from the wait queue will affect performance when
exclusive (writer) and shared (reader) lock waiters are mixed in the
queue. The worst case scenarios will be something like RWRWRW... in
term of ordering where no parallelism in term of lock ownership
can happen.

To improve throughput, the readers are now grouped together as a single
entity in the wait queue. The first reader that enters the mutex wait
queue will become the leader of the group. The other readers will
spin on the group leader via an OSQ lock. When the futex is put in
the shared mode, either by the group leader or by an external reader
the spinning readers in the reader group will then acquire the read
lock successively.

The spinning readers in the group will get disbanded when the group
leader goes to sleep. In this case, all those readers will go into the
mutex wait queue alone and wait for their turn to acquire the TP futex.

On a 2-socket 36-core E5-2699 v3 system (HT off) running on a 4.10
based kernel, the performance of TP rwlock with 1:1 reader/writer
ratio versus one based on the wait-wake futexes as well as the Glibc
rwlock with a microbenchmark (1 worker thread per cpu core) running
for 10s were as follows:

WW futex TP futex Glibc
-------- -------- -----
Total locking ops 35,707,234 58,645,434 10,930,422
Per-thread avg/sec 99,149 162,887 30,362
Per-thread min/sec 93,190 38,641 29,872
Per-thread max/sec 104,213 225,983 30,708
Write lock futex calls 11,161,534 14,094 -
Write unlock futex calls 8,696,121 167 -
Read lock futex calls 1,717,863 6,659 -
Read unlock futex calls 4,316,418 323 -

It can be seen that the throughput of the TP futex is close to 2X
the WW futex and almost 6X the Glibc version in this particular case.

The following table shows the CPU cores scaling for the average
per-thread locking rates (ops/sec):

WW futex TP futex Glibc
-------- -------- -----
9 threads (1 socket) 422,014 647,006 190,236
18 threads (2 sockets) 197,145 330,353 66,934
27 threads (2 sockets) 127,947 213,417 43,641
36 threads (2 sockets) 99,149 162,887 30,362

The following tables shows the average per-thread locking rates
(36 threads) with different reader percentages:

WW futex TP futex Glibc
-------- -------- -----
90% readers 124,426 159,657 60,208
95% readers 148,905 152,315 68,666
100% reader 210,029 191,759 84,316

The rwlocks based on the WW futexes and Glibc prefer readers by
default. So it can be seen that the performance of the WW futex and
Glibc rwlocks increased with higher reader percentages. The TP futex
rwlock, however, prefers writers a bit more than readers. So the
performance didn't increase as the reader percentage rises.

The WW futexes and Glibc rwlocks also have a writer-preferring version.
Their performance with the same tests are as follows:

WW futex TP futex Glibc
-------- -------- -----
90% readers 87,866 159,657 26,057
95% readers 93,193 152,315 32,611
100% reader 193,267 191,759 88,440

With separate 18 reader and 18 writer threads, the the average
per-thread reader and writer locking rates with different load
latencies (L, default = 1) and reader-preferring rwlocks are:

WW futex TP futex Glibc
-------- -------- -----
Reader rate, L=1 381,411 74,059 164,184
Writer rate, L=1 0 240,841 0

Reader rate, L=5 330,732 57,361 150,691
Writer rate, L=5 0 175,400 0

Reader rate, L=50 304,505 17,355 97,066
Writer rate, L=50 0 114,504 0

The corresponding locking rates with writer-preferring rwlocks are:

WW futex TP futex Glibc
-------- -------- -----
Reader rate, L=1 138,805 74,059 54
Writer rate, L=1 31,113 240,841 56,424

Reader rate, L=5 114,414 57,361 24
Writer rate, L=5 28,062 175,400 52,039

Reader rate, L=50 88,619 51,483 5
Writer rate, L=50 21,005 98,005 49,885

With both the WW futex and Glibc rwlocks, lock starvation happened
for the writers with reader-preferring rwlocks. For writer preferring
rwlocks, the WW futex one fared better. The Glibc one, however,
was close to starving the readers.

The TP futex prefers writer in general, but the actual preference
depends on the timing. Lock starvation should not happen on the TP
futexes as long as the underlying kernel mutex is lock starvation
free which is the case for 4.10 and later kernel.

Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
kernel/futex.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 133 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 90c8c80..16c63b8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -71,6 +71,7 @@
#include <linux/freezer.h>
#include <linux/bootmem.h>
#include <linux/fault-inject.h>
+#include <linux/osq_lock.h>

#include <asm/futex.h>

@@ -241,6 +242,16 @@ struct futex_state {
u32 handoff_pid; /* For TP futexes only */
int locksteal_disabled; /* For TP futexes only */

+ /*
+ * To improve reader throughput in TP futexes, all the readers
+ * in the mutex queue are grouped together. The first reader in the
+ * queue will set first_reader, then the rest of the readers will
+ * spin on the first reader via the OSQ without actually entering
+ * the mutex queue.
+ */
+ struct optimistic_spin_queue reader_osq;
+ struct task_struct *first_reader;
+
enum futex_type type;
union futex_key key;
};
@@ -3390,14 +3401,21 @@ void exit_robust_list(struct task_struct *curr)
* 0 - steals the lock
* 1 - top waiter (mutex owner) acquires the lock
* 2 - handed off the lock
- * 2) bits 08-15: reserved
- * 3) bits 15-30: how many times the task has slept or yield to scheduler
+ * 2) bit 08: 1 if reader spins alone (shared lock only)
+ * bit 09: 1 if reader is a spin group leader (shared lock only)
+ * bits 10-16: reserved
+ * 3) bits 16-30: how many times the task has slept or yield to scheduler
* in futex_spin_on_owner().
*/
#define TP_LOCK_STOLEN 0
#define TP_LOCK_ACQUIRED 1
#define TP_LOCK_HANDOFF 2
+
+#define TP_READER_ALONE 1
+#define TP_READER_GROUP 2
+
#define TP_STATUS_SLEEP(val, sleep) ((val)|((sleep) << 16))
+#define TP_STATUS_ALONE(val, alone) ((val)|((alone) << 8))

/**
* lookup_futex_state - Looking up the futex state structure.
@@ -3440,6 +3458,7 @@ void exit_robust_list(struct task_struct *curr)
state->type = TYPE_TP;
state->key = *key;
state->locksteal_disabled = false;
+ osq_lock_init(&state->reader_osq);
list_add(&state->fs_list, &hb->fs_head);
WARN_ON(atomic_read(&state->refcount) != 1);

@@ -3897,6 +3916,98 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
return (ret < 0) ? ret : TP_STATUS_SLEEP(ret, nsleep);
}

+/**
+ * futex_spin_on_reader - Optimistically spin on first reader
+ * @uaddr: futex address
+ * @pfirst: pointer to first reader
+ * @state: futex state object
+ * @timeout: hrtimer_sleeper structure
+ *
+ * Reader performance will depend on the placement of readers within the
+ * mutex queue. For a queue of 4 readers and 4 writers, for example, the
+ * optimal placement will be either RRRRWWWW or WWWWRRRR. A worse case will
+ * be RWRWRWRW.
+ *
+ * One way to avoid the worst case scenario is to gather all the readers
+ * together as a single unit and place it into the mutex queue. This is done
+ * by having the first reader puts its task structure into state->first_reader
+ * and the rest of the readers optimistically spin on it instead of entering
+ * the mutex queue.
+ *
+ * On exit from this function, the reader would have
+ * 1) acquired a read lock on the futex;
+ * 2) become the first reader and goes into the mutex queue; or
+ * 3) seen the first reader slept and needs to go into the mutex queue alone.
+ *
+ * Any fault on accessing the futex will cause it to return 0 and goes into
+ * the mutex queue.
+ *
+ * Return: > 0 - acquired the read lock
+ * <= 0 - need to goes into the mutex queue
+ */
+static int futex_spin_on_reader(u32 __user *uaddr, struct task_struct **pfirst,
+ struct futex_state *state,
+ struct hrtimer_sleeper *timeout)
+{
+ struct task_struct *first_reader;
+ int ret = 0;
+
+ preempt_disable();
+
+retry:
+ first_reader = READ_ONCE(state->first_reader);
+ if (!first_reader)
+ first_reader = cmpxchg(&state->first_reader, NULL, current);
+ if (!first_reader)
+ goto out; /* Became the first reader */
+
+ if (!osq_lock(&state->reader_osq))
+ goto reschedule;
+
+ for (;;) {
+ u32 uval;
+
+ if (!state->locksteal_disabled) {
+ ret = futex_trylock_preempt_disabled(uaddr,
+ FUTEX_SHARED, &uval);
+ /*
+ * Return if lock acquired or an error happened
+ */
+ if (ret)
+ break;
+ }
+
+ /*
+ * Reread the first reader value again.
+ */
+ first_reader = READ_ONCE(state->first_reader);
+ if (!first_reader)
+ first_reader = cmpxchg(&state->first_reader, NULL,
+ current);
+ if (!first_reader || !first_reader->on_cpu)
+ break;
+
+ if (need_resched()) {
+ osq_unlock(&state->reader_osq);
+ goto reschedule;
+ }
+
+ cpu_relax();
+ }
+ osq_unlock(&state->reader_osq);
+out:
+ *pfirst = first_reader;
+ preempt_enable();
+ return ret;
+
+reschedule:
+ /*
+ * Yield the CPU and retry later.
+ */
+ schedule_preempt_disabled();
+ goto retry;
+}
+
/*
* Userspace tried a 0 -> TID atomic transition of the futex value
* and failed. The kernel side here does the whole locking operation.
@@ -3923,9 +4034,11 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
{
struct hrtimer_sleeper timeout, *to;
struct futex_hash_bucket *hb;
+ struct task_struct *first_reader = NULL;
union futex_key key = FUTEX_KEY_INIT;
struct futex_state *state;
u32 uval, vpid = shared ? FUTEX_SHARED : task_pid_vnr(current);
+ int alone = 0;
int ret;

/*
@@ -3991,6 +4104,17 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);

/*
+ * Spin on the first reader and return if we acquired the read lock.
+ */
+ if (shared) {
+ int rspin_ret = futex_spin_on_reader(uaddr, &first_reader,
+ state, to);
+ if (rspin_ret > 0)
+ goto out_put_state_key;
+ alone = first_reader ? TP_READER_ALONE : TP_READER_GROUP;
+ }
+
+ /*
* Acquiring the serialization mutex.
*
* If we got a signal or has some other error, we need to abort
@@ -4018,6 +4142,12 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
mutex_unlock(&state->mutex);

out_put_state_key:
+ /*
+ * We will be the first reader if (first_reader == NULL).
+ */
+ if (shared && !first_reader)
+ WRITE_ONCE(state->first_reader, NULL);
+
if (!put_futex_state_unlocked(state)) {
/*
* May need to free the futex state object and so must be
@@ -4034,7 +4164,7 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
hrtimer_cancel(&to->timer);
destroy_hrtimer_on_stack(&to->timer);
}
- return ret;
+ return (ret < 0) ? ret : TP_STATUS_ALONE(ret, alone);
}

/*
--
1.8.3.1