Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

From: Waiman Long
Date: Mon Sep 26 2016 - 18:02:42 EST

On 09/23/2016 09:02 AM, Thomas Gleixner wrote:
On Thu, 22 Sep 2016, Waiman Long wrote:
Locking was done mostly by lock stealing. This is where most of the
performance benefit comes from, not optimistic spinning.
How does the lock latency distribution of all this look like and how fair
is the whole thing?
The TO futexes are unfair as can be seen from the min/max thread times listed
above. It took the fastest thread 0.07s to complete all the locking
operations, whereas the slowest one needed 2.65s. However, the situation
reverses when I changed the critical section to a 1us sleep. In this case,
1us sleep is going to add another syscall and therefor scheduling, so what?

Or did you just extend the critical section busy time?

The 1us sleep will cause the spinning to stop and make all the waiters sleep. This is to simulate the extreme case where TO futex may not have the performance advantage.

there will be no optimistic spinning. The performance results for 100k locking
operations were listed below.

wait-wake futex PI futex TO futex
--------------- -------- --------
max time 0.06s 9.32s 4.76s
^^^^ ????

Yes, wait-wake futex is the unfair one in this case.

min time 5.59s 9.36s 5.62s
average time 3.25s 9.35s 5.41s

In this case, the TO futexes are fairer but perform worse than the wait-wake
futexes. That is because the lock handoff mechanism limit the amount of lock
stealing in the TO futexes while the wait-wake futexes have no such
restriction. When I disabled lock handoff, the TO futexes would then perform
similar to the wait-wake futexes.
So the benefit of these new fangled futexes is only there for extreme short
critical sections and a gazillion of threads fighting for the same futex,

Not really. Lock stealing will help performance when a gazillion of threads fighting for the same futex. Optimistic spinning will help to reduce the lock transfer latency because the waiter isn't sleeping no matter the number of threads. One set of data that I haven't shown so far is that the performance delta between wait-wait and TO futexes actually increases as the critical section is lengthened. This is because for short critical section, the waiters of wait-wake futex may not actually go to sleep because of the latency introduced by the code that has to be run before they do a final check to see if the futex value change before going to sleep. The longer the critical section, the higher the chance that they actually sleep and hence their performance is getting worse relative to the TO futexes.

For example, with the critical section of 50 pause instructions instead of 5, the performance gain is about 5X instead of about 1.6X in the latter case.

I really wonder how the average programmer should pick the right flavour,
not to talk about any useful decision for something like glibc to pick the
proper one.

I would say that TO futexes will have better performance in most cases. Of course, I still need to run some real world benchmarks to quantify the effect of the new futexes. I am hoping to get suggestion of what is a good set of benchmarks to run.