Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
From: Thomas Gleixner
Date: Fri Sep 23 2016 - 09:04:48 EST
On Thu, 22 Sep 2016, Waiman Long wrote:
> On 09/22/2016 04:38 PM, Thomas Gleixner wrote:
> > >                wait-wake futex    PI futex    TO futex
> > >                ---------------    --------    --------
> > > max time            3.49s          50.91s       2.65s
> > > min time            3.24s          50.84s       0.07s
> > > average time        3.41s          50.90s       1.84s
> > > sys time          7m22.4s          55.73s     2m32.9s
> > That's really interesting. Do you have any explanation for this massive
> > system time differences?
>
> For the wait-wake futexes, a waiter is kicked back to user space with
> EAGAIN, without sleeping, whenever the futex value changes before it
> acquires the hash bucket spinlock. Such a waiter will attempt to acquire
> the lock again in user space. If that fails, it will call FUTEX_WAIT again.
>
> In the above test case, the critical section is pretty short and about 4/5
> of the FUTEX_WAIT calls resulted in an EAGAIN error. So there are a lot
> more user-space/kernel transitions than with the TO futexes. I think the
> difference in the number of FUTEX_WAIT calls vs. FUTEX_LOCK_TO calls
> accounts for the difference in their sys times.
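For reference, a minimal C sketch of the user-space retry loop being described (illustrative only, not the actual glibc or kernel code; the futex word layout 0 = unlocked / 1 = locked is an assumption):

```c
/* Illustrative sketch only -- not the actual glibc or kernel code.
 * Futex word: 0 = unlocked, 1 = locked (assumed layout).
 * If the futex value changes between the failed user-space cmpxchg and
 * the kernel taking the hash bucket spinlock, FUTEX_WAIT returns EAGAIN
 * and the waiter loops straight back to user space to retry. */
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex_wait(atomic_int *uaddr, int val)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void lock_slowpath(atomic_int *futex)
{
	for (;;) {
		int unlocked = 0;

		if (atomic_compare_exchange_strong(futex, &unlocked, 1))
			return;		/* acquired in user space */
		if (futex_wait(futex, 1) < 0 && errno == EAGAIN)
			continue;	/* value changed: kicked back, retry */
		/* woken (or spurious wakeup): retry the cmpxchg */
	}
}
```

Each EAGAIN return is one extra round trip through the syscall boundary, which is where the extra sys time accumulates.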
Ok.
> As for the PI futex, it is a strictly sleeping lock and so won't use that
> much sys time.
That's clear.
> > > Locking was done mostly by lock stealing. This is where most of the
> > > performance benefit comes from, not optimistic spinning.
> > How does the lock latency distribution of all this look like and how fair
> > is the whole thing?
>
> The TO futexes are unfair as can be seen from the min/max thread times listed
> above. It took the fastest thread 0.07s to complete all the locking
> operations, whereas the slowest one needed 2.65s. However, the situation
> reverses when I changed the critical section to a 1us sleep. In this case,
A 1us sleep is going to add another syscall and therefore scheduling, so what?
Or did you just extend the critical section busy time?
> there will be no optimistic spinning. The performance results for 100k
> locking operations are listed below.
>
>                  wait-wake futex    PI futex    TO futex
>                  ---------------    --------    --------
> max time              0.06s           9.32s       4.76s
                        ^^^^ ????
> min time              5.59s           9.36s       5.62s
> average time          3.25s           9.35s       5.41s
>
> In this case, the TO futexes are fairer but perform worse than the
> wait-wake futexes. That is because the lock handoff mechanism limits the
> amount of lock stealing in the TO futexes, while the wait-wake futexes
> have no such restriction. When I disabled lock handoff, the TO futexes
> performed similarly to the wait-wake futexes.
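To make the stealing-vs-handoff distinction concrete, a hedged sketch in C (the futex-word bit layout and the two helpers here are purely hypothetical, not the layout or code used by the patch):

```c
/* Hypothetical futex-word layout for illustration only:
 * bit 0 = locked, bit 1 = handoff pending (lock reserved for the
 * designated next owner). This is NOT the layout used by the patch. */
#include <stdatomic.h>
#include <stdbool.h>

enum { LOCKED = 1, HANDOFF = 2 };

static atomic_uint fword;

/* A late arrival may steal only a fully free word; once the unlocker
 * has set HANDOFF, the word is reserved and the steal attempt fails. */
static bool try_steal(void)
{
	unsigned int expected = 0;

	return atomic_compare_exchange_strong(&fword, &expected, LOCKED);
}

/* Only the designated waiter can clear HANDOFF and take the lock. */
static bool take_handoff(void)
{
	unsigned int expected = HANDOFF;

	return atomic_compare_exchange_strong(&fword, &expected, LOCKED);
}
```

The trade-off described above falls out of this: handoff bounds how long a waiter can be starved, but it also forbids exactly the cheap steals that make wait-wake futexes fast under contention.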
So the benefit of these new fangled futexes is only there for extreme short
critical sections and a gazillion of threads fighting for the same futex,
right?
I really wonder how the average programmer should pick the right flavour,
not to talk about any useful decision for something like glibc to pick the
proper one.
Thanks,
tglx