I'd be interested in seeing runs where the average number of waiters is 0.2, 0.5, 1, and 2, corresponding to moderate-to-bad contention.
An average of 25 waiters on compute-bound code means the application needs to be rewritten; no amount of mutex tweaking will help it.
Perhaps something like NR_CPUS threads would be of more interest?
That seems artificial.
How so? Several real-world applications use one thread per CPU to receive dispatched work, wait for events, etc.
Does the wakeup code select the spinning waiter, or just a random waiter?
The wakeup code selects the highest-priority task, in FIFO order, to wake up. However, under contention that task is most likely going to go back to sleep, as another waiter will steal the lock out from under it. This locking strategy is unashamedly about as "unfair" as it gets.
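To make the stealing concrete, here is a minimal user-space sketch in C11 atomics. It is not the futex code under discussion; toy_lock, toy_trylock(), toy_unlock() and replay_handoff() are hypothetical names, and the interleaving is replayed in a single thread so the outcome is deterministic. The point is only that the release does not hand ownership to the woken task: whoever runs the cmpxchg first gets the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: 0 = free, 1 = held. */
typedef struct {
    atomic_int state;
} toy_lock;

/* One acquisition attempt; succeeds only if the lock is free right now. */
static bool toy_trylock(toy_lock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->state, &expected, 1);
}

/* Release. Ownership is NOT passed to any particular waiter. */
static void toy_unlock(toy_lock *l)
{
    assert(atomic_load(&l->state) == 1);   /* unlocking a free lock is a bug */
    atomic_store(&l->state, 0);
}

/*
 * Replay one release in a single thread. If stealer_runs_first is set,
 * an already-running task attempts the lock in the window between the
 * release and the woken waiter's first attempt.
 * Returns true if the woken waiter ended up with the lock.
 */
static bool replay_handoff(toy_lock *l, bool stealer_runs_first)
{
    toy_unlock(l);                 /* owner releases; wakeup is in flight */
    if (stealer_runs_first)
        (void)toy_trylock(l);      /* running task grabs the free lock */
    return toy_trylock(l);         /* woken waiter finally gets to try */
}
```

With stealer_runs_first set, the woken waiter's attempt fails and it would have to queue up and sleep again, which is the unfairness described above.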
Best to avoid the wakeup if we notice the lock was stolen.
You really can't do this precisely. You can read the futex value at various points along the wakeup path, but at some point you have to commit to waking a task, and you still have a race between the time you call wake_up_task() and the time the task is scheduled and attempts the cmpxchg itself.
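A rough sketch of why the check can only ever be a heuristic, again as a single-threaded replay in user-space C11 rather than kernel code. All names here (should_wake(), try_acquire(), replay_wakeup(), the wake_result values) are made up for illustration, and the two steal flags stand in for another task winning the lock before or after the waker looks at the lock word.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { LOCK_FREE = 0, LOCK_HELD = 1 };

enum wake_result { WAKE_SKIPPED, WAKE_USEFUL, WAKE_WASTED };

/* Waker side: skip the wakeup if the lock already looks stolen. */
static bool should_wake(atomic_int *word)
{
    return atomic_load(word) == LOCK_FREE;
}

/* The cmpxchg that any contender (thief or woken waiter) runs. */
static bool try_acquire(atomic_int *word)
{
    int expected = LOCK_FREE;
    return atomic_compare_exchange_strong(word, &expected, LOCK_HELD);
}

/*
 * Replay one wakeup decision. The lock word must be free on entry,
 * i.e. the previous owner has just released it.
 */
static enum wake_result replay_wakeup(atomic_int *word,
                                      bool steal_before_check,
                                      bool steal_after_check)
{
    assert(atomic_load(word) == LOCK_FREE);

    if (steal_before_check)
        (void)try_acquire(word);   /* thief wins before we look */

    if (!should_wake(word))
        return WAKE_SKIPPED;       /* the check caught this steal */

    /* Committed to the wakeup. The window stays open until the
     * woken task is actually scheduled and runs its own cmpxchg. */
    if (steal_after_check)
        (void)try_acquire(word);   /* thief wins after we looked */

    return try_acquire(word) ? WAKE_USEFUL : WAKE_WASTED;
}
```

The check saves the wakeup only when the steal lands before the read. A steal after the read still produces a wasted wakeup, and moving the read later narrows that window without closing it.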