Re: [RFC] locking/mutex: Fix starvation of sleeping waiters

From: Jason Low
Date: Mon Jul 18 2016 - 13:48:03 EST


On Mon, 2016-07-18 at 19:15 +0200, Peter Zijlstra wrote:
> On Mon, Jul 18, 2016 at 07:16:47PM +0300, Imre Deak wrote:
> > Currently a thread sleeping on a mutex wait queue can be delayed
> > indefinitely by other threads managing to steal the lock, that is
> > acquiring the lock out-of-order before the sleepers. I noticed this via
> > a testcase (see the Reference: below) where one CPU was unlocking /
> > relocking a mutex in a tight loop while another CPU was delayed
> > indefinitely trying to wake up and get the lock but losing out to the
> > first CPU and going back to sleep:
> >
> > CPU0:                        CPU1:
> > mutex_lock->acquire
> >                              mutex_lock->sleep
> > mutex_unlock->wake CPU1
> >                              wakeup
> > mutex_lock->acquire
> >                              trylock fail->sleep
> > mutex_unlock->wake CPU1
> >                              wakeup
> > mutex_lock->acquire
> >                              trylock fail->sleep
> > ...                          ...
> >
> > To fix this we can make sure that CPU1 makes progress by avoiding the
> > fastpath locking, optimistic spinning and trylocking if there is any
> > waiter on the list. The corresponding check can be done without holding
> > wait_lock, since the goal is only to make sure sleepers make progress
> > and not to guarantee that the locking will happen in FIFO order.
>
> I think we went over this before; that will also completely destroy
> performance under a number of workloads.

Yup, once a thread becomes a waiter, all other threads will need to
follow suit, so this change would effectively disable optimistic
spinning in some workloads.
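
To make that concrete, here is a minimal userspace sketch of the two
policies. It is a toy model under stated assumptions, not the kernel
code: struct toy_mutex and the toy_* helpers are made-up names, and the
waiters counter merely stands in for a non-empty wait list. It only
shows that with the proposed check a single queued waiter is enough to
turn away every later fast-path attempt, even while the lock is free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model only: this is not the kernel's struct mutex. */
struct toy_mutex {
	atomic_int locked;	/* 0 = free, 1 = held */
	atomic_int waiters;	/* queued sleepers; stands in for the wait list */
};

/* Stealing allowed: any caller may take a free lock, waiters or not. */
static bool toy_trylock_steal(struct toy_mutex *m)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&m->locked, &expected, 1);
}

/*
 * RFC-style policy: skip the fast path whenever a waiter is queued.
 * The check is done without any queue lock, since the aim is progress
 * for sleepers rather than strict FIFO ordering.
 */
static bool toy_trylock_no_steal(struct toy_mutex *m)
{
	if (atomic_load(&m->waiters) > 0)
		return false;	/* caller would have to queue up and sleep */

	return toy_trylock_steal(m);
}

static void toy_unlock(struct toy_mutex *m)
{
	atomic_store(&m->locked, 0);
}

/* Single-threaded walk-through of both policies. */
void toy_demo(void)
{
	struct toy_mutex m = { 0, 0 };

	atomic_store(&m.waiters, 1);		/* one sleeper is queued */

	assert(toy_trylock_steal(&m));		/* lock is free: steal succeeds */
	toy_unlock(&m);

	assert(!toy_trylock_no_steal(&m));	/* lock is free, yet refused */

	atomic_store(&m.waiters, 0);		/* sleeper has left the queue */
	assert(toy_trylock_no_steal(&m));	/* fast path works again */
	toy_unlock(&m);
}
```

In the no-steal variant every caller that arrives while waiters is
non-zero has to become a waiter itself, which is the "follow suit"
effect described above.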

A few months ago, we worked on patches that allow the waiter to return
to optimistic spinning to help reduce starvation. Longman sent out a
version 3 patch set, and it sounded like we were fine with the concept.

Jason