Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil

From: Peter Zijlstra
Date: Thu Nov 29 2018 - 08:12:43 EST



+Cc davidlohr and waiman

On Thu, Nov 29, 2018 at 08:50:30PM +0800, Yongji Xie wrote:
> From: Xie Yongji <xieyongji@xxxxxxxxx>
>
> Our system encountered a problem recently, the khungtaskd detected
> some process hang on mmap_sem. But the odd thing was that one task which
> is not on mmap_sem.wait_list still sleeps in rwsem_down_read_failed().
> Through code inspection, we found a potential bug can lead to this.
>
> Imaging this:
>
> Thread 1 Thread 2
> down_write();
> rwsem_down_read_failed()
> raw_spin_lock_irq(&sem->wait_lock);
> list_add_tail(&waiter.list, &wait_list);
> raw_spin_unlock_irq(&sem->wait_lock);
> __up_write();
> rwsem_wake();
> __rwsem_mark_wake();
> wake_q_add();
> list_del(&waiter->list);
> waiter->task = NULL;
> while (true) {
> set_current_state(TASK_UNINTERRUPTIBLE);
> if (!waiter.task) // true
> break;
> }
> __set_current_state(TASK_RUNNING);
>
> Now Thread 1 is queued in Thread 2's wake_q without sleeping. Then
> Thread 1 call rwsem_down_read_failed() again because Thread 3
> hold the lock, if Thread 3 tries to queue Thread 1 before Thread 2
> do wakeup, it will fail and miss wakeup:
>
> Thread 1 Thread 2 Thread 3
> down_write();
> rwsem_down_read_failed()
> raw_spin_lock_irq(&sem->wait_lock);
> list_add_tail(&waiter.list, &wait_list);
> raw_spin_unlock_irq(&sem->wait_lock);
> __rwsem_mark_wake();
> wake_q_add();
> wake_up_q();
> waiter->task = NULL;
> while (true) {
> set_current_state(TASK_UNINTERRUPTIBLE);
> if (!waiter.task) // false
> break;
> schedule();
> }
> wake_up_q(&wake_q);
>
> In another word, that means we might issue the wakeup before setting the reader
> waiter to nil. If so, the wakeup may do nothing when it was called before reader
> set task state to TASK_UNINTERRUPTIBLE. Then we would have no chance to wake up
> the reader any more, and cause other writers such as "ps" command stuck on it.
>
> This patch is not verified because we still have no way to reproduce the problem.
> But I'd like to ask for some comments from community firstly.

Urgh; so the case where the cmpxchg() fails because it already has a
wakeup in progress, which then 'violates' our expectation of when the
wakeup happens.

Yes, I think this is real, and worse, I think we need to go audit all
wake_q_add() users and document this behaviour.

In the ideal case we'd delay the actual wakeup to the last wake_up_q(),
but I don't think we can easily fix that.

> Signed-off-by: Xie Yongji <xieyongji@xxxxxxxxx>
> Signed-off-by: Zhang Yu <zhangyu31@xxxxxxxxx>
> ---
> kernel/locking/rwsem-xadd.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index 09b1800..50d9af6 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -198,15 +198,22 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
> woken++;
> tsk = waiter->task;
>
> - wake_q_add(wake_q, tsk);
> + get_task_struct(tsk);
> list_del(&waiter->list);
> /*
> - * Ensure that the last operation is setting the reader
> + * Ensure calling get_task_struct() before setting the reader
> * waiter to nil such that rwsem_down_read_failed() cannot
> * race with do_exit() by always holding a reference count
> * to the task to wakeup.
> */
> smp_store_release(&waiter->task, NULL);
> + /*
> + * Ensure issuing the wakeup (either by us or someone else)
> + * after setting the reader waiter to nil.
> + */
> + wake_q_add(wake_q, tsk);
> + /* wake_q_add() already take the task ref */
> + put_task_struct(tsk);
> }
>
> adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
> --
> 2.2.3
>