Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

From: Oleg Nesterov
Date: Tue Aug 30 2016 - 08:59:05 EST


On 08/30, Balbir Singh wrote:
>
> The origin of the issue I've seen seems to be related to
> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> following state
>
> I have a system with multiple threads and
>
> Most of the threads are stuck doing
>
> [67272.593915] --- interrupt: e81 at _raw_spin_lock_irqsave+0xa4/0x130
> [67272.593915] LR = _raw_spin_lock_irqsave+0x9c/0x130
> [67272.700996] [c000000012857ae0] [c00000000012453c] rwsem_wake+0xcc/0x110
> [67272.749283] [c000000012857b20] [c0000000001215d8] up_write+0x78/0x90
> [67272.788965] [c000000012857b50] [c00000000028153c] unlink_anon_vmas+0x15c/0x2c0
> [67272.798782] [c000000012857bc0] [c00000000026f5c0] free_pgtables+0xf0/0x1c0
> [67272.842528] [c000000012857c10] [c00000000027c9a0] exit_mmap+0x100/0x1a0
> [67272.872947] [c000000012857cd0] [c0000000000b4a98] mmput+0xa8/0x1b0
> [67272.898432] [c000000012857d00] [c0000000000bc50c] do_exit+0x33c/0xc30
> [67272.944721] [c000000012857dc0] [c0000000000bcee4] do_group_exit+0x64/0x100
> [67272.969014] [c000000012857e00] [c0000000000bcfac] SyS_exit_group+0x2c/0x30
> [67272.978971] [c000000012857e30] [c000000000009204] system_call+0x38/0xb4
> [67272.999016] Instruction dump:
>
> They are spinning on the sem->wait_lock, the holder of sem->wait_lock has
> irq's disabled and is doing
>
> [c00000037930fb30] c0000000000f724c try_to_wake_up+0x6c/0x570
> [c00000037930fbb0] c000000000124328 __rwsem_do_wake+0x1f8/0x260
> [c00000037930fc00] c0000000001244b4 rwsem_wake+0x84/0x110
> [c00000037930fc40] c000000000121598 up_write+0x78/0x90
> [c00000037930fc70] c000000000281a54 anon_vma_fork+0x184/0x1d0
> [c00000037930fcc0] c0000000000b68e0 copy_process.isra.5+0x14c0/0x1870
> [c00000037930fda0] c0000000000b6e68 _do_fork+0xa8/0x4b0
> [c00000037930fe30] c000000000009460 ppc_clone+0x8/0xc
>
> The offset of try_to_wake_up is actually misleading, it is actually stuck
> doing the following in try_to_wake_up
>
> while (p->on_cpu)
> cpu_relax();
>
> Analysis
>
> The issue is triggered, due to the following race
>
> CPU1 CPU2
>
> while () {
> if (cond)
> break;
> do {
> schedule();
> set_current_state(TASK_UN..)
> } while (!cond);
> rwsem_wake()
> spin_lock_irqsave(wait_lock)
> raw_spin_lock_irqsave(wait_lock) wake_up_process()
> } try_to_wake_up()
> set_current_state(TASK_RUNNING); ..
> list_del(&waiter.list);
>
> CPU2 wakes up CPU1, but before it can get the wait_lock and set
> current state to TASK_RUNNING the following occurs
>
> CPU3
> (stole the rwsem before waiter can be woken up from queue)
> up_write()
> rwsem_wake()
> raw_spin_lock_irqsave(wait_lock)
> if (!list_empty)
> wake_up_process()
> try_to_wake_up()
> raw_spin_lock_irqsave(p->pi_lock)
> ..
> if (p->on_rq && ttwu_wakeup())
> ..
> while (p->on_cpu)
> cpu_relax()
> ..
>
> CPU3 tries to wake up the task on CPU1 again since it finds
> it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
> after CPU2, CPU3 got it.
>
> CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
> the task is spinning on the wait_lock. Interestingly since p->on_rq
> is checked under pi_lock, I've noticed that try_to_wake_up() finds
> p->on_rq to be 0. This was the most confusing bit of the analysis,
> but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
> check is not reliable without this fix IMHO. The race is visible
> (based on the analysis) only when ttwu_queue() does a remote wakeup
> via ttwu_queue_remote. In which case the p->on_rq change is not
> done uder the pi_lock.
>
> The result is that after a while the entire system locks up on
> the raw_spin_irqlock_save(wait_lock) and the holder spins infintely
>
> Reproduction of the issue
>
> The issue can be reproduced after a long run on my system with 80
> threads and having to tweak available memory to very low and running
> memory stress-ng mmapfork test. It usually takes a long time to
> reproduce. I am trying to work on a test case that can reproduce
> the issue faster, but thats work in progress. I am still testing the
> changes on my still in a loop and the tests seem OK thus far.
>
> Big thanks to Benjamin and Nick for helping debug this as well.
> Ben helped catch the missing barrier, Nick caught every missing
> bit in my theory
>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Nicholas Piggin <npiggin@xxxxxxxxx>
> Cc: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
>
> Signed-off-by: Balbir Singh <bsingharora@xxxxxxxxx>
> ---
> kernel/sched/core.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2a906f2..582c684 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> success = 1; /* we're going to change ->state */
> cpu = task_cpu(p);
>
> + /*
> + * Ensure we see on_rq and p_state consistently
> + *
> + * For example in __rwsem_down_write_failed(), we have
> + * [S] ->on_rq = 1 [L] ->state
> + * MB RMB
> + * [S] ->state = TASK_UNINTERRUPTIBLE [L] ->on_rq
> + * In the absence of the RMB p->on_rq can be observed to be 0
> + * and we end up spinning indefinitely in while (p->on_cpu)
> + */
> + smp_rmb();

I think the patch is fine... but unless I am totally confused this
is not specific to __rwsem_down_write_failed(). ttwu() can hang the
same way if the target simply does

schedule_timeout();
current->state = TASK_INTERRUPTIBLE;
current->state = TASK_RUNNING;

And. I am not sure I understand where this MB above comes from.

Oleg.