Re: [PATCH] sched/rt.c: pick and check task if double_lock_balance() unlock the rq

From: zhouchengming
Date: Fri Oct 06 2017 - 23:34:59 EST


Hi Steven, Peter,

On 2017/9/26 11:18, Steven Rostedt wrote:
On Tue, 26 Sep 2017 09:23:20 +0800
zhouchengming<zhouchengming1@xxxxxxxxxx> wrote:

On 2017/9/26 3:40, Steven Rostedt wrote:
On Mon, 11 Sep 2017 14:51:49 +0800
Zhou Chengming<zhouchengming1@xxxxxxxxxx> wrote:

push_rt_task() pick the first pushable task and find an eligible
lowest_rq, then double_lock_balance(rq, lowest_rq). So if
double_lock_balance() unlock the rq (when double_lock_balance() return 1),
we have to check if this task is still on the rq.

The problem is that the check conditions are not sufficient:

if (unlikely(task_rq(task) != rq ||
!cpumask_test_cpu(lowest_rq->cpu,&task->cpus_allowed) ||
task_running(rq, task) ||
!rt_task(task) ||
!task_on_rq_queued(task))) {

cpu2 cpu1 cpu0
push_rt_task(rq1)
pick task_A on rq1
find rq0
double_lock_balance(rq1, rq0)
unlock(rq1)
rq1 __schedule
pick task_A run
task_A sleep (dequeued)
lock(rq0)
lock(rq1)
do_above_check(task_A)
task_rq(task_A) == rq1
cpus_allowed unchanged
task_running == false
rt_task(task_A) == true
try_to_wake_up(task_A)
select_cpu = cpu3
enqueue(rq3, task_A)
How can this happen? The try_to_wake_up(task_A) needs to grab the rq
that task A is on, and we have that rq lock.

/me confused.

-- Steve
Thanks for the reply!
After the task_A sleep on cpu1, the try_to_wake_up(task_A) on cpu0 select a different cpu3,
so it will grab the rq3 lock, not the rq1 lock.
Ah crap. This is caused by 7608dec2ce20 ("sched: Drop the rq argument
to sched_class::select_task_rq()"). Because this code depends on
try_to_wake_up() grabbing the task's rq lock. But it no longer does
that, and it causes this race.

OK, I need to look at this deeper when I'm not so jetlagged and typing
this because I can't sleep at 5am.

Thanks for pointing this out!

It may be fixed by simply grabbing the run queue lock on migration, as
that would sync things up.

Is there any new solution? I don't think grabbing the rq lock without the task->pi_lock
will fix this problem. And I think my patch is correct and the changes are small.

Thanks!

Peter?


-- Steve



.