Re: [PATCH] sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask

From: Valentin Schneider
Date: Mon Sep 11 2023 - 18:24:03 EST


Ok, back to this :)

On 15/08/23 16:21, Sebastian Andrzej Siewior wrote:
> What I still observe is:
> - CPU0 is idle. CPU0 gets a task assigned from CPU1. That task receives
> a wakeup. CPU0 returns from idle and schedules the task.
> pull_rt_task() on CPU1, and sometimes on other CPUs, observes this too.
> CPU1 sends irq_work to CPU0 while at that time rto_next_cpu() sees that
> has_pushable_tasks() returns 0. That bit was cleared earlier (as per
> tracing).
>
> - CPU0 is idle. CPU0 gets a task assigned from CPU1. The task on CPU0 is
> woken up without an IPI (yay). But then pull_rt_task() decides to
> send irq_work, and has_pushable_tasks() said that it has tasks left
> so….
> Now: rto_push_irq_work_func() runs once on CPU0, does nothing,
> rto_next_cpu() returns CPU0 again and enqueues itself again on CPU0.
> Usually after the second or third round the scheduler on CPU0 makes
> enough progress to remove the task / clear the CPU from the mask.
>

If CPU0 is selected for the push IPI, then we should have

rd->rto_cpu == CPU0

So per the

cpumask_next(rd->rto_cpu, rd->rto_mask);

in rto_next_cpu(), it shouldn't be able to re-select itself.
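
To make that reasoning concrete, here is a toy userspace model of the
"next set bit strictly after the previous one" lookup. It is only a sketch:
model_cpumask_next() and MODEL_NR_CPUS are hypothetical stand-ins for
cpumask_next() and nr_cpu_ids, and a plain 64-bit word plays the role of
rd->rto_mask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for nr_cpu_ids in this toy model. */
#define MODEL_NR_CPUS 8

/*
 * Toy model of cpumask_next(): return the lowest set bit in @mask that is
 * strictly above @prev, or MODEL_NR_CPUS if there is none.
 * Passing prev == -1 starts the scan at bit 0.
 */
static int model_cpumask_next(int prev, uint64_t mask)
{
	assert(prev >= -1 && prev < MODEL_NR_CPUS);

	for (int cpu = prev + 1; cpu < MODEL_NR_CPUS; cpu++) {
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	}
	return MODEL_NR_CPUS;
}
```

With prev == 0 and only bit 0 set, the model runs off the end of the mask
rather than returning 0 again. It only returns 0 when the scan starts from
-1, so in this model a re-selection would have to go through a reset of the
starting point first.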

Do you have a simple enough reproducer I could use to poke at this?

> I understand that there is a race and the CPU is cleared from rto_mask
> shortly after checking. Therefore I would suggest to look at
> has_pushable_tasks() before returning a CPU in rto_next_cpu() as I did
> just to avoid the interruption which does nothing.
>
> For the second case the irq_work seems to make no progress. I don't see
> any trace_events in hardirq, the mask is cleared outside hardirq (idle
> code). The NEED_RESCHED bit is set for current therefore it doesn't make
> sense to send irq_work to reschedule if the current already has this on
> its agenda.
>
> So what about something like:
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 00e0e50741153..d963408855e25 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2247,8 +2247,23 @@ static int rto_next_cpu(struct root_domain *rd)
>
> rd->rto_cpu = cpu;
>
> - if (cpu < nr_cpu_ids)
> + if (cpu < nr_cpu_ids) {
> + struct task_struct *t;
> +
> + if (!has_pushable_tasks(cpu_rq(cpu)))
> + continue;
> +

IIUC that's just to plug the race between the CPU emptying its
pushable_tasks list and it removing itself from the rto_mask - that looks
fine to me.
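
As a rough illustration of what that check buys us, here is a toy userspace
model, not the kernel code. struct model_rq, model_has_pushable_tasks() and
model_rto_next_cpu() are hypothetical names; a 64-bit word stands in for
rto_mask and a per-CPU counter stands in for the pushable_tasks list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for nr_cpu_ids in this toy model. */
#define MODEL_NR_CPUS 8

/* Toy runqueue: only tracks how many pushable tasks it holds. */
struct model_rq {
	int nr_pushable;
};

static bool model_has_pushable_tasks(const struct model_rq *rq)
{
	return rq->nr_pushable > 0;
}

/*
 * Scan the mask above @prev and skip CPUs whose bit is still set but whose
 * pushable list is already empty (the stale-bit window discussed above).
 * Returns the selected CPU, or -1 if none qualifies.
 */
static int model_rto_next_cpu(int prev, uint64_t rto_mask,
			      const struct model_rq rqs[MODEL_NR_CPUS])
{
	assert(prev >= -1 && prev < MODEL_NR_CPUS);

	for (int cpu = prev + 1; cpu < MODEL_NR_CPUS; cpu++) {
		if (!(rto_mask & (UINT64_C(1) << cpu)))
			continue;
		if (!model_has_pushable_tasks(&rqs[cpu]))
			continue;	/* stale bit: an IPI here would do nothing */
		return cpu;
	}
	return -1;
}
```

A CPU whose bit is set but whose counter is already zero is skipped, which
is the "interruption which does nothing" case from the report.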

> + rcu_read_lock();
> + t = rcu_dereference(rq->curr);
> + /* if (test_preempt_need_resched_cpu(cpu_rq(cpu))) */
> + if (test_tsk_need_resched(t)) {

We need to make sure this doesn't cause us to lose IPIs we actually need.

On entering __schedule() we do call put_prev_task_balance() if the
previous task is RT/DL, and balance_rt() can issue a push IPI, but AFAICT
only if the previous task was the last DL task. So I don't think we can
do this.

> + rcu_read_unlock();
> + continue;
> + }
> + rcu_read_unlock();
> +
> return cpu;
> + }
>
> rd->rto_cpu = -1;
>
> Sebastian