Re: [PATCH] sched/fair: Fix detection of per-CPU kthreads waking a task

From: Dietmar Eggemann
Date: Thu Nov 25 2021 - 08:20:07 EST


On 25.11.21 12:16, Valentin Schneider wrote:
> On 25/11/21 10:05, Vincent Guittot wrote:
>> On Wed, 24 Nov 2021 at 16:42, Vincent Donnefort
>> <vincent.donnefort@xxxxxxx> wrote:
>>>
>>> select_idle_sibling() returns prev_cpu when the waker is a per-CPU
>>> kthread. However, since commit 00b89fe0197f the idle task is also
>>> identified by is_per_cpu_kthread(), which breaks the behaviour described
>>> above: a wakeup issued from the idle task's context spuriously takes
>>> that exit path. Adding !is_idle_task() ensures we no longer do so.
>>>
>>> Fixes: 00b89fe0197f ("sched: Make the idle task quack like a per-CPU kthread")
>>> Signed-off-by: Vincent Donnefort <vincent.donnefort@xxxxxxx>
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 945d987246c5..8bf95b0e368d 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6399,6 +6399,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>>          * pattern is IO completions.
>>>          */
>>>         if (is_per_cpu_kthread(current) &&
>>> +           !is_idle_task(current) &&
>>>             prev == smp_processor_id() &&
>>>             this_rq()->nr_running <= 1) {
>>>                 return prev;
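
For context: since 00b89fe0197f the idle task has PF_KTHREAD set and is
affined to a single CPU, so it satisfies is_per_cpu_kthread(). Condensed,
the two helpers involved look like this in the current tree
(kernel/sched/sched.h and include/linux/sched.h):

    static inline bool is_per_cpu_kthread(struct task_struct *p)
    {
            if (!(p->flags & PF_KTHREAD))
                    return false;

            if (p->nr_cpus_allowed != 1)
                    return false;

            return true;
    }

    static inline bool is_idle_task(const struct task_struct *p)
    {
            return !!(p->flags & PF_IDLE);
    }

With the extra !is_idle_task(current) check, swapper no longer matches.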
>>
>> AFAICT, this can't happen on a symmetric system because one of the
>> earlier conditions would already have returned a CPU.
>> Only an asymmetric system can face such a situation, when the task
>> doesn't fit, which is the subject of your other patch.
>> So this patch seems irrelevant outside of that one.
>>
>
> I think you can still hit this on a symmetric system; let me try to
> reformulate my other email.
>
> If this (non-patched) condition evaluates to true, it means the previous
> condition
>
>     (available_idle_cpu(target) || sched_idle_cpu(target)) &&
>     asym_fits_capacity(task_util, target)
>
> evaluated to false, so on a symmetric system target definitely isn't
> idle.
>
> prev == smp_processor_id() implies prev == target, IOW prev isn't
> idle. Now, consider:
>
> p0.prev = CPU1
> p1.prev = CPU1
>
>   CPU0                        CPU1
>   current = don't care        current = swapper/1
>
>   ttwu(p1)
>     ttwu_queue(p1, CPU1)
>     // or
>     ttwu_queue_wakelist(p1, CPU1)
>
>                               hrtimer_wakeup()
>                                 wake_up_process()
>                                   ttwu()
>                                     idle_cpu(CPU1)? no
>
>                                     is_per_cpu_kthread(current)? yes
>                                     prev == smp_processor_id()? yes
>                                     this_rq()->nr_running <= 1? yes
>                                     => self enqueue
>
>                               ...
>                               schedule_idle()
>
> This happens whether CPU0 did a full enqueue (rq->nr_running == 1) or
> just a wakelist enqueue (rq->ttwu_pending > 0). Even if there was an
> idle CPU3 around, we'd still be stacking p0 and p1 onto CPU1.
>
> IOW, this opens a window between a remote ttwu() and the idle task
> invoking schedule_idle() in which the idle task can stack more tasks
> onto its own CPU.

I can see this happening on my HiKey620 (symmetric) when `this == prev ==
target`.

available_idle_cpu(target) returns 0: rq->curr is rq->idle, but
rq->nr_running is already 1.
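
For reference, idle_cpu() (kernel/sched/core.c in the current tree)
returns 0 on either kind of pending enqueue Valentin mentioned:

    int idle_cpu(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);

            /* The idle task must be current ... */
            if (rq->curr != rq->idle)
                    return 0;

            /* ... nothing may be fully enqueued ... */
            if (rq->nr_running)
                    return 0;

    #ifdef CONFIG_SMP
            /* ... and no wakelist enqueue may be pending. */
            if (rq->ttwu_pending)
                    return 0;
    #endif

            return 1;
    }

Here the first check passes (rq->curr == rq->idle) but rq->nr_running
trips the second one.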

I added a trace_printk() to the `if (is_per_cpu_kthread(current) && ...)`
condition in sis() to dump the state whenever this exit path is taken.
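
The debug hunk, roughly reconstructed (the exact format string is an
assumption, matched to the output below):

    if (is_per_cpu_kthread(current) &&
        prev == smp_processor_id() &&
        this_rq()->nr_running <= 1) {
            /* Debug only: dump the state that let this path trigger. */
            trace_printk("this=%d prev=%d target=%d rq->curr=[%s %d] rq->nr_running=%u p=[%s %d] current=[%s %d]\n",
                         smp_processor_id(), prev, target,
                         this_rq()->curr->comm, this_rq()->curr->pid,
                         this_rq()->nr_running,
                         p->comm, p->pid, current->comm, current->pid);
            return prev;
    }

It produced: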

<idle>-0 [005] this=5 prev=5 target=5 rq->curr=[swapper/5 0] rq->nr_running=1 p=[kworker/u16:3 89] current=[swapper/5 0]
<idle>-0 [007] this=7 prev=7 target=7 rq->curr=[swapper/7 0] rq->nr_running=1 p=[rcu_preempt 11] current=[swapper/7 0]
<idle>-0 [005] this=5 prev=5 target=5 rq->curr=[swapper/5 0] rq->nr_running=1 p=[kworker/u16:1 74] current=[swapper/5 0]
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                      available_idle_cpu(target)->idle_cpu(target)