Re: [PATCH] sched/fair: Fix detection of per-CPU kthreads waking a task

From: Vincent Guittot
Date: Thu Nov 25 2021 - 08:34:59 EST


On Thu, 25 Nov 2021 at 12:16, Valentin Schneider
<Valentin.Schneider@xxxxxxx> wrote:
>
> On 25/11/21 10:05, Vincent Guittot wrote:
> > On Wed, 24 Nov 2021 at 16:42, Vincent Donnefort
> > <vincent.donnefort@xxxxxxx> wrote:
> >>
> >> select_idle_sibling() will return prev_cpu for the case where the task is
> >> woken up by a per-CPU kthread. However, the idle task has recently been
> >> changed so that it too is identified by is_per_cpu_kthread(), which breaks
> >> the behaviour described above. Using !is_idle_task() ensures we do not
> >> spuriously trigger that select_idle_sibling() exit path.
> >>
> >> Fixes: 00b89fe0197f ("sched: Make the idle task quack like a per-CPU kthread")
> >> Signed-off-by: Vincent Donnefort <vincent.donnefort@xxxxxxx>
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 945d987246c5..8bf95b0e368d 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6399,6 +6399,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >>  	 * pattern is IO completions.
> >>  	 */
> >>  	if (is_per_cpu_kthread(current) &&
> >> +	    !is_idle_task(current) &&
> >>  	    prev == smp_processor_id() &&
> >>  	    this_rq()->nr_running <= 1) {
> >>  		return prev;
> >
> > AFAICT, this can't happen on a symmetric system, because prev would
> > already have been returned by the earlier conditions.
> > Only an asymmetric system can face such a situation, when the task
> > doesn't fit, which is the subject of your other patch.
> > So this patch seems irrelevant outside of that one.
> >
>
> I think you can still hit this on a symmetric system; let me try to
> reformulate my other email.
>
> If this (non-patched) condition evaluates to true, it means the previous
> condition
>
> (available_idle_cpu(target) || sched_idle_cpu(target)) &&
> asym_fits_capacity(task_util, target)
>
> evaluated to false, so for a symmetric system target sure isn't idle.
>
> prev == smp_processor_id() implies prev == target, IOW prev isn't
> idle. Now, consider:
>
> p0.prev = CPU1
> p1.prev = CPU1
>
>   CPU0                                  CPU1
>   current = don't care                  current = swapper/1
>
>   ttwu(p1)
>     ttwu_queue(p1, CPU1)
>     // or
>     ttwu_queue_wakelist(p1, CPU1)
>
>                                         hrtimer_wakeup()
>                                           wake_up_process()
>                                             ttwu()
>                                               idle_cpu(CPU1)? no
>
>                                               is_per_cpu_kthread(current)? yes
>                                               prev == smp_processor_id()? yes
>                                               this_rq()->nr_running <= 1? yes
>                                               => self enqueue
>
>                                         ...
>                                         schedule_idle()
>
> This works if CPU0 does either a full enqueue (rq->nr_running == 1) or just
> a wakelist enqueue (rq->ttwu_pending > 0). If there was an idle CPU3
> around, we'd still be stacking p0 and p1 onto CPU1.
>
> IOW this opens a window between a remote ttwu() and the idle task invoking
> schedule_idle() where the idle task can stack more tasks onto its CPU.

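To spell out which conditions the window you describe goes through, here is
a rough sketch of the relevant checks in select_idle_sibling() (based on the
fair.c of that era, with the intermediate checks elided and the comments
added for this discussion), evaluated on CPU1 while swapper/1 is still
current:

	/* p0 wakes up on CPU1, with target == prev == CPU1 */
	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
	    asym_fits_capacity(task_util, target))
		return target;			/* not taken: CPU1 is no longer idle  */

	...

	if (is_per_cpu_kthread(current) &&	/* true: current is swapper/1         */
	    prev == smp_processor_id() &&	/* true: the hrtimer fired on CPU1    */
	    this_rq()->nr_running <= 1) {	/* true: at most p1's remote enqueue  */
		return prev;			/* taken: p0 is stacked onto CPU1     */
	}
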
Your use case above is outside the scope of this patch and has always been
possible, even with other per-CPU kthreads. In that case, the wakeup is not
triggered by current (the idle task or another per-CPU kthread) but by an
interrupt (an hrtimer in your example). If we want to filter out wakeups
generated from interrupt context while a per-CPU kthread is running, it
would be better to fix all cases and test the running context, like this:

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6397,7 +6397,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 * essentially a sync wakeup. An obvious example of this
 	 * pattern is IO completions.
 	 */
-	if (is_per_cpu_kthread(current) &&
+	if (!in_interrupt() &&
+	    is_per_cpu_kthread(current) &&
 	    prev == smp_processor_id() &&
 	    this_rq()->nr_running <= 1) {
 		return prev;

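To make the intent explicit, the condition above could be read as the
following helper; the name and the factoring are hypothetical and only for
illustration, the actual change is just the added !in_interrupt() test:

/*
 * Take the "stay on prev" shortcut only when the wakeup is issued by a
 * per-CPU kthread running in task context (e.g. a kworker completing IO),
 * not when it comes from IRQ context while the idle task or another
 * per-CPU kthread merely happens to be current.
 */
static inline bool wakeup_from_sync_percpu_kthread(int prev)
{
	return !in_interrupt() &&
	       is_per_cpu_kthread(current) &&
	       prev == smp_processor_id() &&
	       this_rq()->nr_running <= 1;
}

In your scenario the wakeup runs from the hrtimer interrupt on CPU1, so
in_interrupt() is true and the shortcut is skipped, while a kworker waking
a task from process context still takes it.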