Re: [PATCH] sched/fair: Fix wakeup_preempt_fair for not waking up task
From: Furkan Çalışkan
Date: Thu Apr 30 2026 - 05:21:39 EST
Hi K Prateek,
On 4/30/26 10:49, K Prateek Nayak wrote:
> Hello Furkan,
>
> On 4/30/2026 11:46 AM, Furkan Çalışkan wrote:
>> On 4/29/26 19:41, Vincent Guittot wrote:
>>> The assumption that p is always enqueued and not delayed, is only true for
>>> wakeup. If p was moved while sched_delayed, pick_next_entity will dequeue
>>> it during the attach and the cfs might become empty.
>>>
>>> Fixes: ac8e69e69363 ("sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue")
>>> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>>> ---
>>>
>>> I have triggered this while running my latency stress test on a new platform.
>>>
>>> kernel/sched/fair.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 728965851842..99fb524c4922 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -9147,7 +9147,7 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
>>>  	 * Because p is enqueued, nse being null can only mean that we
>>>  	 * dequeued a delayed task.
>>>  	 */
>>> -	if (!nse)
>>> +	if (!nse && (wake_flags & WF_TTWU))
>>>  		goto pick;
>>>
>>>  	if (sched_feat(RUN_TO_PARITY))
>>
>> When a sched_delayed task is migrated (which can only happen via
>> MIGRATE_LOAD per can_migrate_task()), enqueuing it on the dest cpu will
>> call wakeup_preempt_fair immediately, and if the dest cpu is not busy,
>> pick_next_entity() will likely pick and dequeue it immediately. So a
>> wasted enqueue+dequeue pair. Could we skip the enqueue when
>> sched_delayed is set, and defer it to the actual wakeup path?
>
> That requires some considerations - if we are migrating a delayed task
> to an idle CPU, we can readily block the delayed task if we don't have
> other tasks on the migration list.
>
> If the destination is busy, or if we are migrating a bunch of tasks,
> we need to know what the final state of the task_timeline will
> be to make a decision whether it is okay to block them immediately.
>
> We need to know where the avg_vruntime() and deadline ends up to know
> if the task will get picked immediately and we cannot do that without
> going through place_entity + __enqueue_entity().
>
> There is also cgroup implication where, the delayed task might not be
> picked immediately if it is on a cgroup whose entity is not eligible
> and that requires going through the full enqueue + pick.
>
You're right - skipping the enqueue would introduce far more complexity than
the enqueue+dequeue pair costs, since it requires reasoning about the full
migration list, the destination CPU state, cgroup eligibility, and
avg_vruntime placement.
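
A minimal userspace sketch of why the final queue state matters, assuming the
usual EEVDF rule that an entity is eligible when its vruntime is at or below
the load-weighted average vruntime of the queue. This is a toy model, not
kernel code: struct toy_entity and toy_entity_eligible() are hypothetical
names, and placement, deadlines and cgroup hierarchy are all ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a scheduling entity; NOT the kernel's sched_entity. */
struct toy_entity {
	int64_t vruntime;
	int64_t weight;
};

/*
 * Hypothetical helper: is q[idx] eligible on a queue of n entities?
 *
 * Eligible means vruntime <= weighted average vruntime. To avoid the
 * division, test the equivalent form:
 *
 *	sum_i weight_i * (vruntime_i - vruntime_idx) >= 0
 */
static bool toy_entity_eligible(const struct toy_entity *q, size_t n,
				size_t idx)
{
	int64_t sum = 0;
	size_t i;

	assert(idx < n);

	for (i = 0; i < n; i++)
		sum += q[i].weight * (q[i].vruntime - q[idx].vruntime);

	return sum >= 0;
}
```

With a delayed task at vruntime 100, the answer flips with the rest of the
queue: alone it is eligible, next to one task at vruntime 50 it is not, and
with a further task at vruntime 300 it is eligible again. So whether the
delayed task would be picked (and dequeued) cannot be decided before the
whole migration list has been enqueued.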

Thanks for the detailed explanation.