Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled

From: K Prateek Nayak
Date: Thu Nov 28 2024 - 23:28:36 EST


Hello Adam,

On 11/29/2024 8:51 AM, Adam Li wrote:
On 11/28/2024 3:29 PM, K Prateek Nayak wrote:
Hello Adam,

Hi Prateek,
Thanks for the comments.

On 11/27/2024 11:26 AM, Adam Li wrote:
Enabling NEXT_BUDDY triggers a warning and an RCU stall:

[  124.977300] cfs_rq->next->sched_delayed

I could reproduce this with a run of "perf bench sched messaging" but
given that we hit this warning, it also means that either
set_next_buddy() has incorrectly set a delayed entity as next buddy, or
clear_next_buddy() did not clear a delayed entity.

Yes. The logic of this patch is that a delayed entity should not be set as the next buddy.

I also see PSI splats like:

    psi: inconsistent task state! task=2524:kworker/u1028:2 cpu=154 psi_flags=10 clear=14 set=0

but the PSI flags it has set "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL)" and
the flags it is trying to clear
"(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL | TSK_RUNNING)" seem to be only
possible if you have picked a dequeued entity for running before its
wakeup, which is also perhaps why the "nr_running" computation goes awry
and pick_eevdf() returns NULL (which it should never do, since
pick_next_entity() is only called when rq->cfs.nr_running is > 0).
IIUC, one path for pick_eevdf() to return NULL is:
pick_eevdf():
<snip>
	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
		curr = NULL; <--- curr is set to NULL

"on_rq" is only cleared when the entity is dequeued so "curr" is in fact
going to sleep (proper sleep) and we've reached pick_eevdf();
otherwise, if "curr" is not eligible, there is at least one more task
on the cfs_rq, which implies "best" has been found and will be non-NULL.

<snip>
found:
	if (!best || (curr && entity_before(curr, best)))
		best = curr; <--- curr and best are both NULL

Say "curr" is going to sleep, and there is no "best", in which case
"curr" is already blocked and "cfs_rq->nr_running" should be 0 and it
should not have reached pick_eevdf() in the first place since
pick_next_entity() is only called by pick_task_fair() if
"cfs_rq->nr_running" is non-zero.

So as long as "cfs_rq->nr_running" is non-zero, pick_eevdf() should
return a valid runnable entity. Failure to do so perhaps points to
"entity_eligible()" check going sideways somewhere or a bug in
"nr_running" accounting.

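The reasoning above can be sketched as a small user-space model. This is
not kernel code: struct toy_se, toy_pick(), and the flat array standing in
for the rbtree are hypothetical names made up for illustration, the
"eligible" field merely stands in for entity_eligible(), and the
single-entity shortcut in the real function is left out. Under those
assumptions, the model can only return NULL when nothing else is queued, or
when every queued entity is (wrongly) reported as ineligible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sched_entity: only what the argument needs. */
struct toy_se {
	bool on_rq;     /* false once the entity has been dequeued */
	bool eligible;  /* stand-in for entity_eligible(cfs_rq, se) */
	long deadline;  /* stand-in for the deadline compared by entity_before() */
};

/*
 * Toy version of the pick: "queued" holds the entities other than curr.
 * The real code walks an augmented rbtree; a linear scan is used here.
 */
struct toy_se *toy_pick(struct toy_se *curr, struct toy_se **queued, size_t n)
{
	struct toy_se *best = NULL;
	size_t i;

	/* curr is dropped if it is going to sleep or is not eligible */
	if (curr && (!curr->on_rq || !curr->eligible))
		curr = NULL;

	/* earliest-deadline eligible entity among the queued ones */
	for (i = 0; i < n; i++) {
		if (!queued[i]->eligible)
			continue;
		if (!best || queued[i]->deadline < best->deadline)
			best = queued[i];
	}

	/* same shape as the "found:" label in the quoted snippet */
	if (!best || (curr && curr->deadline < best->deadline))
		best = curr;

	return best;
}
```

In this model, a sleeping "curr" with one eligible queued entity yields that
entity, while an ineligible "curr" combined with an ineligible queued entity
yields NULL even though two entities are accounted for. The latter is the
"eligibility check going sideways" case.
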
Chenyu had proposed a similar fix long back in
https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@xxxxxxxxx/
but the consensus was it was covering up a larger problem which
then boiled down to avg_vruntime being computed incorrectly
https://lore.kernel.org/lkml/ZiAWTU5xb%2FJMn%2FHs@chenyu5-mobl2/


return best; <--- return NULL

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbdca89c677f..cd1188b7f3df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8748,6 +8748,8 @@ static void set_next_buddy(struct sched_entity *se)
              return;
          if (se_is_idle(se))
              return;
+        if (se->sched_delayed)
+            return;

I tried to put a SCHED_WARN_ON() here to track where this comes from, and
it seems like it is usually from attach_task() in the load balancing path
pulling a delayed task which is set as the next buddy in
check_preempt_wakeup_fair().

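For illustration only, the guard from the hunk above can be modelled in
user space roughly as follows. The names toy_buddy_se, toy_cfs_rq, and
toy_set_next_buddy() are hypothetical, and the real set_next_buddy() walks
up the entity hierarchy, which this flat sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened stand-ins for the scheduler structures. */
struct toy_buddy_se {
	bool on_rq;          /* entity is queued */
	bool idle;           /* stand-in for se_is_idle(se) */
	bool sched_delayed;  /* dequeue of this entity has been delayed */
};

struct toy_cfs_rq {
	struct toy_buddy_se *next;  /* the "next buddy" hint */
};

/* Record se as the next buddy unless one of the guards rejects it. */
void toy_set_next_buddy(struct toy_cfs_rq *cfs_rq, struct toy_buddy_se *se)
{
	if (!se->on_rq)
		return;
	if (se->idle)
		return;
	if (se->sched_delayed)  /* the check the patch adds */
		return;

	cfs_rq->next = se;
}
```

With this guard, a delayed entity handed to toy_set_next_buddy() leaves
"next" untouched, so a later pick cannot trip over a delayed buddy.
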
Can you please try the following diff instead of the first two patches
and see if you still hit these warnings, stalls, and pick_eevdf()
returning NULL?

Tested. Running specjbb with NEXT_BUDDY enabled, the warnings, stalls, and panic disappear.

Thank you for testing. I'll let Peter come back on which approach he
prefers :)


Regards,
-adam

--
Thanks and Regards,
Prateek