On 11/29/2024 12:28 PM, K Prateek Nayak wrote:
if curr->sched_delayed == 1, the condition: '(curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))'I also see PSI splats like:IIUC, one path for pick_eevdf() to return NULL is:
psi: inconsistent task state! task=2524:kworker/u1028:2 cpu=154 psi_flags=10 clear=14 set=0
but the PSI flags it has set "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL)" and
the flags it is trying to clear
"(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL | TSK_RUNNING)" seem to be only
possible if you have picked a dequeued entity for running before its
wakeup, which is also perhaps why the "nr_running" computation goes awry
and pick_eevdf() returns NULL (which it should never since
pick_next_entity() is only called when rq->cfs.nr_running is > 0)
pick_eevdf():
<snip>
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL; <--- curr is set to NULL
"on_rq" is only cleared when the entity is dequeued so "curr" is in fact
going to sleep (proper sleep) and we've reached at pick_eevdf(),
otherwise, if "curr" is not eligible, there is at least one more tasks
on the cfs_rq which implies best has be found and will be non-null.
can be true. Please correct me if wrong.
Thanks for the information.<snip>
found:
if (!best || (curr && entity_before(curr, best)))
best = curr; <--- curr and best are both NULL
Say "curr" is going to sleep, and there is no "best", in which case
"curr" is already blocked and "cfs_rq->nr_running" should be 0 and it
should have not reached pick_eevdf() in the first place since
pick_next_entity() is only called by pick_task_fair() if
"cfs_rq->nr_running" is non-zero.
So as long as "cfs_rq->nr_running" is non-zero, pick_eevdf() should
return a valid runnable entity. Failure to do so perhaps points to
"entity_eligible()" check going sideways somewhere or a bug in
"nr_running" accounting.
Chenyu had proposed a similar fix long back in
https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@xxxxxxxxx/
but the consensus was it was covering up a larger problem which
then boiled down to avg_vruntime being computed incorrectly
https://lore.kernel.org/lkml/ZiAWTU5xb%2FJMn%2FHs@chenyu5-mobl2/
From the timeline, it seems the issue is before 152e11f6df29 ("sched/fair: Implement delayed dequeue").
DELAY_DEQUEUE may introduce risk for pick_eevdf() return NULL.
After patch 1 ("Fix warning if NEXT_BUDDY enabled"), the NULL pointer panic disappears.
Patch 2 ("Fix panic if pick_eevdf() returns NULL") is a safe guard.
Thanks,
-adam