Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled

From: K Prateek Nayak
Date: Fri Nov 29 2024 - 03:01:50 EST


Hello Adam,

On 11/29/2024 1:10 PM, Adam Li wrote:
On 11/29/2024 12:28 PM, K Prateek Nayak wrote:

I also see PSI splats like:

     psi: inconsistent task state! task=2524:kworker/u1028:2 cpu=154 psi_flags=10 clear=14 set=0

but the PSI flags it has set "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL)" and
the flags it is trying to clear
"(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL | TSK_RUNNING)" seem to be only
possible if you have picked a dequeued entity for running before its
wakeup, which is also perhaps why the "nr_running" computation goes awry
and pick_eevdf() returns NULL (which it should never since
pick_next_entity() is only called when rq->cfs.nr_running is > 0)
IIUC, one path for pick_eevdf() to return NULL is:
pick_eevdf():
<snip>
    if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
        curr = NULL; <--- curr is set to NULL

"on_rq" is only cleared when the entity is dequeued so "curr" is in fact
going to sleep (proper sleep) and we've reached at pick_eevdf(),
otherwise, if "curr" is not eligible, there is at least one more tasks
on the cfs_rq which implies best has be found and will be non-null.

if curr->sched_delayed == 1, the condition: '(curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))'
can be true. Please correct me if wrong.

If curr is sched_delayed, that means it is going to sleep and it is
ineligible but it can only be ineligible if there is at least one more
task with a lower vruntime and hence best must be found. A delayed
entity will also not decrement the "nr_running" and it'll be queued back
from put_prev_entity() to be picked off later.

Essentially, I believe curr can never be ineligible in absence of other
eligible tasks.


<snip>
found:
    if (!best || (curr && entity_before(curr, best)))
        best = curr; <--- curr and best are both NULL

Say "curr" is going to sleep, and there is no "best", in which case
"curr" is already blocked and "cfs_rq->nr_running" should be 0 and it
should have not reached pick_eevdf() in the first place since
pick_next_entity() is only called by pick_task_fair() if
"cfs_rq->nr_running" is non-zero.

So as long as "cfs_rq->nr_running" is non-zero, pick_eevdf() should
return a valid runnable entity. Failure to do so perhaps points to
"entity_eligible()" check going sideways somewhere or a bug in
"nr_running" accounting.

Chenyu had proposed a similar fix long back in
https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@xxxxxxxxx/
but the consensus was it was covering up a larger problem which
then boiled down to avg_vruntime being computed incorrectly
https://lore.kernel.org/lkml/ZiAWTU5xb%2FJMn%2FHs@chenyu5-mobl2/

Thanks for the information.
From the timeline, it seems the issue is before 152e11f6df29 ("sched/fair: Implement delayed dequeue").
DELAY_DEQUEUE may introduce risk for pick_eevdf() return NULL.

Ideally it shouldn't since delayed entities are still captured in
"cfs_rq->nr_running" and they'll eventually become eligible and be
picked off but I may be wrong and I hope someone corrects me in which
case :)


After patch 1 ("Fix warning if NEXT_BUDDY enabled"), the NULL pointer panic disappears.
Patch 2 ("Fix panic if pick_eevdf() returns NULL") is a safe guard.

Thanks,
-adam


--
Thanks and Regards,
Prateek