Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled
From: Adam Li
Date: Fri Nov 29 2024 - 02:40:45 EST
On 11/29/2024 12:28 PM, K Prateek Nayak wrote:
>>> I also see PSI splats like:
>>>
>>> psi: inconsistent task state! task=2524:kworker/u1028:2 cpu=154 psi_flags=10 clear=14 set=0
>>>
>>> but the PSI flags it has set "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL)" and
>>> the flags it is trying to clear
>>> "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL | TSK_RUNNING)" seem to be only
>>> possible if you have picked a dequeued entity for running before its
>>> wakeup, which is also perhaps why the "nr_running" computation goes awry
>>> and pick_eevdf() returns NULL (which it should never since
>>> pick_next_entity() is only called when rq->cfs.nr_running is > 0)
>> IIUC, one path for pick_eevdf() to return NULL is:
>> pick_eevdf():
>> <snip>
>> if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>> curr = NULL; <--- curr is set to NULL
>
> "on_rq" is only cleared when the entity is dequeued so "curr" is in fact
> going to sleep (proper sleep) and we've reached at pick_eevdf(),
> otherwise, if "curr" is not eligible, there is at least one more tasks
> on the cfs_rq which implies best has be found and will be non-null.
>
if curr->sched_delayed == 1, the condition: '(curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))'
can be true. Please correct me if wrong.
>> <snip>
>> found:
>> if (!best || (curr && entity_before(curr, best)))
>> best = curr; <--- curr and best are both NULL
>
> Say "curr" is going to sleep, and there is no "best", in which case
> "curr" is already blocked and "cfs_rq->nr_running" should be 0 and it
> should have not reached pick_eevdf() in the first place since
> pick_next_entity() is only called by pick_task_fair() if
> "cfs_rq->nr_running" is non-zero.
>
> So as long as "cfs_rq->nr_running" is non-zero, pick_eevdf() should
> return a valid runnable entity. Failure to do so perhaps points to
> "entity_eligible()" check going sideways somewhere or a bug in
> "nr_running" accounting.
>
> Chenyu had proposed a similar fix long back in
> https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@xxxxxxxxx/
> but the consensus was it was covering up a larger problem which
> then boiled down to avg_vruntime being computed incorrectly
> https://lore.kernel.org/lkml/ZiAWTU5xb%2FJMn%2FHs@chenyu5-mobl2/
>
Thanks for the information.
>From the timeline, it seems the issue is before 152e11f6df29 ("sched/fair: Implement delayed dequeue").
DELAY_DEQUEUE may introduce risk for pick_eevdf() return NULL.
After patch 1 ("Fix warning if NEXT_BUDDY enabled"), the NULL pointer panic disappears.
Patch 2 ("Fix panic if pick_eevdf() returns NULL") is a safe guard.
Thanks,
-adam