Re: [REGRESSION] Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

From: Peter Zijlstra
Date: Fri Oct 04 2024 - 08:35:29 EST


On Fri, Oct 04, 2024 at 04:40:08PM +0530, K Prateek Nayak wrote:
> Hello folks,
>
> On 10/3/2024 11:01 AM, Klaus Kudielka wrote:
> > On Sun, 2024-09-22 at 16:45 +0100, Chris Bainbridge wrote:
> > > On Fri, Aug 30, 2024 at 02:34:56PM +0200, Bert Karwatzki wrote:
> > > > Since linux next-20240820 the following messages appears when booting:
> > > >
> > > > [    T1] smp: Bringing up secondary CPUs ...
> > > > [    T1] smpboot: x86: Booting SMP configuration:
> > > > [    T1] .... node  #0, CPUs:        #2  #4  #6  #8 #10 #12 #14  #1
> > > > This is the line I'm concerend about:
> > > > [    T1] psi: inconsistent task state! task=61:cpuhp/3 cpu=0 psi_flags=4 clear=0 set=4
> > > > [    T1]   #3  #5  #7  #9 #11 #13 #15
> > > > [    T1] Spectre V2 : Update user space SMT mitigation: STIBP always-on
> > > > [    T1] smp: Brought up 1 node, 16 CPUs
> > > > [    T1] smpboot: Total of 16 processors activated (102216.16 BogoMIPS)
> > > >
> > > > I bisected this to commit 152e11f6df29 ("sched/fair: Implement delayed dequeue").
> > > > Is this normal or is this something I should worry about?
> > > >
> > > > Bert Karwatzki
> > >
> > > I am also getting a similar error on boot, and bisected it to the same commit:
> > >
> > > [    0.342931] psi: inconsistent task state! task=15:rcu_tasks_trace cpu=0 psi_flags=4 clear=0 set=4
> > >
> > > #regzbot introduced: 152e11f6df293e816a6a37c69757033cdc72667d
> >
> > Just another data point, while booting 6.12-rc1 on a Turris Omnia:
> >
> > [ 0.000000] Linux version 6.12.0-rc1 (XXX) (arm-linux-gnueabihf-gcc (Debian 14.2.0-1) 14.2.0, GNU ld (GNU Binutils for Debian) 2.43.1) #1 SMP Thu Oct 3 06:59:25 CEST 2024
> > [ 0.000000] CPU: ARMv7 Processor [414fc091] revision 1 (ARMv7), cr=10c5387d
> > [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
> > [ 0.000000] OF: fdt: Machine model: Turris Omnia
> > ...
> > [ 0.000867] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
> > [ 0.000876] psi: inconsistent task state! task=2:kthreadd cpu=0 psi_flags=4 clear=0 set=4
> >
>
> Not sure if someone took a stab at this but I haven't seen the "psi:

I'm aware of the issue, but since it's just statistics and not
anything 'important', I've been spending my time on those crashing bugs.

> inconsistent task state" warning with the below diff. I'm not sure if my
> approach is right which if why I'm pasting the diff before sending out
> an official series. Any comments or testing is greatly appreciated.
>
> The diff is based on:
>
> git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent
>
> at commit d4ac164bde7a ("sched/eevdf: Fix wakeup-preempt by checking
> cfs_rq->nr_running")

Thanks, I just pushed all that out to tip/sched/urgent.

> My approach was as follows:
>
> o psi_dequeue() relied on psi_sched_switch() to set the PSI flags
> appropriately for a dequeued task. However, psi_sched_switch() used
> "!task_on_rq_queued(prev)" to judge if the prev task is blocked which
> is now untrue with DELAYED_DEQUEUE. Fix it by checking
> "p->se.sched_delayed" as well. I also added a matching check for
> ENQUEUE_DELAYED for psi_enqueue().

We already determine the whole sleep state earlier, the whole having
called block_task() is a clue, perhaps we should propagate that state
instead of trying to divinate it again.

> o With the above, the warning was put off for a few more seconds but it
> still appeared. I dumped all PSI flag transition along with
> "tsk->se.sched_delayed" to see what trips it and I saw the following
> state changes for the task that finally tripped it:
>
> psi: task state: task=18:rcu_preempt cpu=0 psi_flags=0 clear=0 set=0 delayed=1
> psi: task state: task=18:rcu_preempt cpu=128 psi_flags=0 clear=0 set=4 delayed=1
> psi: task state: task=18:rcu_preempt cpu=128 psi_flags=4 clear=0 set=4 delayed=0
> psi: inconsistent task state! task=18:rcu_preempt cpu=128 psi_flags=4 clear=0 set=4 delayed=0
>
> Note that cpu switched with "tsk->se.sched_delayed" still set which
> got me looking at the task migration path. The warning added below
> in "deactivate_task()" tripped without fail, just before the PSI
> warning was logged.
>
> To prevent migration of a delayed entity (XXX: Is it a good idea?)

It is not. By migrating the entities they can get picked sooner and the
delayed thing gets removed sooner. Less 'hidden' weight.

> we do a "account_task_dequeue()" in the delayed dequeue case to
> remove the task from the "rq->cfs_list", thus removing it from the
> purview of the load balancer.

Anyway, assuming PSI wants to preserve current semantics, does something
like the below work?

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43e453ab7e20..0d766fb9fbc4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2012,7 +2012,7 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);

- if (!(flags & ENQUEUE_RESTORE)) {
+ if (!(flags & ENQUEUE_RESTORE) && !p->se.sched_delayed) {
sched_info_enqueue(rq, p);
psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
}
@@ -2039,7 +2039,7 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & DEQUEUE_NOCLOCK))
update_rq_clock(rq);

- if (!(flags & DEQUEUE_SAVE)) {
+ if (!(flags & DEQUEUE_SAVE) && !p->se.sched_delayed) {
sched_info_dequeue(rq, p);
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}
@@ -6537,6 +6537,7 @@ static void __sched notrace __schedule(int sched_mode)
* as a preemption by schedule_debug() and RCU.
*/
bool preempt = sched_mode > SM_NONE;
+ bool block = false;
unsigned long *switch_count;
unsigned long prev_state;
struct rq_flags rf;
@@ -6622,6 +6623,7 @@ static void __sched notrace __schedule(int sched_mode)
* After this, schedule() must not care about p->state any more.
*/
block_task(rq, prev, flags);
+ block = true;
}
switch_count = &prev->nvcsw;
}
@@ -6667,7 +6669,7 @@ static void __sched notrace __schedule(int sched_mode)

migrate_disable_switch(rq, prev);
psi_account_irqtime(rq, prev, next);
- psi_sched_switch(prev, next, !task_on_rq_queued(prev));
+ psi_sched_switch(prev, next, block);

trace_sched_switch(preempt, prev, next, prev_state);