Re: [REGRESSION] Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

From: K Prateek Nayak
Date: Fri Oct 04 2024 - 13:01:34 EST


Hello Peter,

On 10/4/2024 6:05 PM, Peter Zijlstra wrote:
On Fri, Oct 04, 2024 at 04:40:08PM +0530, K Prateek Nayak wrote:
Hello folks,

On 10/3/2024 11:01 AM, Klaus Kudielka wrote:
On Sun, 2024-09-22 at 16:45 +0100, Chris Bainbridge wrote:
On Fri, Aug 30, 2024 at 02:34:56PM +0200, Bert Karwatzki wrote:
Since linux next-20240820 the following messages appears when booting:

[    T1] smp: Bringing up secondary CPUs ...
[    T1] smpboot: x86: Booting SMP configuration:
[    T1] .... node  #0, CPUs:        #2  #4  #6  #8 #10 #12 #14  #1
This is the line I'm concerend about:
[    T1] psi: inconsistent task state! task=61:cpuhp/3 cpu=0 psi_flags=4 clear=0 set=4
[    T1]   #3  #5  #7  #9 #11 #13 #15
[    T1] Spectre V2 : Update user space SMT mitigation: STIBP always-on
[    T1] smp: Brought up 1 node, 16 CPUs
[    T1] smpboot: Total of 16 processors activated (102216.16 BogoMIPS)

I bisected this to commit 152e11f6df29 ("sched/fair: Implement delayed dequeue").
Is this normal or is this something I should worry about?

Bert Karwatzki

I am also getting a similar error on boot, and bisected it to the same commit:

[    0.342931] psi: inconsistent task state! task=15:rcu_tasks_trace cpu=0 psi_flags=4 clear=0 set=4

#regzbot introduced: 152e11f6df293e816a6a37c69757033cdc72667d

Just another data point, while booting 6.12-rc1 on a Turris Omnia:

[ 0.000000] Linux version 6.12.0-rc1 (XXX) (arm-linux-gnueabihf-gcc (Debian 14.2.0-1) 14.2.0, GNU ld (GNU Binutils for Debian) 2.43.1) #1 SMP Thu Oct 3 06:59:25 CEST 2024
[ 0.000000] CPU: ARMv7 Processor [414fc091] revision 1 (ARMv7), cr=10c5387d
[ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[ 0.000000] OF: fdt: Machine model: Turris Omnia
...
[ 0.000867] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[ 0.000876] psi: inconsistent task state! task=2:kthreadd cpu=0 psi_flags=4 clear=0 set=4


Not sure if someone took a stab at this but I haven't seen the "psi:

I'm aware of the issue, but since it's just statistics and not
anything 'important', I've been spending my time on those crashing bugs.

inconsistent task state" warning with the below diff. I'm not sure if my
approach is right which if why I'm pasting the diff before sending out
an official series. Any comments or testing is greatly appreciated.

The diff is based on:

git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent

at commit d4ac164bde7a ("sched/eevdf: Fix wakeup-preempt by checking
cfs_rq->nr_running")

Thanks, I just pushed all that out to tip/sched/urgent.

My approach was as follows:

o psi_dequeue() relied on psi_sched_switch() to set the PSI flags
appropriately for a dequeued task. However, psi_sched_switch() used
"!task_on_rq_queued(prev)" to judge if the prev task is blocked which
is now untrue with DELAYED_DEQUEUE. Fix it by checking
"p->se.sched_delayed" as well. I also added a matching check for
ENQUEUE_DELAYED for psi_enqueue().

We already determine the whole sleep state earlier, the whole having
called block_task() is a clue, perhaps we should propagate that state
instead of trying to divinate it again.

Yup that makes sense!


o With the above, the warning was put off for a few more seconds but it
still appeared. I dumped all PSI flag transition along with
"tsk->se.sched_delayed" to see what trips it and I saw the following
state changes for the task that finally tripped it:

psi: task state: task=18:rcu_preempt cpu=0 psi_flags=0 clear=0 set=0 delayed=1
psi: task state: task=18:rcu_preempt cpu=128 psi_flags=0 clear=0 set=4 delayed=1
psi: task state: task=18:rcu_preempt cpu=128 psi_flags=4 clear=0 set=4 delayed=0
psi: inconsistent task state! task=18:rcu_preempt cpu=128 psi_flags=4 clear=0 set=4 delayed=0

Note that cpu switched with "tsk->se.sched_delayed" still set which
got me looking at the task migration path. The warning added below
in "deactivate_task()" tripped without fail, just before the PSI
warning was logged.

To prevent migration of a delayed entity (XXX: Is it a good idea?)

It is not. By migrating the entities they can get picked sooner and the
delayed thing gets removed sooner. Less 'hidden' weight.

True that! I was thinking moving queued load could also potentially help
delayed entities being picked faster on the rq where they were delayed.
Both seem to help in one way or the other but I don't have any solid
data to conclusively say which might be better.


we do a "account_task_dequeue()" in the delayed dequeue case to
remove the task from the "rq->cfs_list", thus removing it from the
purview of the load balancer.

Anyway, assuming PSI wants to preserve current semantics, does something
like the below work?

I've updated the details from my testing on the parallel thread by
Johannes.

tl;dr I still see PSI warnings, some more tinkering on top of your
changes altered the warning to "psi: task underflow!". So far, no
luck figuring out how that comes about.

Thank you for taking a look and for the quick patch!

--
Thanks and Regards,
Prateek


---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43e453ab7e20..0d766fb9fbc4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2012,7 +2012,7 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);
- if (!(flags & ENQUEUE_RESTORE)) {
+ if (!(flags & ENQUEUE_RESTORE) && !p->se.sched_delayed) {
sched_info_enqueue(rq, p);
psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
}
@@ -2039,7 +2039,7 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & DEQUEUE_NOCLOCK))
update_rq_clock(rq);
- if (!(flags & DEQUEUE_SAVE)) {
+ if (!(flags & DEQUEUE_SAVE) && !p->se.sched_delayed) {
sched_info_dequeue(rq, p);
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}
@@ -6537,6 +6537,7 @@ static void __sched notrace __schedule(int sched_mode)
* as a preemption by schedule_debug() and RCU.
*/
bool preempt = sched_mode > SM_NONE;
+ bool block = false;
unsigned long *switch_count;
unsigned long prev_state;
struct rq_flags rf;
@@ -6622,6 +6623,7 @@ static void __sched notrace __schedule(int sched_mode)
* After this, schedule() must not care about p->state any more.
*/
block_task(rq, prev, flags);
+ block = true;
}
switch_count = &prev->nvcsw;
}
@@ -6667,7 +6669,7 @@ static void __sched notrace __schedule(int sched_mode)
migrate_disable_switch(rq, prev);
psi_account_irqtime(rq, prev, next);
- psi_sched_switch(prev, next, !task_on_rq_queued(prev));
+ psi_sched_switch(prev, next, block);
trace_sched_switch(preempt, prev, next, prev_state);