Re: [REGRESSION] Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

From: K Prateek Nayak
Date: Fri Oct 04 2024 - 12:44:22 EST


Hello Johannes, Peter,

On 10/4/2024 7:27 PM, Johannes Weiner wrote:
On Fri, Oct 04, 2024 at 02:35:06PM +0200, Peter Zijlstra wrote:
On Fri, Oct 04, 2024 at 04:40:08PM +0530, K Prateek Nayak wrote:
Hello folks,

On 10/3/2024 11:01 AM, Klaus Kudielka wrote:
On Sun, 2024-09-22 at 16:45 +0100, Chris Bainbridge wrote:
On Fri, Aug 30, 2024 at 02:34:56PM +0200, Bert Karwatzki wrote:
Since linux next-20240820 the following messages appears when booting:

[    T1] smp: Bringing up secondary CPUs ...
[    T1] smpboot: x86: Booting SMP configuration:
[    T1] .... node  #0, CPUs:        #2  #4  #6  #8 #10 #12 #14  #1
This is the line I'm concerend about:
[    T1] psi: inconsistent task state! task=61:cpuhp/3 cpu=0 psi_flags=4 clear=0 set=4
[    T1]   #3  #5  #7  #9 #11 #13 #15
[    T1] Spectre V2 : Update user space SMT mitigation: STIBP always-on
[    T1] smp: Brought up 1 node, 16 CPUs
[    T1] smpboot: Total of 16 processors activated (102216.16 BogoMIPS)

I bisected this to commit 152e11f6df29 ("sched/fair: Implement delayed dequeue").
Is this normal or is this something I should worry about?

Bert Karwatzki

I am also getting a similar error on boot, and bisected it to the same commit:

[    0.342931] psi: inconsistent task state! task=15:rcu_tasks_trace cpu=0 psi_flags=4 clear=0 set=4

#regzbot introduced: 152e11f6df293e816a6a37c69757033cdc72667d

Just another data point, while booting 6.12-rc1 on a Turris Omnia:

[ 0.000000] Linux version 6.12.0-rc1 (XXX) (arm-linux-gnueabihf-gcc (Debian 14.2.0-1) 14.2.0, GNU ld (GNU Binutils for Debian) 2.43.1) #1 SMP Thu Oct 3 06:59:25 CEST 2024
[ 0.000000] CPU: ARMv7 Processor [414fc091] revision 1 (ARMv7), cr=10c5387d
[ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[ 0.000000] OF: fdt: Machine model: Turris Omnia
...
[ 0.000867] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[ 0.000876] psi: inconsistent task state! task=2:kthreadd cpu=0 psi_flags=4 clear=0 set=4


Not sure if someone took a stab at this but I haven't seen the "psi:

I'm aware of the issue, but since it's just statistics and not
anything 'important', I've been spending my time on those crashing bugs.

inconsistent task state" warning with the below diff. I'm not sure if my
approach is right which if why I'm pasting the diff before sending out
an official series. Any comments or testing is greatly appreciated.

This fixes the bug for me.

Thank you for testing :)


Anyway, assuming PSI wants to preserve current semantics, does something
like the below work?

This doesn't. But it's a different corruption now:

[ 2.298408] psi: inconsistent task state! task=24:cpuhp/1 cpu=1 psi_flags=10 clear=14 set=0

I hit the same log (clear 14, set 0) and I tried the below changes on
top of Peter's diff:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d766fb9fbc4..9cf3d4359994 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2012,9 +2012,10 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);
- if (!(flags & ENQUEUE_RESTORE) && !p->se.sched_delayed) {
+ if (!(flags & ENQUEUE_RESTORE) && (!p->se.sched_delayed || (flags & ENQUEUE_DELAYED))) {
sched_info_enqueue(rq, p);
- psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
+ psi_enqueue(p, ((flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)) ||
+ (flags & ENQUEUE_DELAYED));
}
p->sched_class->enqueue_task(rq, p, flags);
--

... but it just changes the warning to:

psi: task underflow! cpu=65 t=0 tasks=[0 0 0 0] clear=1 set=4
psi: task underflow! cpu=31 t=0 tasks=[0 0 1 0] clear=1 set=0

Doing a dump_stack(), I see it come from psi_enqueue() and
psi_ttwu_dequeue() and I see "clear=1" as the common theme. I've
stared at it for a while but I'm at a loss currently. If something
jumps out, I'll update here.

Thank you again both for taking a look.

--
Thanks and Regards,
Prateek


It's psi_sched_switch(.sleep=true) trying to clear the running state
and the queued state, but finds only the running state set.

I don't think it's an erroneous dequeue. __schedule() has that
block_task() dequeue before the switch, but the DEQUEUE_SLEEP makes it
a no-op and leaves the combined update to psi_sched_switch().

It looks instead it's missing an enqueue callback. This triggers:

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 020d58967d4e..09a251e3986d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -922,6 +922,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
u64 now = cpu_clock(cpu);
if (next->pid) {
+ WARN_ON_ONCE(!(next->psi_flags & TSK_RUNNING));
+
psi_flags_change(next, 0, TSK_ONCPU);
/*
* Set TSK_ONCPU on @next's cgroups. If @next shares any

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43e453ab7e20..0d766fb9fbc4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2012,7 +2012,7 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);
- if (!(flags & ENQUEUE_RESTORE)) {
+ if (!(flags & ENQUEUE_RESTORE) && !p->se.sched_delayed) {
sched_info_enqueue(rq, p);
psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
}

... we must be skipping here when we shouldn't.

I tried moving it past ->enqueue_task(), like we did for uclamp, to
get the enqueue when sched_delayed is cleared by the callback and task
is considered properly queued again. However, that results in yet
another problem:

[ 4.624776] psi: inconsistent task state! task=161:systemd-ssh-gen cpu=1 psi_flags=15 clear=14 set=1

This is a psi_sched_switch() trying to clear running|queued and set
iowait. Task is already running|queued|iowait. It looks like we had a
genuine wakeup that was signaled with psi_enqueue(.wakeup=false) (so
it didn't clear the iowait).

@@ -2039,7 +2039,7 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & DEQUEUE_NOCLOCK))
update_rq_clock(rq);
- if (!(flags & DEQUEUE_SAVE)) {
+ if (!(flags & DEQUEUE_SAVE) && !p->se.sched_delayed) {
sched_info_dequeue(rq, p);
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}
@@ -6537,6 +6537,7 @@ static void __sched notrace __schedule(int sched_mode)
* as a preemption by schedule_debug() and RCU.
*/
bool preempt = sched_mode > SM_NONE;
+ bool block = false;
unsigned long *switch_count;
unsigned long prev_state;
struct rq_flags rf;
@@ -6622,6 +6623,7 @@ static void __sched notrace __schedule(int sched_mode)
* After this, schedule() must not care about p->state any more.
*/
block_task(rq, prev, flags);
+ block = true;
}
switch_count = &prev->nvcsw;
}
@@ -6667,7 +6669,7 @@ static void __sched notrace __schedule(int sched_mode)
migrate_disable_switch(rq, prev);
psi_account_irqtime(rq, prev, next);
- psi_sched_switch(prev, next, !task_on_rq_queued(prev));
+ psi_sched_switch(prev, next, block);
trace_sched_switch(preempt, prev, next, prev_state);