Re: [REGRESSION] Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
From: Johannes Weiner
Date: Fri Oct 04 2024 - 10:00:01 EST
On Fri, Oct 04, 2024 at 02:35:06PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 04, 2024 at 04:40:08PM +0530, K Prateek Nayak wrote:
> > Hello folks,
> >
> > On 10/3/2024 11:01 AM, Klaus Kudielka wrote:
> > > On Sun, 2024-09-22 at 16:45 +0100, Chris Bainbridge wrote:
> > > > On Fri, Aug 30, 2024 at 02:34:56PM +0200, Bert Karwatzki wrote:
> > > > > Since linux next-20240820 the following messages appears when booting:
> > > > >
> > > > > [ T1] smp: Bringing up secondary CPUs ...
> > > > > [ T1] smpboot: x86: Booting SMP configuration:
> > > > > [ T1] .... node #0, CPUs: #2 #4 #6 #8 #10 #12 #14 #1
> > > > > This is the line I'm concerend about:
> > > > > [ T1] psi: inconsistent task state! task=61:cpuhp/3 cpu=0 psi_flags=4 clear=0 set=4
> > > > > [ T1] #3 #5 #7 #9 #11 #13 #15
> > > > > [ T1] Spectre V2 : Update user space SMT mitigation: STIBP always-on
> > > > > [ T1] smp: Brought up 1 node, 16 CPUs
> > > > > [ T1] smpboot: Total of 16 processors activated (102216.16 BogoMIPS)
> > > > >
> > > > > I bisected this to commit 152e11f6df29 ("sched/fair: Implement delayed dequeue").
> > > > > Is this normal or is this something I should worry about?
> > > > >
> > > > > Bert Karwatzki
> > > >
> > > > I am also getting a similar error on boot, and bisected it to the same commit:
> > > >
> > > > [ 0.342931] psi: inconsistent task state! task=15:rcu_tasks_trace cpu=0 psi_flags=4 clear=0 set=4
> > > >
> > > > #regzbot introduced: 152e11f6df293e816a6a37c69757033cdc72667d
> > >
> > > Just another data point, while booting 6.12-rc1 on a Turris Omnia:
> > >
> > > [ 0.000000] Linux version 6.12.0-rc1 (XXX) (arm-linux-gnueabihf-gcc (Debian 14.2.0-1) 14.2.0, GNU ld (GNU Binutils for Debian) 2.43.1) #1 SMP Thu Oct 3 06:59:25 CEST 2024
> > > [ 0.000000] CPU: ARMv7 Processor [414fc091] revision 1 (ARMv7), cr=10c5387d
> > > [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
> > > [ 0.000000] OF: fdt: Machine model: Turris Omnia
> > > ...
> > > [ 0.000867] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
> > > [ 0.000876] psi: inconsistent task state! task=2:kthreadd cpu=0 psi_flags=4 clear=0 set=4
> > >
> >
> > Not sure if someone took a stab at this but I haven't seen the "psi:
>
> I'm aware of the issue, but since it's just statistics and not
> anything 'important', I've been spending my time on those crashing bugs.
>
> > inconsistent task state" warning with the below diff. I'm not sure if my
> > approach is right which if why I'm pasting the diff before sending out
> > an official series. Any comments or testing is greatly appreciated.
This fixes the bug for me.
> Anyway, assuming PSI wants to preserve current semantics, does something
> like the below work?
This doesn't. But it's a different corruption now:
[ 2.298408] psi: inconsistent task state! task=24:cpuhp/1 cpu=1 psi_flags=10 clear=14 set=0
It's psi_sched_switch(.sleep=true) trying to clear the running state
and the queued state, but finds only the running state set.
I don't think it's an erroneous dequeue. __schedule() has that
block_task() dequeue before the switch, but the DEQUEUE_SLEEP makes it
a no-op and leaves the combined update to psi_sched_switch().
It looks instead it's missing an enqueue callback. This triggers:
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 020d58967d4e..09a251e3986d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -922,6 +922,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
u64 now = cpu_clock(cpu);
if (next->pid) {
+ WARN_ON_ONCE(!(next->psi_flags & TSK_RUNNING));
+
psi_flags_change(next, 0, TSK_ONCPU);
/*
* Set TSK_ONCPU on @next's cgroups. If @next shares any
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 43e453ab7e20..0d766fb9fbc4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2012,7 +2012,7 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> if (!(flags & ENQUEUE_NOCLOCK))
> update_rq_clock(rq);
>
> - if (!(flags & ENQUEUE_RESTORE)) {
> + if (!(flags & ENQUEUE_RESTORE) && !p->se.sched_delayed) {
> sched_info_enqueue(rq, p);
> psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
> }
... we must be skipping here when we shouldn't.
I tried moving it past ->enqueue_task(), like we did for uclamp, to
get the enqueue when sched_delayed is cleared by the callback and task
is considered properly queued again. However, that results in yet
another problem:
[ 4.624776] psi: inconsistent task state! task=161:systemd-ssh-gen cpu=1 psi_flags=15 clear=14 set=1
This is a psi_sched_switch() trying to clear running|queued and set
iowait. Task is already running|queued|iowait. It looks like we had a
genuine wakeup that was signaled with psi_enqueue(.wakeup=false) (so
it didn't clear the iowait).
> @@ -2039,7 +2039,7 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> if (!(flags & DEQUEUE_NOCLOCK))
> update_rq_clock(rq);
>
> - if (!(flags & DEQUEUE_SAVE)) {
> + if (!(flags & DEQUEUE_SAVE) && !p->se.sched_delayed) {
> sched_info_dequeue(rq, p);
> psi_dequeue(p, flags & DEQUEUE_SLEEP);
> }
> @@ -6537,6 +6537,7 @@ static void __sched notrace __schedule(int sched_mode)
> * as a preemption by schedule_debug() and RCU.
> */
> bool preempt = sched_mode > SM_NONE;
> + bool block = false;
> unsigned long *switch_count;
> unsigned long prev_state;
> struct rq_flags rf;
> @@ -6622,6 +6623,7 @@ static void __sched notrace __schedule(int sched_mode)
> * After this, schedule() must not care about p->state any more.
> */
> block_task(rq, prev, flags);
> + block = true;
> }
> switch_count = &prev->nvcsw;
> }
> @@ -6667,7 +6669,7 @@ static void __sched notrace __schedule(int sched_mode)
>
> migrate_disable_switch(rq, prev);
> psi_account_irqtime(rq, prev, next);
> - psi_sched_switch(prev, next, !task_on_rq_queued(prev));
> + psi_sched_switch(prev, next, block);
>
> trace_sched_switch(preempt, prev, next, prev_state);