Re: [REGRESSION] Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
From: K Prateek Nayak
Date: Fri Oct 04 2024 - 07:10:41 EST
Hello folks,
On 10/3/2024 11:01 AM, Klaus Kudielka wrote:
On Sun, 2024-09-22 at 16:45 +0100, Chris Bainbridge wrote:
On Fri, Aug 30, 2024 at 02:34:56PM +0200, Bert Karwatzki wrote:
Since linux-next-20240820, the following messages appear when booting:
[ T1] smp: Bringing up secondary CPUs ...
[ T1] smpboot: x86: Booting SMP configuration:
[ T1] .... node #0, CPUs: #2 #4 #6 #8 #10 #12 #14 #1
This is the line I'm concerned about:
[ T1] psi: inconsistent task state! task=61:cpuhp/3 cpu=0 psi_flags=4 clear=0 set=4
[ T1] #3 #5 #7 #9 #11 #13 #15
[ T1] Spectre V2 : Update user space SMT mitigation: STIBP always-on
[ T1] smp: Brought up 1 node, 16 CPUs
[ T1] smpboot: Total of 16 processors activated (102216.16 BogoMIPS)
I bisected this to commit 152e11f6df29 ("sched/fair: Implement delayed dequeue").
Is this normal or is this something I should worry about?
Bert Karwatzki
I am also getting a similar error on boot, and bisected it to the same commit:
[ 0.342931] psi: inconsistent task state! task=15:rcu_tasks_trace cpu=0 psi_flags=4 clear=0 set=4
#regzbot introduced: 152e11f6df293e816a6a37c69757033cdc72667d
Just another data point, while booting 6.12-rc1 on a Turris Omnia:
[ 0.000000] Linux version 6.12.0-rc1 (XXX) (arm-linux-gnueabihf-gcc (Debian 14.2.0-1) 14.2.0, GNU ld (GNU Binutils for Debian) 2.43.1) #1 SMP Thu Oct 3 06:59:25 CEST 2024
[ 0.000000] CPU: ARMv7 Processor [414fc091] revision 1 (ARMv7), cr=10c5387d
[ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[ 0.000000] OF: fdt: Machine model: Turris Omnia
...
[ 0.000867] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[ 0.000876] psi: inconsistent task state! task=2:kthreadd cpu=0 psi_flags=4 clear=0 set=4
Not sure if someone has already taken a stab at this, but I haven't seen
the "psi: inconsistent task state" warning with the below diff. I'm not
sure if my approach is right, which is why I'm pasting the diff before
sending out an official series. Any comments or testing are greatly
appreciated.
The diff is based on:
git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent
at commit d4ac164bde7a ("sched/eevdf: Fix wakeup-preempt by checking
cfs_rq->nr_running")
My approach was as follows:
o psi_dequeue() relied on psi_sched_switch() to set the PSI flags
appropriately for a dequeued task. However, psi_sched_switch() used
"!task_on_rq_queued(prev)" to judge whether the prev task is blocked,
which is no longer true with DELAYED_DEQUEUE. Fix it by also checking
"p->se.sched_delayed". I also added a matching check for
ENQUEUE_DELAYED in psi_enqueue(). (For reference, see the
psi_enqueue()/psi_dequeue() sketch right below.)
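For reference, the PSI wrappers in kernel/sched/stats.h look roughly
like this (abbreviated from my tree; note how psi_dequeue() skips
clearing TSK_RUNNING/TSK_IOWAIT for a sleeping task and defers that
to psi_task_switch()):

	static inline void psi_enqueue(struct task_struct *p, bool wakeup)
	{
		int clear = 0, set = TSK_RUNNING;

		if (static_branch_likely(&psi_disabled))
			return;

		if (p->in_memstall)
			set |= TSK_MEMSTALL_RUNNING;

		if (!wakeup) {
			if (p->in_memstall)
				set |= TSK_MEMSTALL;
		} else {
			if (p->in_iowait)
				clear |= TSK_IOWAIT;
		}

		psi_task_change(p, clear, set);
	}

	static inline void psi_dequeue(struct task_struct *p, bool sleep)
	{
		if (static_branch_likely(&psi_disabled))
			return;

		/*
		 * A voluntary sleep is a dequeue followed by a task switch.
		 * To avoid walking all ancestors twice, psi_task_switch()
		 * handles TSK_RUNNING and TSK_IOWAIT for us when it moves
		 * TSK_ONCPU. Do nothing here.
		 */
		if (sleep)
			return;

		psi_task_change(p, p->psi_flags, 0);
	}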
o With the above, the warning was put off for a few more seconds, but
it still appeared. I dumped all PSI flag transitions along with
"tsk->se.sched_delayed" to see what trips it, and I saw the following
state changes for the task that finally tripped it:
psi: task state: task=18:rcu_preempt cpu=0 psi_flags=0 clear=0 set=0 delayed=1
psi: task state: task=18:rcu_preempt cpu=128 psi_flags=0 clear=0 set=4 delayed=1
psi: task state: task=18:rcu_preempt cpu=128 psi_flags=4 clear=0 set=4 delayed=0
psi: inconsistent task state! task=18:rcu_preempt cpu=128 psi_flags=4 clear=0 set=4 delayed=0
Note that the cpu changed with "tsk->se.sched_delayed" still set,
which got me looking at the task migration path. The warning added
below in "deactivate_task()" tripped without fail, just before the
PSI warning was logged.
To prevent migration of a delayed entity (XXX: Is it a good idea?),
we do an "account_task_dequeue()" in the delayed dequeue case to
remove the task from "rq->cfs_tasks", thus removing it from the
purview of the load balancer (see the detach_tasks() excerpt below).
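For context, the load balancer only pulls tasks it finds on
"rq->cfs_tasks", so taking a delayed task off that list keeps it out
of reach; abbreviated from detach_tasks() in kernel/sched/fair.c:

	static int detach_tasks(struct lb_env *env)
	{
		struct list_head *tasks = &env->src_rq->cfs_tasks;
		struct task_struct *p;
		...
		while (!list_empty(tasks)) {
			...
			/* Migration candidates come exclusively off rq->cfs_tasks */
			p = list_last_entry(tasks, struct task_struct, se.group_node);
			...
		}
		...
	}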
o With the above change, I only managed to trip the deactivate_task()
WARN_ON(), immediately followed by the PSI warning, in the NUMA
balancing path:
------------[ cut here ]------------
p->se.sched_delayed
WARNING: CPU: 75 PID: 473 at kernel/sched/core.c:2075 deactivate_task+0xa6/0xc0
Modules linked in: ...
CPU: 75 UID: 0 PID: 473 Comm: migration/75 Not tainted 6.12.0-rc1-peterz-sched-urgent-psi-fix+ #32
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
Stopper: multi_cpu_stop+0x0/0x110 <- migrate_swap+0xd7/0x150
RIP: 0010:deactivate_task+0xa6/0xc0
Code: ...
RSP: 0018:ffff9f210e12fdc0 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
RDX: ffff90116ffa18c8 RSI: 0000000000000001 RDI: ffff90116ffa18c0
RBP: ffff8fd2d8559ac0 R08: 0000000000000003 R09: 0000000000000000
R10: 64656863732e6573 R11: 646579616c65645f R12: ffff90116ffb6500
R13: ffff90116ffb6500 R14: 0000000000000004 R15: ffff8fd2f28f433c
FS: 0000000000000000(0000) GS:ffff90116ff80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055748a34e000 CR3: 000000807974e004 CR4: 0000000000f70ef0
PKRU: 55555554
Call Trace:
<TASK>
? __warn+0x88/0x130
? deactivate_task+0xa6/0xc0
? report_bug+0x18e/0x1a0
? prb_read_valid+0x1b/0x30
? handle_bug+0x5b/0xa0
? exc_invalid_op+0x18/0x70
? asm_exc_invalid_op+0x1a/0x20
? deactivate_task+0xa6/0xc0
__migrate_swap_task.part.0+0xbe/0x180
migrate_swap_stop+0x1b6/0x1f0
multi_cpu_stop+0x6e/0x110
? __pfx_multi_cpu_stop+0x10/0x10
cpu_stopper_thread+0x97/0x160
? __pfx_smpboot_thread_fn+0x10/0x10
smpboot_thread_fn+0xdd/0x1d0
kthread+0xd3/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x34/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
---[ end trace 0000000000000000 ]---
From some logging, I can say the "dst_task" is the one that is
delayed, but I could not go up the stack to find out how it is
chosen for the swap. For this RFC, I just block the delayed
entity in "__migrate_swap_task()" and set "p->wake_cpu"
to redirect the next wakeup to the appropriate NUMA node (see the
excerpt below for how "p->wake_cpu" takes effect).
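This mirrors what __migrate_swap_task() already does for a task that
is no longer on the rq, and the next wakeup then starts CPU selection
from "p->wake_cpu"; both abbreviated from kernel/sched/core.c in my
tree (the select_task_rq() call shape may differ slightly between
versions):

	/* __migrate_swap_task(), existing branch for a sleeping task: */
	} else {
		/*
		 * Task isn't running anymore; make it appear like we migrated
		 * it before it went to sleep. This means on wakeup we make the
		 * previous CPU our target instead of where it really is.
		 */
		p->wake_cpu = cpu;
	}

	/* try_to_wake_up(): CPU selection starts from p->wake_cpu */
	cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);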
I haven't encountered any warnings on my machine for a while now, but
I haven't tested any fancy cgroup scenarios yet either; mileage may
vary :)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43e453ab7e20..b55b52b081ba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2014,7 +2014,9 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_RESTORE)) {
sched_info_enqueue(rq, p);
- psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
+ /* Delayed tasks are considered dequeued by PSI tracking */
+ psi_enqueue(p, ((flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)) ||
+ (flags & ENQUEUE_DELAYED));
}
p->sched_class->enqueue_task(rq, p, flags);
@@ -2069,6 +2071,9 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
SCHED_WARN_ON(flags & DEQUEUE_SLEEP);
+ /* Delayed tasks should not be migrated */
+ SCHED_WARN_ON(p->se.sched_delayed);
+
WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
ASSERT_EXCLUSIVE_WRITER(p->on_rq);
@@ -3298,9 +3303,21 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
struct rq_flags srf, drf;
src_rq = task_rq(p);
- dst_rq = cpu_rq(cpu);
-
rq_pin_lock(src_rq, &srf);
+
+ if (p->se.sched_delayed) {
+ block_task(src_rq, p, DEQUEUE_DELAYED);
+ rq_unpin_lock(src_rq, &srf);
+
+ /*
+ * Make it appear we last ran on the preferred
+ * node. See the comment below.
+ */
+ p->wake_cpu = cpu;
+ return;
+ }
+
+ dst_rq = cpu_rq(cpu);
rq_pin_lock(dst_rq, &drf);
deactivate_task(src_rq, p, 0);
@@ -6667,7 +6684,14 @@ static void __sched notrace __schedule(int sched_mode)
migrate_disable_switch(rq, prev);
psi_account_irqtime(rq, prev, next);
- psi_sched_switch(prev, next, !task_on_rq_queued(prev));
+
+ /*
+ * psi_task_switch() is responsible for clearing TSK_RUNNING
+ * and TSK_IOWAIT which psi_dequeue() skips for a task going
+ * to sleep (see comment there). Consider a delayed entity
+ * as one that has gone to sleep for PSI accounting.
+ */
+ psi_sched_switch(prev, next, !task_on_rq_queued(prev) || prev->se.sched_delayed);
trace_sched_switch(preempt, prev, next, prev_state);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab497fafa7be..cf02d202ab0c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3661,18 +3661,40 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu)
#endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_SMP
+
+static void
+account_task_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ struct rq *rq = rq_of(cfs_rq);
+
+ account_numa_enqueue(rq, task_of(se));
+ list_add(&se->group_node, &rq->cfs_tasks);
+}
+
+static void
+account_task_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+ list_del_init(&se->group_node);
+}
+
+#else
+
+static void
+account_task_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
+static void
+account_task_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
+
+#endif
+
static void
account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
update_load_add(&cfs_rq->load, se->load.weight);
-#ifdef CONFIG_SMP
- if (entity_is_task(se)) {
- struct rq *rq = rq_of(cfs_rq);
+ if (entity_is_task(se) && !se->sched_delayed)
+ account_task_enqueue(cfs_rq, se);
- account_numa_enqueue(rq, task_of(se));
- list_add(&se->group_node, &rq->cfs_tasks);
- }
-#endif
cfs_rq->nr_running++;
if (se_is_idle(se))
cfs_rq->idle_nr_running++;
@@ -3682,12 +3704,11 @@ static void
account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
update_load_sub(&cfs_rq->load, se->load.weight);
-#ifdef CONFIG_SMP
- if (entity_is_task(se)) {
- account_numa_dequeue(rq_of(cfs_rq), task_of(se));
- list_del_init(&se->group_node);
- }
-#endif
+
+ /* Delayed tasks are already dequeued the first time */
+ if (entity_is_task(se) && !se->sched_delayed)
+ account_task_dequeue(cfs_rq, se);
+
cfs_rq->nr_running--;
if (se_is_idle(se))
cfs_rq->idle_nr_running--;
@@ -6943,6 +6964,10 @@ requeue_delayed_entity(struct sched_entity *se)
update_load_avg(cfs_rq, se, 0);
se->sched_delayed = 0;
+
+ if (entity_is_task(se))
+ account_task_enqueue(cfs_rq, se);
+
}
/*
@@ -7190,10 +7215,18 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
*/
static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
- if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
+ struct sched_entity *se = &p->se;
+
+ if (!(se->sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
util_est_dequeue(&rq->cfs, p);
- if (dequeue_entities(rq, &p->se, flags) < 0) {
+ if (dequeue_entities(rq, se, flags) < 0) {
+ /*
+ * Remove delayed entity from rq->cfs_tasks list
+ * to prevent load balancer from migrating it
+ * away.
+ */
+ account_task_dequeue(cfs_rq_of(se), se);
util_est_update(&rq->cfs, p, DEQUEUE_SLEEP);
return false;
}
--
The above changes were arrived at through experimentation. If there
are no obvious objections, I'll send a clean series after some more
testing. Any and all comments are highly appreciated.
--
Thanks and Regards,
Prateek