Re: [PATCH 3/3] sched/core: Indicate a sched_delayed task was migrated before wakeup

From: K Prateek Nayak
Date: Thu Oct 10 2024 - 12:22:39 EST


Hello Johannes,

On 10/10/2024 6:33 PM, Johannes Weiner wrote:
Hi Prateek,

patches 1 and 2 make obvious sense to me.

On Thu, Oct 10, 2024 at 08:28:38AM +0000, K Prateek Nayak wrote:
@@ -129,6 +129,13 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
if (static_branch_likely(&psi_disabled))
return;
+ /*
+ * Delayed task is not ready to run yet!
+ * Wait for a requeue before accounting.
+ */
+ if (p->se.sched_delayed)
+ return;

This one is problematic. It clears sleeping state (memstall, iowait)
during the dequeue of the migration but doesn't restore it until the
wakeup, which could presumably be much later. This leaves a gap in the
accounting.

psi really wants the dequeue and enqueue of the migration, even when a
task is delay-dequeued. We just have to get the context parsing right
to not confuse migration queues with wakeups.

I see! I was confused in my interpretation of the expectations. Sorry
about that! Can you please give the attached patch a try on top of the
first two patches of series. It survived my (decently stressy) stress
test but it is not as extensive the ones you have :)


I'll try to come up with a suitable solution as well, please don't
apply this one for now.

Thank you Peter for holding off the patches and sorry for the late
surprise!

--
Thanks and Regards,
PrateekFrom 4ad25e7e6a0eb539c5544a81bb3b22ca7cc05fe0 Mon Sep 17 00:00:00 2001
From: K Prateek Nayak <kprateek.nayak@xxxxxxx>
Date: Thu, 10 Oct 2024 15:31:41 +0000
Subject: [PATCH v1.1 3/3] sched/core: Move PSI flags when delayed task is migrated

Since sched_delayed tasks remain on "rq->cfs_tasks" list even after
blocking, they can be migrated from one runqueue to another in a delayed
state by the load balancer. When they are eventually requeued or woken
up on the same CPU via the try_to_wake_up() path, the eventual
activation is clueless about the migration. This trips the PSI
accounting since, in terms of enqueue flags, PSI only considers the
following as a wakeup for PSI accounting:

(flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)

This can lead inconsistent PSI task state splat similar to:

psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=. set=4

Migrate the PSI flags as well when delayed tasks are migrated. Since
psi_dequeue() clears p->psi_flags, use the same to cache the original
PSI flags and requeue it after enqueue during migration.

Note: Delayed tasks will not track TSK_RUNNING or TSK_MEMSTALL_RUNNING
signals despite being queued on the runqueue , however they'll still
track and migrate TSK_IOWAIT and TSK_MEMSTALL signals.

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Closes: https://lore.kernel.org/lkml/20240830123458.3557-1-spasswolf@xxxxxx/
Closes: https://lore.kernel.org/all/cd67fbcd-d659-4822-bb90-7e8fbb40a856@xxxxxxxxxxxxx/
Link: https://lore.kernel.org/lkml/20241010130316.GA181795@xxxxxxxxxxx/
Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
---
Changelog v1..v1.1:

o Alternate approach that migrates the PSI flags too when the delayed
task is migrated by the load balancer.
---
kernel/sched/core.c | 9 ++++++---
kernel/sched/stats.h | 22 ++++++++++++++++++++++
2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 52be38021ebb..c3ac63d3eb9d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2012,10 +2012,8 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);

- if (!(flags & ENQUEUE_RESTORE)) {
+ if (!(flags & ENQUEUE_RESTORE))
sched_info_enqueue(rq, p);
- psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
- }

p->sched_class->enqueue_task(rq, p, flags);
/*
@@ -2023,6 +2021,11 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
* ->sched_delayed.
*/
uclamp_rq_inc(rq, p);
+ if (!(flags & ENQUEUE_RESTORE)) {
+ /* Tasks can only remain dealyed after an enqueue if they are migrating */
+ WARN_ON_ONCE(p->se.sched_delayed && !task_on_rq_migrating(p));
+ psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
+ }

if (sched_core_enabled(rq))
sched_core_enqueue(rq, p);
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 237780aa3c53..d267a187304c 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -129,6 +129,15 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
if (static_branch_likely(&psi_disabled))
return;

+ if (p->se.sched_delayed) {
+ unsigned int flags = p->psi_flags;
+
+ /* Restore the state saved by psi_dequeue() */
+ p->psi_flags = 0;
+ psi_task_change(p, 0, flags);
+ return;
+ }
+
if (p->in_memstall)
set |= TSK_MEMSTALL_RUNNING;

@@ -145,9 +154,14 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)

static inline void psi_dequeue(struct task_struct *p, bool sleep)
{
+ unsigned int flags = p->psi_flags;
+
if (static_branch_likely(&psi_disabled))
return;

+ /* Delayed task can only be dequeued for migration. */
+ WARN_ON_ONCE(p->se.sched_delayed && sleep);
+
/*
* A voluntary sleep is a dequeue followed by a task switch. To
* avoid walking all ancestors twice, psi_task_switch() handles
@@ -158,6 +172,14 @@ static inline void psi_dequeue(struct task_struct *p, bool sleep)
return;

psi_task_change(p, p->psi_flags, 0);
+
+ /*
+ * Since the delayed entity is migrating, the PSI flags need to
+ * be migrated too. Since p->psi_flags is not cleared, cache the
+ * previous state there for psi_enqueue() to restore.
+ */
+ if (p->se.sched_delayed)
+ p->psi_flags = flags;
}

static inline void psi_ttwu_dequeue(struct task_struct *p)
--
2.34.1