Re: [PATCH] psi: Fix race when task wakes up before psi_sched_switch() adjusts flags

From: Chengming Zhou
Date: Thu Dec 26 2024 - 05:43:34 EST


Hi,

On 2024/12/26 13:34, K Prateek Nayak wrote:
When running hackbench in a cgroup with bandwidth throttling enabled,
the following PSI splat was observed:

psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4
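
For anyone decoding the splat by hand, here is a minimal sketch. It assumes
the splat prints psi_flags, clear and set in hexadecimal, and that the bit
layout matches the TSK_* definitions in include/linux/psi_types.h of recent
kernels; neither is stated in this mail, so treat both as assumptions. The
helper psi_flags_decode() is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed bit layout, mirroring include/linux/psi_types.h (not from this mail). */
#define TSK_IOWAIT		(1u << 0)
#define TSK_MEMSTALL		(1u << 1)
#define TSK_RUNNING		(1u << 2)
#define TSK_MEMSTALL_RUNNING	(1u << 3)
#define TSK_ONCPU		(1u << 4)

/*
 * Hypothetical helper: write the names of the bits set in @flags into
 * @buf, separated by '|', lowest bit first. Returns @buf.
 */
static const char *psi_flags_decode(unsigned int flags, char *buf, size_t len)
{
	static const struct {
		unsigned int bit;
		const char *name;
	} names[] = {
		{ TSK_IOWAIT,		"TSK_IOWAIT" },
		{ TSK_MEMSTALL,		"TSK_MEMSTALL" },
		{ TSK_RUNNING,		"TSK_RUNNING" },
		{ TSK_MEMSTALL_RUNNING,	"TSK_MEMSTALL_RUNNING" },
		{ TSK_ONCPU,		"TSK_ONCPU" },
	};
	size_t i;

	assert(buf && len > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (!(flags & names[i].bit))
			continue;
		/* strncat() always NUL-terminates; leave room for it. */
		if (buf[0])
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, names[i].name, len - strlen(buf) - 1);
	}
	return buf;
}
```

Under those assumptions, psi_flags=14 reads as 0x14 and set=4 as 0x4, i.e.
the wakeup tried to set a bit that was already set in the task's flags.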

When investigating the series of events leading up to the splat, the
following sequence was observed:
[008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
...
[008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
[008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
# CPU8 goes into newidle balance and releases the rq lock
...
# CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
[015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)

I have a question here: why is TSK_ONCPU not set in psi_flags if
the task hasn't reached psi_sched_switch() yet?

[015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
[008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...

psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
for the blocked entity. However, the following race is possible with
psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
psi_sched_switch():

Yeah, this race is introduced by delayed dequeue changes.

In the past, a sleeping task couldn't be migrated or enqueued before it
was done in __schedule() (finish_task(prev) clears prev->on_cpu).

Now, ttwu_runnable() can call enqueue_task() on the delayed-dequeue task
to make it schedulable again.

But migration is still impossible, since it's still running on this cpu,
so no psi_ttwu_dequeue(), only psi_enqueue() can happen, right?

(Actually, we could call enqueue_task() there for any sleeping task,
including those that are not delayed dequeue, if select_task_rq() returns
the same cpu as task_cpu(p), to optimize wakeup latency. I may submit a
patch for that later.)


    __schedule()
      rq_lock(rq)
        try_to_block_task(p)
          psi_dequeue()
          [ psi_task_switch() is responsible
            for adjusting the PSI flags ]
        put_prev_entity(&p->se)                try_to_wake_up(p)
        # no runnable task on rq->cfs            ...
        sched_balance_newidle()
          raw_spin_rq_unlock(rq)                 __task_rq_lock(p)
          ...                                    psi_enqueue()/psi_ttwu_dequeue() [Woops!]
                                                 __task_rq_unlock(p)
          raw_spin_rq_lock(rq)
        ...
        [ p was re-enqueued or has migrated away ]

Here ttwu_runnable() calls enqueue_task() for the delayed-dequeue task.

Migration can't happen since p->on_cpu is still true.

        ...
        psi_task_switch() [Too late!]
      raw_spin_rq_unlock(rq)

The wakeup context will see the flags for a running task when the flags
should have reflected the task being blocked. Similarly, a migration
context in the wakeup path can clear the flags that psi_sched_switch()
assumes will be set (TSK_ONCPU / TSK_RUNNING).

In this ttwu_runnable() -> enqueue_task() case, I think psi_enqueue()
should do nothing at all.

Why? Because psi_dequeue() is deferred to psi_sched_switch(), so from
the PSI POV this task hasn't gone to sleep at all, and psi_enqueue()
should NOT change any state either. (It's not a wakeup or a migration
from the PSI POV.)
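
To make the idea concrete, here is a toy user-space model, not kernel code.
The struct toy_task, the toy_*() helpers and the "still on CPU" test are all
made up for illustration; in particular, which condition psi_enqueue()
should really check is left open above, and on_cpu is only my assumption
for this sketch. The TSK_* values are assumed to match psi_types.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values, mirroring include/linux/psi_types.h. */
#define TSK_RUNNING	(1u << 2)
#define TSK_ONCPU	(1u << 4)

/* Toy stand-in for the few task_struct fields the model needs. */
struct toy_task {
	unsigned int psi_flags;
	bool on_cpu;	/* models p->on_cpu: task not yet switched out */
};

/*
 * Toy version of the consistency check: setting a bit that is already
 * set, or clearing one that is not set, counts as inconsistent. Unlike
 * the kernel, this model refuses the change and returns false.
 */
static bool toy_flags_change(struct toy_task *p, unsigned int clear,
			     unsigned int set)
{
	if ((p->psi_flags & set) || (p->psi_flags & clear) != clear)
		return false;

	p->psi_flags = (p->psi_flags & ~clear) | set;
	return true;
}

/* Models an enqueue that always treats the task as freshly woken. */
static bool toy_enqueue_unconditional(struct toy_task *p)
{
	return toy_flags_change(p, 0, TSK_RUNNING);
}

/*
 * Models the suggestion: if the task never went to sleep from the PSI
 * POV (the deferred dequeue has not been processed yet), do nothing.
 */
static bool toy_enqueue_skip_on_cpu(struct toy_task *p)
{
	if (p->on_cpu)
		return true;

	return toy_flags_change(p, 0, TSK_RUNNING);
}
```

With psi_flags still at TSK_ONCPU | TSK_RUNNING, the unconditional variant
trips the check while the skipping variant leaves the flags untouched, so
the later switch-time accounting sees exactly what it expects.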

And the current code of "psi_sched_switch(prev, next, block);" looks
buggy to me too! The "block" value comes from try_to_block_task(), but
pick_next_task() may then drop and reacquire the rq lock, so we can't
use the stale value for psi_sched_switch().

Previously we used "task_on_rq_queued(prev)"; now we also have to
consider the delayed dequeue case, so it should be:

"!task_on_rq_queued(prev) || prev->se.sched_delayed"
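
A small user-space sketch of how that expression behaves for the three
states that matter here. The toy_* types and helpers are made up for
illustration and only mimic the fields involved; TASK_ON_RQ_QUEUED is
assumed to be 1 as in kernel/sched/sched.h.

```c
#include <assert.h>
#include <stdbool.h>

#define TASK_ON_RQ_QUEUED	1	/* assumed value, as in kernel/sched/sched.h */

/* Toy stand-ins for the fields the expression looks at. */
struct toy_se {
	bool sched_delayed;
};

struct toy_task {
	int on_rq;
	struct toy_se se;
};

static bool toy_task_on_rq_queued(const struct toy_task *p)
{
	return p->on_rq == TASK_ON_RQ_QUEUED;
}

/*
 * Evaluate the "sleep" argument at switch time, after the rq lock may
 * have been dropped and retaken, instead of reusing an earlier value.
 */
static bool toy_psi_sleep(const struct toy_task *prev)
{
	return !toy_task_on_rq_queued(prev) || prev->se.sched_delayed;
}
```

A fully dequeued task and a delayed-dequeue task both count as sleeping,
while a task that a racing wakeup has already requeued (queued and no
longer delayed) does not.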

Thanks!


Since the TSK_ONCPU flag has to be modified with the rq lock of
task_cpu() held, use a combination of task_cpu() and TSK_ONCPU checks to
prevent the race. Specifically:

o psi_enqueue() will clear the TSK_ONCPU flag when it encounters one.
psi_enqueue() will only be called with TSK_ONCPU set when the task is
being requeued on the same CPU. If the task was migrated,
psi_ttwu_dequeue() would have already cleared the PSI flags.

psi_enqueue() cannot guarantee that this same task will be picked
again when the scheduling CPU returns from newidle balance which is
why it clears the TSK_ONCPU to mimic a net result of sleep + wakeup
without migration.

o When psi_sched_switch() observes that prev's task_cpu() has changed or
the TSK_ONCPU flag is not set, a wakeup has raced with
psi_sched_switch() trying to adjust the dequeue flags. If next is the
same as prev, psi_sched_switch() now has to set the TSK_ONCPU flag
again. Otherwise, psi_enqueue() or psi_ttwu_dequeue() would have
already adjusted the PSI flags and no further changes are required
to prev's PSI flags.

With the introduction of DELAY_DEQUEUE, the requeue path is considerably
shortened and, with the addition of bandwidth throttling in the
__schedule() path, the race window is large enough to observe this
issue.

Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
---
This patch is based on tip:sched/core at commit af98d8a36a96
("sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug")

Reproducer for the PSI splat:

mkdir /sys/fs/cgroup/test
echo $$ > /sys/fs/cgroup/test/cgroup.procs
# Ridiculous limit on SMP to throttle multiple rqs at once
echo "50000 100000" > /sys/fs/cgroup/test/cpu.max
perf bench sched messaging -t -p -l 100000 -g 16

This worked reliably on my 3rd Generation EPYC System (2 x 64C/128T) and
also on a 32 vCPU VM.
---
kernel/sched/core.c | 7 ++++-
kernel/sched/psi.c | 65 ++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/stats.h | 16 ++++++++++-
3 files changed, 83 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84902936a620..9bbe51e44e98 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6717,6 +6717,12 @@ static void __sched notrace __schedule(int sched_mode)
rq->last_seen_need_resched_ns = 0;
#endif
+ /*
+ * PSI might have to deal with the consequences of newidle balance
+ * possibly dropping the rq lock and prev being requeued and selected.
+ */
+ psi_sched_switch(prev, next, block);
+
if (likely(prev != next)) {
rq->nr_switches++;
/*
@@ -6750,7 +6756,6 @@ static void __sched notrace __schedule(int sched_mode)
migrate_disable_switch(rq, prev);
psi_account_irqtime(rq, prev, next);
- psi_sched_switch(prev, next, block);
trace_sched_switch(preempt, prev, next, prev_state);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 84dad1511d1e..c355a6189595 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -917,9 +917,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep)
{
struct psi_group *group, *common = NULL;
- int cpu = task_cpu(prev);
+ int prev_cpu, cpu;
+
+ /* No race between psi_dequeue() and now */
+ if (prev == next && (prev->psi_flags & TSK_ONCPU))
+ return;
+
+ prev_cpu = task_cpu(prev);
+ cpu = smp_processor_id();
if (next->pid) {
+ /*
+ * If next == prev but TSK_ONCPU is cleared, the task was
+ * requeued when newidle balance dropped the rq lock and
+ * psi_enqueue() cleared the TSK_ONCPU flag.
+ */
psi_flags_change(next, 0, TSK_ONCPU);
/*
* Set TSK_ONCPU on @next's cgroups. If @next shares any
@@ -928,8 +940,13 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
*/
group = task_psi_group(next);
do {
- if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
- PSI_ONCPU) {
+ /*
+ * Since newidle balance can drop the rq lock (see the next comment)
+ * there is a possibility of try_to_wake_up() migrating prev away
+ * before reaching here. Do not find common if task has migrated.
+ */
+ if (prev_cpu == cpu &&
+ (per_cpu_ptr(group->pcpu, cpu)->state_mask & PSI_ONCPU)) {
common = group;
break;
}
@@ -938,6 +955,48 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
} while ((group = group->parent));
}
+ /*
+ * When a task is blocked, psi_dequeue() leaves the PSI flag
+ * adjustments to psi_task_switch(). However, there is a possibility of
+ * the rq lock being dropped in the interim and the task being woken up
+ * again before psi_task_switch() is called leading to psi_enqueue()
+ * seeing the flags for a running task. Specifically, the following
+ * scenario is possible:
+ *
+ *   __schedule()
+ *     rq_lock(rq)
+ *       try_to_block_task(p)
+ *         psi_dequeue()
+ *         [ psi_task_switch() is responsible
+ *           for adjusting the PSI flags ]
+ *       put_prev_entity(&p->se)                try_to_wake_up(p)
+ *       # no runnable task on rq->cfs            ...
+ *       sched_balance_newidle()
+ *         raw_spin_rq_unlock(rq)                 __task_rq_lock(p)
+ *         ...                                    psi_enqueue()/psi_ttwu_dequeue() [Woops!]
+ *                                                __task_rq_unlock(p)
+ *         raw_spin_rq_lock(rq)
+ *       ...
+ *       [ p was re-enqueued or has migrated away ]
+ *       ...
+ *       psi_task_switch() [Too late!]
+ *     raw_spin_rq_unlock(rq)
+ *
+ * In the above case, psi_enqueue() can see the p->psi_flags state
+ * before it is adjusted to account for dequeue in psi_task_switch(),
+ * or psi_ttwu_dequeue() can clear the p->psi_flags which
+ * psi_task_switch() tries to adjust assuming that the entity has just
+ * finished running.
+ *
+ * Since TSK_ONCPU has to be adjusted holding task CPU's rq lock, use
+ * the combination of TSK_ONCPU and task_cpu(p) to catch the race
+ * between psi_task_switch() and psi_enqueue() / psi_ttwu_dequeue().
+ * Since psi_enqueue() / psi_ttwu_dequeue() would have set the correct
+ * flags already for prev on this CPU, skip adjusting flags.
+ */
+ if (prev == next || prev_cpu != cpu || !(prev->psi_flags & TSK_ONCPU))
+ return;
+
if (prev->pid) {
int clear = TSK_ONCPU, set = 0;
bool wake_clock = true;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8ee0add5a48a..f09903165456 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -138,7 +138,21 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
if (flags & ENQUEUE_RESTORE)
return;
- if (p->se.sched_delayed) {
+ if (p->psi_flags & TSK_ONCPU) {
+ /*
+ * psi_enqueue() can race with psi_task_switch() where
+ * TSK_ONCPU will be still set for the task (see the
+ * comment in psi_task_switch())
+ *
+ * Reaching here with TSK_ONCPU is only possible when
+ * the task is being enqueued on the same CPU. Since
+ * psi_task_switch() has not had the chance to adjust
+ * the flags yet, just clear the TSK_ONCPU which yields
+ * the same result as sleep + wakeup without migration.
+ */
+ SCHED_WARN_ON(flags & ENQUEUE_MIGRATED);
+ clear = TSK_ONCPU;
+ } else if (p->se.sched_delayed) {
/* CPU migration of "sleeping" task */
SCHED_WARN_ON(!(flags & ENQUEUE_MIGRATED));
if (p->in_memstall)

base-commit: af98d8a36a963e758e84266d152b92c7b51d4ecb