Re: [PATCH] psi: Fix race when task wakes up before psi_sched_switch() adjusts flags

From: Chengming Zhou
Date: Thu Dec 26 2024 - 23:40:24 EST


On 2024/12/27 12:10, K Prateek Nayak wrote:
Hello there,

[...]

Just made a quick fix and it passed testing with your script.

Thank you! The diff seems to have been malformed by whitespace damage, but
I was able to test it by recreating the diff. Feel free to add:

Reported-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
Closes: https://lore.kernel.org/lkml/20241226053441.1110-1-kprateek.nayak@xxxxxxx/
Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>

If you can give your sign-off, I could add a commit message and send it on
your behalf too.

Great, thanks for your time!

Signed-off-by: Chengming Zhou <chengming.zhou@xxxxxxxxx>



diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e5a6bf587f9..065ac76c47f9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6641,7 +6641,6 @@ static void __sched notrace __schedule(int sched_mode)
          * as a preemption by schedule_debug() and RCU.
          */
         bool preempt = sched_mode > SM_NONE;
-       bool block = false;
         unsigned long *switch_count;
         unsigned long prev_state;
         struct rq_flags rf;
@@ -6702,7 +6701,7 @@ static void __sched notrace __schedule(int sched_mode)
                         goto picked;
                 }
         } else if (!preempt && prev_state) {
-               block = try_to_block_task(rq, prev, prev_state);
+               try_to_block_task(rq, prev, prev_state);
                 switch_count = &prev->nvcsw;
         }

@@ -6748,7 +6747,8 @@ static void __sched notrace __schedule(int sched_mode)

                 migrate_disable_switch(rq, prev);
                 psi_account_irqtime(rq, prev, next);
-               psi_sched_switch(prev, next, block);
+               psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
+                                               prev->se.sched_delayed);

                 trace_sched_switch(preempt, prev, next, prev_state);

diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8ee0add5a48a..65efe45fcc77 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -150,7 +150,7 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
                 set = TSK_RUNNING;
                 if (p->in_memstall)
                         set |= TSK_MEMSTALL | TSK_MEMSTALL_RUNNING;
-       } else {
+       } else if (!task_on_cpu(task_rq(p), p)) {

One small nit here:

If the task is on the CPU at this point, both set and clear are 0, but
psi_task_change() is still called, and I don't see it bailing out early
when it has no flags to adjust.

Yes.


Can we instead just do an early return if task_on_cpu(task_rq(p), p)
returns true? I've tested that version too and I haven't seen any
splats.

I thought it's better to preserve the current flow:

        if (restore)
                return;

        if (migrate)
                ...
        else if (wakeup)
                ...

As for the early return when task_on_cpu() is true, it looks right to me.
Anyway, it's neither a migration nor a wakeup from the PSI POV.
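
Just for illustration, a rough, untested sketch of how that early-return
variant could look. Only the task_on_cpu() check and the restore/migrate/
wakeup flow come from the discussion above; the rest of psi_enqueue() is
condensed and may not match the actual code:

static inline void psi_enqueue(struct task_struct *p, int flags)
{
        int clear = 0, set = 0;

        if (static_branch_likely(&psi_disabled))
                return;

        /* Same runqueue, nothing changed for psi */
        if (flags & ENQUEUE_RESTORE)
                return;

        /*
         * The task is still current on its CPU (a remote wakeup raced
         * with the context switch), so there are no PSI flags to adjust
         * here; psi_sched_switch() will take care of them.
         */
        if (task_on_cpu(task_rq(p), p))
                return;

        /* Other cases (e.g. sched_delayed migration) omitted for brevity. */
        if (flags & ENQUEUE_MIGRATED) {
                /* CPU migration of a runnable task */
                set = TSK_RUNNING;
                if (p->in_memstall)
                        set |= TSK_MEMSTALL | TSK_MEMSTALL_RUNNING;
        } else {
                /* Wakeup of new or sleeping task */
                if (p->in_iowait)
                        clear |= TSK_IOWAIT;
                set = TSK_RUNNING;
                if (p->in_memstall)
                        set |= TSK_MEMSTALL_RUNNING;
        }

        psi_task_change(p, clear, set);
}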

Thanks!


                 /* Wakeup of new or sleeping task */
                 if (p->in_iowait)
                         clear |= TSK_IOWAIT;