Re: [PATCH] sched/psi: Streamline the flow in psi_group_change
From: Tvrtko Ursulin
Date: Thu Nov 13 2025 - 10:30:59 EST
On 13/11/2025 15:22, Johannes Weiner wrote:
On Thu, Nov 13, 2025 at 12:22:54PM +0000, Tvrtko Ursulin wrote:
Given that psi_group_change() can be called rather frequently from the
scheduler task-switching code, let's streamline it a bit to reduce the
number of loops and conditionals on the typical invocation.
First thing is that we replace the open coded mask walks with the standard
for_each_set_bit(). This makes the source code a bit more readable and
also enables usage of the efficient CPU specific zero bit skip
instructions.
In doing so we also remove the need to mask out the special TSK_ONCPU bit
from the set and clear masks, since for_each_set_bit() now directly limits
the array index to the safe range.
As the last remaining step we can now easily move the new state mask
computation to only run when required.
The end result is hopefully more readable code and a very small but
measurable reduction in task-switching CPU overhead.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: linux-kernel@xxxxxxxxxxxxxxx
---
kernel/sched/psi.c | 48 ++++++++++++++++++++--------------------------
1 file changed, 21 insertions(+), 27 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 59fdb7ebbf22..fe19aeef8dbd 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -798,39 +798,26 @@ static void psi_group_change(struct psi_group *group, int cpu,
u64 now, bool wake_clock)
{
struct psi_group_cpu *groupc;
- unsigned int t, m;
+ unsigned long t, m;
u32 state_mask;
lockdep_assert_rq_held(cpu_rq(cpu));
groupc = per_cpu_ptr(group->pcpu, cpu);
/*
- * Start with TSK_ONCPU, which doesn't have a corresponding
- * task count - it's just a boolean flag directly encoded in
- * the state mask. Clear, set, or carry the current state if
- * no changes are requested.
+ * TSK_ONCPU does not have a corresponding task count - it's just a
+ * boolean flag directly encoded in the state mask. Clear, set, or carry
+ * the current state if no changes are requested.
+ *
+ * The rest of the state mask is calculated based on the task counts.
+ * Update those first, then construct the mask.
*/
- if (unlikely(clear & TSK_ONCPU)) {
- state_mask = 0;
- clear &= ~TSK_ONCPU;
- } else if (unlikely(set & TSK_ONCPU)) {
- state_mask = PSI_ONCPU;
- set &= ~TSK_ONCPU;
- } else {
- state_mask = groupc->state_mask & PSI_ONCPU;
- }
This doesn't look right. Without PSI_ONCPU in state_mask, the results
of test_states() will be bogus, as well as the PSI_MEM_FULL special
case for an active reclaimer on the CPU.
You are completely right. I was sure the local state_mask was not used outside the !group->enabled branch, but missed that it is an input parameter to test_states().
- /*
- * The rest of the state mask is calculated based on the task
- * counts. Update those first, then construct the mask.
- */
- for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
- if (!(m & (1 << t)))
- continue;
- if (groupc->tasks[t]) {
+ m = clear;
+ for_each_set_bit(t, &m, ARRAY_SIZE(groupc->tasks)) {
The current version relies on !!m and doesn't need the range checks
for_each_set_bit() introduces. This seems less efficient. Did you
compare the generated code?
Yes, slightly more .text, but empirically it looks like a tiny bit fewer cycles, which I thought was due to being able to use the CPU-specific optimised __ffs variants. It still bails out as soon as the last set bit "goes away", just differently.
I will need to redo the tests with the state_mask breakage fixed.
+ if (likely(groupc->tasks[t])) {
groupc->tasks[t]--;
} else if (!psi_bug) {
- printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u] clear=%x set=%x\n",
+ printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%lu tasks=[%u %u %u %u] clear=%x set=%x\n",
cpu, t, groupc->tasks[0],
groupc->tasks[1], groupc->tasks[2],
groupc->tasks[3], clear, set);
@@ -838,9 +825,9 @@ static void psi_group_change(struct psi_group *group, int cpu,
}
}
- for (t = 0; set; set &= ~(1 << t), t++)
- if (set & (1 << t))
- groupc->tasks[t]++;
+ m = set;
+ for_each_set_bit(t, &m, ARRAY_SIZE(groupc->tasks))
+ groupc->tasks[t]++;
if (!group->enabled) {
/*
@@ -853,6 +840,13 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (unlikely(groupc->state_mask & (1 << PSI_NONIDLE)))
record_times(groupc, now);
+ if (unlikely(clear & TSK_ONCPU))
+ state_mask = 0;
+ else if (unlikely(set & TSK_ONCPU))
+ state_mask = PSI_ONCPU;
+ else
+ state_mask = groupc->state_mask & PSI_ONCPU;
You moved it here, but this is the !group->enabled exception
only. What about the common case when the group is enabled?
Yep, I was blind. I will get back to you with v2 if there are still some CPU cycles to be saved.
Regards,
Tvrtko