Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC

From: Joel Fernandes
Date: Tue Mar 18 2025 - 13:01:22 EST


On Tue, Mar 18, 2025 at 06:09:42AM +0100, Andrea Righi wrote:
> On Mon, Mar 17, 2025 at 11:11:08PM +0100, Joel Fernandes wrote:
> >
> >
> > On 3/17/2025 6:30 PM, Andrea Righi wrote:
> > > On Mon, Mar 17, 2025 at 07:08:15AM -1000, Tejun Heo wrote:
> > >> Hello, Joel.
> > >>
> > >> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
> > >>> Consider that the previous CPU is cache affine to the waker's CPU and
> > >>> is idle. Currently, scx's default select function only selects the
> > >>> previous CPU in this case if a WF_SYNC request is also made to wake up
> > >>> on the waker's CPU.
> > >>>
> > >>> This means that, without WF_SYNC, the previous CPU is not considered
> > >>> even when it is idle and cache affine to the waker. This seems extreme.
> > >>> WF_SYNC is not normally passed to the wakeup path outside of some IPC
> > >>> drivers, but it is very possible that the task is cache hot on the
> > >>> previous CPU and shares cache with the waker CPU. Let's avoid too many
> > >>> migrations and select the previous CPU in such cases.
> > >> Hmm.. if !WF_SYNC:
> > >>
> > >> 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle
> > >> core in widening scopes.
> > >>
> > >> 2. If no idle core is found, pick prev_cpu if idle. If not, search for an
> > >> idle CPU in widening scopes.
> > >>
> > >> So, it is considering prev_cpu, right? I think it's preferring idle cores a
> > >> bit too much - it probably doesn't make sense to cross the NUMA boundary if
> > >> there is an idle CPU in this node, at least.
> > >
> > > Yeah, we should probably be a bit more conservative by default and avoid
> > > jumping across nodes if there are still idle CPUs within the node.
> > >
> >
> > Agreed. So maybe we check for fully idle cores *within the node* first, before
> > preferring idle SMTs *within the node*? And then, as a next step, go looking at
> > other nodes. Would that be a reasonable middle ground?
> >
> > > With the new scx_bpf_select_cpu_and() API [1] it'll be easier to enforce
> > > that while still using the built-in idle policy (since we can specify idle
> > > flags), but that doesn't preclude adjusting the default policy anyway, if
> > > it makes more sense.
> >
> > Aren't you deprecating the usage of the default select function? If we are going
> > to be adjusting its behavior like my patch is doing, then we should probably not
> > also deprecate it.
>
> I'm just extending the default select function to accept a cpumask and idle
> SCX_PICK_IDLE_* flags, so that it's easier for BPF schedulers to change the
> select behavior without reimplementing the whole thing.
>
> The old scx_bpf_select_cpu_dfl() will be remapped to the new API for a
> while for backward compatibility and the underlying selection logic remains
> the same.
>
> So, in this case for example, you could implement the "check full-idle then
> partial-idle SMT CPUs within the node" logic as follows:
>
> /* Search for full-idle SMT first, then idle CPUs within prev_cpu's node */
> cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags,
>                              p->cpus_ptr, SCX_PICK_IDLE_IN_NODE);
> if (cpu < 0) {
>         /* Search for full-idle SMT first, then idle CPUs across all nodes */
>         cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, p->cpus_ptr, 0);
> }

Thanks, Andrea! I adjusted the default selection as below; hope it looks good
now. I will test it more as well. Let me know if you have any comments.
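
As an aside, once scx_bpf_select_cpu_and() lands, I guess a BPF scheduler that
wants to make the node-first ordering explicit could do something like the
rough sketch below. The callback name and the prev_cpu fallback are just
placeholders, and I'm assuming the two calls behave as in your example above
(full-idle SMT first, then idle CPUs, scoped by SCX_PICK_IDLE_IN_NODE):

s32 BPF_STRUCT_OPS(example_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	s32 cpu;

	/* Full-idle SMT first, then any idle CPU, within prev_cpu's node */
	cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags,
				     p->cpus_ptr, SCX_PICK_IDLE_IN_NODE);
	if (cpu >= 0)
		return cpu;

	/* Widen the search to other nodes only if the local node is busy */
	cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, p->cpus_ptr, 0);
	if (cpu >= 0)
		return cpu;

	/* Nothing idle anywhere, fall back to prev_cpu */
	return prev_cpu;
}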

----------------8<-----

From: Joel Fernandes <joelagnelf@xxxxxxxxxx>
Subject: [PATCH] sched/ext: Make default idle CPU selection better

Currently, sched_ext's default CPU selection is roughly something like
this:

1. Look for FULLY IDLE CORES:
1.1. Select prev CPU (wakee) if its CORE is fully idle.
1.2. Or, pick any CPU from a fully idle CORE in the L3, then NUMA.
1.3. Or, any CPU from a fully idle CORE usable by the task.
2. Or, use PREV CPU if it is idle.
3. Or, any idle CPU in the LLC, then NUMA.
4. Or, finally, any CPU usable by the task.

This can end up selecting any idle core in the system even if that means
jumping across NUMA nodes (basically, 1.3 happens before 3).

Improve this by moving 1.3 to after 3 (so that crossing the NUMA boundary
happens only later) and also add selection of the fully idle target (waker)
core before looking for fully idle cores in the LLC/NUMA. This is similar to
what the fair scheduler does.

The new sequence is as follows:

1. Look for FULLY IDLE CORES:
1.1. Select prev CPU (wakee) if its CORE is fully idle.
1.2. Select target CPU (waker) if its CORE is fully idle and shares cache
     with prev. <- Added this.
1.3. Or, pick any CPU from a fully idle CORE in the L3, then NUMA.
2. Or, use PREV CPU if it is idle.
3. Or, any idle CPU in the LLC, then NUMA.
4. Or, any CPU from a fully idle CORE usable by the task. <- Moved down.
5. Or, finally, any CPU usable by the task.

Signed-off-by: Joel Fernandes <joelagnelf@xxxxxxxxxx>
---
kernel/sched/ext.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 5a81d9a1e31f..324e442319c7 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3558,6 +3558,16 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 			goto cpu_found;
 		}
 
+		/*
+		 * If the waker's CPU shares cache with @prev_cpu and is part
+		 * of a fully idle core, select it.
+		 */
+		if (cpus_share_cache(cpu, prev_cpu) &&
+		    cpumask_test_cpu(cpu, idle_masks.smt) &&
+		    test_and_clear_cpu_idle(cpu)) {
+			goto cpu_found;
+		}
+
 		/*
 		 * Search for any fully idle core in the same LLC domain.
 		 */
@@ -3575,13 +3585,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 			if (cpu >= 0)
 				goto cpu_found;
 		}
-
-		/*
-		 * Search for any full idle core usable by the task.
-		 */
-		cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
-		if (cpu >= 0)
-			goto cpu_found;
 	}
 
 	/*
@@ -3610,6 +3613,15 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 			goto cpu_found;
 	}
 
+	/*
+	 * Search for any full idle core usable by the task.
+	 */
+	if (sched_smt_active()) {
+		cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
+		if (cpu >= 0)
+			goto cpu_found;
+	}
+
 	/*
 	 * Search for any idle CPU usable by the task.
 	 */
--
2.43.0