Re: Re: [RFC PATCH 2/2] sched/fair: Do not specialcase SCHED_IDLE cpus in select slowpath

From: Abel Wu
Date: Tue Mar 11 2025 - 12:43:55 EST


Hi Josh,

On 3/11/25 6:38 AM, Josh Don wrote:
> Thanks Abel,

> > @@ -7481,12 +7481,13 @@ sched_balance_find_dst_group_cpu(struct sched_group *group, struct task_struct *
> >  				latest_idle_timestamp = rq->idle_stamp;
> >  				shallowest_idle_cpu = i;
> >  			}
> > -		} else if (shallowest_idle_cpu == -1 && si_cpu == -1) {
> > -			if (sched_idle_cpu(i)) {
> > -				si_cpu = i;
> > -				continue;
> > -			}
> > -
> > +		} else if (shallowest_idle_cpu == -1) {
> > +			/*
> > +			 * The SCHED_IDLE cpus do not necessarily mean anything
> > +			 * to @p due to the cgroup hierarchical behavior. But it
> > +			 * is almost certain that the wakee will get better served
> > +			 * if the cpu is less loaded.
> > +			 */
> >  			load = cpu_load(cpu_rq(i));
> >  			if (load < min_load) {
> >  				min_load = load;

> This seems reasonable due to the case you describe. However, I'm
> wondering if you considered any heuristics here to help identify when
> a target cpu should really be considered sched_idle from the
> perspective of the incoming task. For example, in your cgroup
> hierarchy, if you have a cpu currently only running tasks in your
> besteffort container (and all cpus in the system are busy running
> something), then that cpu should be considered as a good target for a
> waking task in the "guaranteed" container, and not a good target for a
> waking task in the "containerd" container. A simple way to do this

Yes, it actually cost me several days of trying hard to figure out
whether there is a proper way to achieve what you're suggesting.


Solution A
----------

Mark all the hierarchically idle task_groups by assigning each a
unique prime, and define a tree walk that starts from @n and goes up
to the root, collecting all the preemptable nodes.

struct task_group {
	/*
	 * Set to a unique prime if at least 1 ancestor is idle,
	 * otherwise set to 1.
	 */
	u64 prime;
	u64 mul;
};

/* Called by sched_group_set_idle() */
static void calculate_mul(struct task_group *tg)
{
	struct task_group *curr, *iter;

	lockdep_assert_held(&shares_mutex);
	for (curr = tg; curr->parent; curr = curr->parent) {
		list_for_each_entry(iter, &curr->parent->children, siblings) {
			/* Can't preempt non-idle siblings */
			if (!iter->idle)
				continue;
			/*
			 * For each node in the subtree rooted at @iter do:
			 *	tg->mul *= node->prime;
			 */
			traverse_subtree(tg, iter);
		}
	}
}
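
For completeness, a sketch of the traverse_subtree() walk referenced
in the comment above could look like the below (the recursive descent
over the children/siblings lists is just an illustration, untested):

/*
 * Illustrative only: fold the prime of every node in the
 * subtree rooted at @root into @tg->mul, recursing through
 * the same children/siblings lists used above.
 */
static void traverse_subtree(struct task_group *tg, struct task_group *root)
{
	struct task_group *child;

	tg->mul *= root->prime;
	list_for_each_entry(child, &root->children, siblings)
		traverse_subtree(tg, child);
}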

int sched_idle_cpu(int cpu, struct task_struct *p)
{
	/* rq::cached_mul caches rq->donor's tg::mul */
	return p->sched_task_group->mul % cpu_rq(cpu)->cached_mul == 0;
}
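
To illustrate the divisibility test with made-up numbers: if the idle
groups that @p can preempt were assigned the primes 3, 5 and 7, then
@p's tg::mul is 105, and 105 % 5 == 0 confirms the group holding prime
5 is preemptable, while a group holding prime 11 fails (105 % 11 != 0).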

With this we could even drop find_matching_se(), since it becomes
quite easy to know whether or not one sched_entity can preempt
another. But sadly task_group::mul will quickly overflow: the product
of just the first 16 primes already exceeds U64_MAX.


Solution B
----------

Since we do not require 100% accuracy, solution A can be shifted to
using Bloom filters.

struct task_group {
	u64 key;	/* A random number used for hashing */
	u64 filter;
};

/* Called by sched_group_set_idle() */
static void calculate_filter(struct task_group *tg)
{
	struct task_group *curr, *iter;

	lockdep_assert_held(&shares_mutex);
	for (curr = tg; curr->parent; curr = curr->parent) {
		list_for_each_entry(iter, &curr->parent->children, siblings) {
			/* Can't preempt non-idle siblings */
			if (!iter->idle)
				continue;
			/*
			 * For each node in the subtree rooted at @iter do:
			 *	set_bit(bloom_hash(node->key), &tg->filter);
			 */
			traverse_subtree(tg, iter);
		}
	}
}
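
Again just as an illustration, bloom_hash() could simply map the
random key to one of the 64 bit positions in tg::filter, e.g. with
hash_64() from linux/hash.h:

#include <linux/hash.h>

/* Illustrative only: pick a bit position in the 64-bit filter */
static inline unsigned int bloom_hash(u64 key)
{
	return hash_64(key, 6);	/* 6 bits -> a value in [0, 64) */
}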

int sched_idle_cpu(int cpu, struct task_struct *p)
{
	u64 filter = p->sched_task_group->filter;
	u64 cached = cpu_rq(cpu)->cached_filter;

	/* Parenthesized: == binds tighter than & in C */
	return (filter & cached) == cached;
}

False positives are possible, but their probability can be reduced by
tuning the Bloom filter setup.
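
As a rough estimate, assuming one hash bit per task_group in a 64-bit
filter with n idle groups inserted, the false-positive rate is about
1 - (1 - 1/64)^n, i.e. roughly n/64 when n is small.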

I chose the simplest way for now to work around the issue we
encountered, while I am still trying to find a way to get rid of
sched_idle_cpu(). Thoughts?

> would be to do a find_matching_se on the incoming task and the current
> on_cpu task. That does have a drawback of cgroup pointer chasing in a
> hot wakeup path, so I don't love it.

Neither do I. But if we go through find_matching_se(), I think we
would need that rq locked.


> In any case, I'm fine with the change as-is, mostly curious if you
> gave any additional thought to the case mentioned above.

Thanks!
Abel