Re: Re: [RFC PATCH 1/2] Revert "sched/fair: Make sched-idle CPU selection consistent throughout"
From: Abel Wu
Date: Thu Mar 13 2025 - 03:18:16 EST
Hi Vincent,
On 3/12/25 12:58 AM, Vincent Guittot wrote:
> On Mon, 10 Mar 2025 at 08:41, Abel Wu <wuyun.abel@xxxxxxxxxxxxx> wrote:
> >
> > This reverts commit 17346452b25b98acfb395d2a82ec2e4ad0cb7a01.
> >
> > The above commit tried to unify the selection policy between idle cpus
> > and SCHED_IDLE ones in the fast- and slow-path of select_task_rq_fair()
> > by treating them equally (although the SCHED_IDLE cpus turn out to be
> > given more preference in the slowpath). The test results seemed solid,
> > but the setup didn't take the cgroup hierarchy into account, which
> > actually caused some of our important services to be affected.
> >
> > The cgroup hierarchy in our production environment looks like below,
> > which might be common in modern containerized setups:
> >
> >                    root
> >                   /    \
> >            kubepods    system.slice
> >             /    \\         \
> >     guaranteed  besteffort   containerd
> >
> > (where a doubled edge like the '\\' to besteffort marks a SCHED_IDLE cgroup)
> >
> > A cpu is treated as SCHED_IDLE if only besteffort is running on it,
> > and such a cpu is given at least equal preference to the idle cpus
> > when deciding where to run a newly woken task. But a SCHED_IDLE cpu
> > does not necessarily mean it can be preempted soon enough to start
> > serving the wakee, and

> Could you give us more details why the SCHED_IDLE cpu which runs only
> besteffort can't be preempted soon enough ? because kubepods vs
> system.slice is not sched_idle when comparing the entities ?

Yes, exactly. What's worse is that kubepods.weight is the sum of all its
children's weights and can easily reach ~3000, while system.slice's
weight defaults to 100. The weight configuration can be optimized, but
that is another topic.
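
To put rough numbers on it (just an illustration, assuming the ~3000 vs
100 weights above and nothing else contending at the root level):

  kubepods     : 3000 / (3000 + 100) ~= 97%
  system.slice :  100 / (3000 + 100) ~=  3%

So even when kubepods only runs besteffort (SCHED_IDLE) tasks, its
entity at the root level is anything but lightweight, and a wakee from
system.slice will not preempt it right away.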

> so maybe the definition of sched_idle_cpu should be fixed instead

Yes, there are at least two ways to fix it:
1) Let sched_idle_cpu() depend on a specific task, just like Josh
   mentioned in his reply to the 2nd patch. So if sched_idle_cpu(cpu, p)
   returns true, then:

     a) this rq only contains hierarchically idle tasks, and
     b) p can preempt current immediately

   Please see my reply to Josh for the details. A rough sketch is also
   included below, after 2).
2) Or get rid of sched_idle_cpu() entirely. This needs some careful
   rework. The users of cfs_rq::h_nr_idle currently fall into two parts:

   a) select_task_rq, including sched_balance_find_dst_group_cpu()
      and select_idle_*(). The former is handled by this series by
      simply ignoring sched_idle_cpus, which should be safe since
      preferring sched_idle_cpus does not always follow the goal of
      the slowpath, which is to find a least loaded cpu to help load
      balancing. The latter is roughly designed as follows:

        - Must search within the target LLC domain, since an L3$ miss
          is much more costly than L1/L2$
        - To use the cache more wisely, start searching from the
          L1/L2$ cache-hot cpus like prev/recent_used/..
        - Cheers if an idle cpu is found, i.e. one the wakee can
          preempt immediately. This helps maximize the use of cpu
          bandwidth, hence improving total throughput
        - (?)
        - Return @target anyway, at least it might be cache hot

      It could be less optimal to simply remove sched_idle_cpu() and
      return @target when no idle cpu is available, because @target
      can be heavily loaded and the cache may no longer be hot by the
      time the wakee finally hits the cpu. So in order not to fight
      with the load balancing, shall we tiebreak on cpu_load() for
      the non-idle cpus (sketched below)?
   b) load balance: sched_balance_domains() and dequeue_entities().
      We could leave these as-is, but I would propose using h_idle
      instead: if the on_cpu task is hierarchically idle when normal
      load balancing is triggered, then we guess it is a less loaded
      cpu and can reduce the balance interval (also sketched below).
      The rationale behind this is that idle entities usually get very
      limited bandwidth if any hierarchically non-idle tasks are
      available. The heuristic may have false positives, which can
      generally be divided into 3 cases:
      (The numbers represent hierarchical shares%)

                  A           B            C
                 / \\        / \\         /  \
               99    1     50    50     80    20
                                        //  \
                                      100   (X)
      - Case A) The hierarchical share of the h_idle tasks is indeed
        small. In this case they only get scheduled to take some
        breath, and the possibility of a false positive is low enough
        to be safely ignored.

      - Case B) The h_idle & !h_idle tasks share the bandwidth equally,
        which usually means the !h_idle part has become less loaded and
        pulling some load might be preferred.

      - Case C) The hierarchical share of the h_idle tasks dominates,
        which usually means their !h_idle parents are allowed to use a
        big portion of the bandwidth. In this case speeding up the
        balance is still fine, because we could pull some !h_idle tasks
        for the most 'important' cgroup.

      So the heuristic of using rq->curr's h_idle to judge the need
      for pulling (load balancing) seems fine.
   And as a result, cfs_rq::h_nr_idle can be removed and its
   maintenance cost in the hotpath can be saved.
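
To make the above a bit more concrete, here are some rough sketches.
They are illustrative only and not against any particular tree; the
helper task_h_idle() used below is made up and simply stands for "is
this task hierarchically idle".

For 1), the task-aware check could look roughly like:

  static inline bool sched_idle_cpu(int cpu, struct task_struct *p)
  {
          struct rq *rq = cpu_rq(cpu);

          /* a) this rq only contains hierarchically idle tasks */
          if (!sched_idle_rq(rq))
                  return false;

          /*
           * b) the wakee itself is not hierarchically idle, so it
           *    can preempt the current task immediately.
           */
          return !task_h_idle(p);
  }

For the tiebreak in 2a), when the LLC scan finds no idle cpu we could
fall back to the least loaded cpu seen during the scan instead of
blindly returning @target, e.g. inside a select_idle_cpu()-like loop:

          unsigned long load, best_load = ULONG_MAX;
          int cpu, best_cpu = target;

          for_each_cpu_wrap(cpu, cpus, target) {
                  if (available_idle_cpu(cpu))
                          return cpu;     /* can run immediately */

                  /* tiebreak on cpu_load() for the non-idle cpus */
                  load = cpu_load(cpu_rq(cpu));
                  if (load < best_load) {
                          best_load = load;
                          best_cpu = cpu;
                  }
          }

          return best_cpu;

And for 2b), sched_balance_domains() could judge busyness from what the
cpu is currently running instead of from cfs_rq::h_nr_idle, something
like:

          /* instead of: busy = idle != CPU_IDLE && !sched_idle_cpu(cpu); */
          int busy = idle != CPU_IDLE && !task_h_idle(rq->curr);

Again, these are only meant to show the direction, not working code.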
Which way do you prefer? It would be much appreciated if you could shed
some light on this.

> a sched_idle_cpu should be preempted immediately otherwise it's not a
> sched idle cpu and the definition is meaningless

Agree.
Thanks!
Abel