On Wed, 16 Apr 2025 at 11:29, Shrikanth Hegde <sshegde@xxxxxxxxxxxxx> wrote:
On 4/16/25 14:46, Shrikanth Hegde wrote:
On 4/16/25 11:58, Chen, Yu C wrote:
Hi Shrikanth,
On 4/16/2025 1:30 PM, Shrikanth Hegde wrote:
On 4/16/25 09:28, Tim Chen wrote:
At load balance time, balancing of the last-level cache domains and
above needs to be serialized. The scheduler checks the atomic variable
sched_balance_running first and only then checks whether a load balance
is due. This is an expensive operation, as multiple CPUs can attempt
to acquire sched_balance_running at the same time.

On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
running an OLTP workload, 7.6% of CPU cycles are spent on the cmpxchg
of sched_balance_running. Most of the time, a balance attempt is aborted
immediately after acquiring sched_balance_running because the load
balance is not yet due.

Instead, check whether a balance is due before acquiring
sched_balance_running. This skips many useless acquisitions
of sched_balance_running and knocks the 7.6% CPU overhead in
sched_balance_domains() down to 0.05%. Throughput of the OLTP workload
improved by 11%.
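(For illustration only, a simplified before/after sketch of the
ordering change; this is not the patch verbatim, and the balance body
is elided:)

	/* Before: every CPU contends on the shared atomic first. */
	if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
		goto out;
	if (time_after_eq(jiffies, sd->last_balance + interval)) {
		/* do the balance -- usually not due, so the
		 * acquisition above was expensive and useless */
	}

	/* After: the CPU-local time check runs first, so CPUs whose
	 * balance is not due never touch the shared cache line. */
	if (time_after_eq(jiffies, sd->last_balance + interval)) {
		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
			goto out;
		/* do the balance */
	}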
Hi Tim.

The time check makes sense, especially on large systems, mainly due to
NEWIDLE balance.
Scratch the NEWLY_IDLE part of that comment.
Could you elaborate a little on this statement? There is no timeout
mechanism for NEWLY_IDLE like there is for the periodic load balancer,
right?
Yes. NEWLY_IDLE is very opportunistic.
One more point to add: a lot of the time, the CPU that acquires
sched_balance_running doesn't end up doing the load balance, since it
is not the CPU meant to do it.
This thread:
https://lore.kernel.org/all/1e43e783-55e7-417f-a1a7-503229eb163a@xxxxxxxxxxxxx/
The best thing is probably to acquire it only if this CPU has passed
the time check and is actually going to do the load balance.
This is a good point, and we might want to deal only with the periodic
load balancer rather than NEWLY_IDLE balance, because the latter is too
frequent and contention on sched_balance_running might introduce high
cache contention.
But NEWLY_IDLE doesn't serialize using sched_balance_running and can
end up consuming a lot of cycles. If we did serialize it with
sched_balance_running, it would definitely cause a lot of contention
as-is. The point was: before acquiring it, it would be better to be
sure this CPU is actually going to do the load balance; otherwise there
is a chance the actual load balance is missed.
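(A rough sketch of that idea, purely hypothetical and not a posted
patch; this_cpu_should_balance() is an assumed helper standing in for a
check along the lines of the existing should_we_balance() logic:)

	if (time_after_eq(jiffies, sd->last_balance + interval)) {
		/* Only contend for the serialization token once this
		 * CPU knows it is the one that should balance this
		 * domain. */
		if (this_cpu_should_balance(sd) &&	/* assumed helper */
		    !atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
			sched_balance_rq(cpu, rq, sd, idle,
					 &continue_balancing);
			sd->last_balance = jiffies;
			atomic_set_release(&sched_balance_running, 0);
		}
	}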
Sorry, forgot to add.
Do we really need newidle balance running all the way up to NUMA? Or is
it enough if it runs up to PKG, with the regular (idle) balance taking
care of NUMA by serializing it?
- if (sd->flags & SD_BALANCE_NEWIDLE) {
+ if (sd->flags & SD_BALANCE_NEWIDLE && !(sd->flags & SD_SERIALIZE)) {
Why not just clear SD_BALANCE_NEWIDLE in your sched domain when you set
SD_SERIALIZE?
		pulled_task = sched_balance_rq(this_cpu, this_rq,
					       sd, CPU_NEWLY_IDLE,
					       &continue_balancing);
Anyway, having a policy around SD_SERIALIZE would be a good thing.
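(A minimal sketch of that alternative, hypothetical and not a posted
patch; it assumes the clearing would live where domain flags are
finalized, e.g. sd_init() in kernel/sched/topology.c:)

	/* Hypothetical: strip newidle balancing from serialized
	 * (NUMA-level) domains when the domain is built, instead of
	 * re-checking SD_SERIALIZE on every entry into idle. */
	if (sd->flags & SD_SERIALIZE)
		sd->flags &= ~SD_BALANCE_NEWIDLE;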
thanks,
Chenyu
Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
Reported-by: Mohini Narkhede <mohini.narkhede@xxxxxxxxx>
Tested-by: Mohini Narkhede <mohini.narkhede@xxxxxxxxx>
---
kernel/sched/fair.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e580..5e5f7a770b2f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 
 		interval = get_sd_balance_interval(sd, busy);
 
-		need_serialize = sd->flags & SD_SERIALIZE;
-		if (need_serialize) {
-			if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
-				goto out;
-		}
-
 		if (time_after_eq(jiffies, sd->last_balance + interval)) {
+			need_serialize = sd->flags & SD_SERIALIZE;
+			if (need_serialize) {
+				if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
+					goto out;
+			}
+
 			if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
 				/*
 				 * The LBF_DST_PINNED logic could have changed
@@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 			}
 			sd->last_balance = jiffies;
 			interval = get_sd_balance_interval(sd, busy);
+			if (need_serialize)
+				atomic_set_release(&sched_balance_running, 0);
 		}
-		if (need_serialize)
-			atomic_set_release(&sched_balance_running, 0);
 out:
 		if (time_after(next_balance, sd->last_balance + interval)) {
 			next_balance = sd->last_balance + interval;