Re: [RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation
From: Pierre Gondois
Date: Tue Sep 30 2025 - 03:46:04 EST
Hello Hupu,
On 9/10/25 10:43, hupu wrote:
Currently, load_sum to be propagated is estimated from
(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT, which relies on
runnable_avg as an approximation. This approach can introduce precision
loss due to the shift operation, and the error may become more visible
when small tasks frequently enter and leave the queue.
This patch introduces removed.load_sum to directly accumulate
se->avg.load_sum when tasks dequeue, and uses it during load
propagation. By doing so:
a) Avoid relying on runnable_avg-based approximation and obtain
higher precision in load_sum propagation;
Indeed, the assumption (runnable_sum == load_sum) is not exactly
accurate anymore since:
static inline long se_runnable(struct sched_entity *se)
{
	if (se->sched_delayed)
		return false;

	return !!se->on_rq;
}
So obtaining load_[sum|avg] from the runnable_avg signal seems compromised.
It is possible to compute load_sum value without the runnable_signal, cf.
40f5aa4c5eae ("sched/pelt: Fix attach_entity_load_avg() corner case")
https://lore.kernel.org/all/20220414090229.342-1-kuyo.chang@xxxxxxxxxxxx/T/#u
I.e.:
+ se->avg.load_sum = se->avg.load_avg * divider;
+ if (se_weight(se) < se->avg.load_sum)
+ se->avg.load_sum = div_u64(se->avg.load_sum, se_weight(se));
+ else
+ se->avg.load_sum = 1;
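Taken out of the kernel, that reconstruction can be sketched as a standalone helper (names and plain integer parameters are simplifications for this sketch; in the kernel the values come from se->avg and se_weight(se)):

```c
#include <stdint.h>

/*
 * Sketch of the load_sum reconstruction from 40f5aa4c5eae, outside
 * the kernel.
 */
static uint64_t reconstruct_load_sum(uint64_t load_avg, uint64_t weight,
				     uint64_t divider)
{
	uint64_t load_sum = load_avg * divider;

	/* Unweight the signal, clamping to 1 to avoid a zero load_sum. */
	if (weight < load_sum)
		load_sum /= weight;
	else
		load_sum = 1;

	return load_sum;
}
```

The else branch is the corner case fixed by that commit: a high-weight task (e.g. weight=88761) with load_avg=1 would otherwise end up with load_sum=0 while its load_avg is non-zero.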
As a side note, as a counterpart of the above patch, the higher the nice
value, the lower the weight (in sched_prio_to_weight[]) and the lower the
task's load signal.
This means that the unweighted load_sum value loses granularity.
E.g.:
A task with weight=15 can have load_avg values in [0:15]. So all the values
for load_sum in the range [X * (47742/15) : (X + 1) * (47742/15)]
are floored to load_avg=X, but load_sum is not reset when computing load_avg.
attach_entity_load_avg() however resets load_sum to X * (47742/15).
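With weight=15 (nice 19) and divider=47742, the flooring can be checked numerically; this is a minimal sketch of the weighting step, not the kernel's ___update_load_avg():

```c
#include <stdint.h>

/* load_avg = weight * load_sum / divider, as in ___update_load_avg(). */
static uint64_t pelt_load_avg(uint64_t weight, uint64_t load_sum,
			      uint64_t divider)
{
	return weight * load_sum / divider;
}
```

All load_sum values from 0 up to 47742/15 ~= 3182 floor to load_avg=0, the next ~3183 values to load_avg=1, and so on: a single load_avg value stands for roughly 3183 distinct load_sum values.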
b) Eliminate precision loss from the shift operation, especially
with frequent short-lived tasks;
It might also help when aggregating multiple tasks. Right now, if 1000
tasks with load_sum=1 and load_avg=0 are removed, the rq's load signal
is not impacted at all.
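The aggregation effect can be illustrated by putting the two propagation paths side by side (a standalone sketch assuming SCHED_CAPACITY_SHIFT=10 and per-task values chosen for the example, not kernel code):

```c
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10

/* Old path: estimate the removed load_sum from the removed runnable_avg. */
static long estimate_from_runnable(unsigned long removed_runnable,
				   unsigned long divider)
{
	return (long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT;
}

/* New path: accumulate se->avg.load_sum directly at dequeue time. */
static unsigned long accumulate_load_sum(unsigned int nr_tasks,
					 unsigned long per_task_load_sum)
{
	unsigned long removed_load_sum = 0;

	for (unsigned int i = 0; i < nr_tasks; i++)
		removed_load_sum += per_task_load_sum;

	return removed_load_sum;
}
```

1000 removed tasks with runnable_avg=0 and load_sum=1 propagate nothing through the runnable_avg-based estimate, but 1000 through the direct sum.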
On the other side, there might also be questions about the PELT windows
of all these tasks and the rq being aligned, cf.
https://lore.kernel.org/all/20170512171336.148578996@xxxxxxxxxxxxx/
c) Reduce one multiplication and shift in the hotpath, which
theoretically lowers overhead (though the impact is minor).
Signed-off-by: hupu <hupu.gm@xxxxxxxxx>
---
kernel/sched/fair.c | 8 +++++---
kernel/sched/sched.h | 1 +
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..b8cf3687804b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4561,7 +4561,8 @@ static void migrate_se_pelt_lag(struct sched_entity *se) {}
static inline int
update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
{
- unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
+ unsigned long removed_load_sum = 0, removed_load = 0;
+ unsigned long removed_util = 0, removed_runnable = 0;
struct sched_avg *sa = &cfs_rq->avg;
int decayed = 0;
@@ -4572,6 +4573,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
raw_spin_lock(&cfs_rq->removed.lock);
swap(cfs_rq->removed.util_avg, removed_util);
swap(cfs_rq->removed.load_avg, removed_load);
+ swap(cfs_rq->removed.load_sum, removed_load_sum);
swap(cfs_rq->removed.runnable_avg, removed_runnable);
cfs_rq->removed.nr = 0;
raw_spin_unlock(&cfs_rq->removed.lock);
@@ -4609,8 +4611,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
* removed_runnable is the unweighted version of removed_load so we
* can use it to estimate removed_load_sum.
*/
- add_tg_cfs_propagate(cfs_rq,
- -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
+ add_tg_cfs_propagate(cfs_rq, -(long)removed_load_sum);
decayed = 1;
}
@@ -4792,6 +4793,7 @@ static void remove_entity_load_avg(struct sched_entity *se)
++cfs_rq->removed.nr;
cfs_rq->removed.util_avg += se->avg.util_avg;
cfs_rq->removed.load_avg += se->avg.load_avg;
+ cfs_rq->removed.load_sum += se->avg.load_sum;
cfs_rq->removed.runnable_avg += se->avg.runnable_avg;
raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d104f7..659935a5c694 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -682,6 +682,7 @@ struct cfs_rq {
struct {
raw_spinlock_t lock ____cacheline_aligned;
int nr;
+ unsigned long load_sum;
unsigned long load_avg;
unsigned long util_avg;
unsigned long runnable_avg;